Read Dataframes

A dataframe is a specialization of an array (see Use Cases). As such, any TileDB API works natively for writing to and reading from a dataframe modeled as an array. However, Python Pandas has a popular offering for dataframes in main memory and, therefore, TileDB offers special optimized reading functionality to read directly from an array into a Pandas dataframe. This How To guide describes this functionality.

Sections Create Dataframes and Write Dataframes describe how to ingest a dataframe into a 1D dense or a ND sparse array. This section covers how to read from the ingested dataframes directly into either Pandas or Arrow Table. Throughout the section, we call dense dataframe a dataframe modeled as a 1D dense array, and sparse dataframe a dataframe modeled as a ND sparse array.

Read the Dataframe Schema

Since the dataframe is an array, you can read the underlying schema in the same manner as for arrays as follows:

A = tiledb.open("my_array", mode="r")
A.schema

Read From A Dense Dataframe

Suppose you have ingested a CSV file into a 1D dense array.

To find out how many rows were ingested, you can take a look at the array non-empty domain:

A = tiledb.open("my_array", mode="r")
A.nonempty_domain()
# Example ((0, 7667791),)

To read data from an array into a Pandas dataframe, you can use the df operator:

A.df[:]

For dense arrays, this operator allows you to efficiently slice any subset of rows:

A.df[11:20]

TileDB is a columnar format and, therefore, allows you to efficiently subselect on columns / attributes as follows:

A.query(attrs=['attr1']).df[:]

Read From A Sparse Dataframe

Suppose you have ingested a CSV file into a 2D sparse array.

This array allows for efficient slicing on the two dimensions as follows:

# If both dimensions are integers
A.df[1:10, 1:100] 

# Or, natively on the datatype of the dimensions (e.g., datetime)
A.df[slice(np.datetime64("2019-01-01 00:00:00"), np.datetime64("2019-01-02 23:59:59")), 0:10]

You can prevent the Pandas dataframe from materializing the index columns (which will boost up reading performance) as follows:

A.query(index_col=[]).df[1:100, 0:10]

You can check the non-empty domain on the two dimensions as follows:

A.nonempty_domain()

Being a columnar format, TileDB allows you to efficiently subselect on attributes and dimensions as follows:

A.query(attrs=['attr1'], dims=['dim1']).df[:]

Read into Arrow Tables

If you are using Apache Arrow, TileDB can return dataframe results directly as Arrow Tables with zero-copy as follows:

A.query(return_arrow=True).df[:]

Read Using SQL

TileDB supports SQL via its integration with MariaDB. A simple example is shown below, but for more details read section Embedded SQL.

import tiledb.sql, pandas

db = tiledb.sql.connect()
pandas.read_sql(sql="select * from `<array_uri>`", con=db)

PreviousWrite Dataframe From CSV NextGroups

Last updated 11 months ago

Was this helpful?

Read Dataframes

Read the Dataframe Schema

Since the dataframe is an array, you can read the underlying schema in the same manner as for arrays as follows:

A = tiledb.open("my_array", mode="r")
A.schema

Read From A Dense Dataframe

Suppose you have ingested a CSV file into a 1D dense array.

To find out how many rows were ingested, you can take a look at the array non-empty domain:

A = tiledb.open("my_array", mode="r")
A.nonempty_domain()
# Example ((0, 7667791),)

To read data from an array into a Pandas dataframe, you can use the df operator:

A.df[:]

For dense arrays, this operator allows you to efficiently slice any subset of rows:

A.df[11:20]

TileDB is a columnar format and, therefore, allows you to efficiently subselect on columns / attributes as follows:

A.query(attrs=['attr1']).df[:]

Read From A Sparse Dataframe

Suppose you have ingested a CSV file into a 2D sparse array.

This array allows for efficient slicing on the two dimensions as follows:

# If both dimensions are integers
A.df[1:10, 1:100] 

# Or, natively on the datatype of the dimensions (e.g., datetime)
A.df[slice(np.datetime64("2019-01-01 00:00:00"), np.datetime64("2019-01-02 23:59:59")), 0:10]

You can prevent the Pandas dataframe from materializing the index columns (which will boost up reading performance) as follows:

A.query(index_col=[]).df[1:100, 0:10]

You can check the non-empty domain on the two dimensions as follows:

A.nonempty_domain()

Being a columnar format, TileDB allows you to efficiently subselect on attributes and dimensions as follows:

A.query(attrs=['attr1'], dims=['dim1']).df[:]

Read into Arrow Tables

If you are using Apache Arrow, TileDB can return dataframe results directly as Arrow Tables with zero-copy as follows:

A.query(return_arrow=True).df[:]

Read Using SQL

TileDB supports SQL via its integration with MariaDB. A simple example is shown below, but for more details read section Embedded SQL.

import tiledb.sql, pandas

db = tiledb.sql.connect()
pandas.read_sql(sql="select * from `<array_uri>`", con=db)

PreviousWrite Dataframe From CSV NextGroups

Last updated 11 months ago

Was this helpful?