Read Dataframes

A dataframe is a specialization of an array (see Use Cases). As such, any TileDB API works natively for writing to and reading from a dataframe modeled as an array. However, Python Pandas has a popular offering for dataframes in main memory and, therefore, TileDB offers special optimized reading functionality to read directly from an array into a Pandas dataframe. This How To guide describes this functionality.

Sections Create Dataframes and Write Dataframes describe how to ingest a dataframe into a 1D dense or a ND sparse array. This section covers how to read from the ingested dataframes directly into either Pandas or Arrow Table. Throughout the section, we call dense dataframe a dataframe modeled as a 1D dense array, and sparse dataframe a dataframe modeled as a ND sparse array.

Read the Dataframe Schema

Since the dataframe is an array, you can read the underlying schema in the same manner as for arrays as follows:

A = tiledb.open("my_array", mode="r")
A.schema

Read From A Dense Dataframe

Suppose you have ingested a CSV file into a 1D dense array.

To find out how many rows were ingested, you can take a look at the array non-empty domain:

A = tiledb.open("my_array", mode="r")
A.nonempty_domain()
# Example ((0, 7667791),) 

To read data from an array into a Pandas dataframe, you can use the df operator:

A.df[:]

For dense arrays, this operator allows you to efficiently slice any subset of rows:

A.df[11:20]

TileDB is a columnar format and, therefore, allows you to efficiently subselect on columns / attributes as follows:

A.query(attrs=['attr1']).df[:]

Read From A Sparse Dataframe

Suppose you have ingested a CSV file into a 2D sparse array.

This array allows for efficient slicing on the two dimensions as follows:

# If both dimensions are integers
A.df[1:10, 1:100] 

# Or, natively on the datatype of the dimensions (e.g., datetime)
A.df[slice(np.datetime64("2019-01-01 00:00:00"), np.datetime64("2019-01-02 23:59:59")), 0:10]

You can prevent the Pandas dataframe from materializing the index columns (which will boost up reading performance) as follows:

A.query(index_col=[]).df[1:100, 0:10]

You can check the non-empty domain on the two dimensions as follows:

A.nonempty_domain()

Being a columnar format, TileDB allows you to efficiently subselect on attributes and dimensions as follows:

A.query(attrs=['attr1'], dims=['dim1']).df[:]

Read into Arrow Tables

If you are using Apache Arrow, TileDB can return dataframe results directly as Arrow Tables with zero-copy as follows:

A.query(return_arrow=True).df[:]

Read Using SQL

TileDB supports SQL via its integration with MariaDB. A simple example is shown below, but for more details read section Embedded SQL.

import tiledb.sql, pandas

db = tiledb.sql.connect()
pandas.read_sql(sql="select * from `<array_uri>`", con=db)

Last updated