A dataframe is a specialization of an array (see Use Cases). As such, any TileDB API works natively for writing to and reading from a dataframe modeled as an array. However, Pandas is a popular Python library for in-memory dataframes, so TileDB offers optimized functionality for reading directly from an array into a Pandas dataframe. This How To guide describes this functionality.
Sections Create Dataframes and Write Dataframes describe how to ingest a dataframe into a 1D dense or an ND sparse array. This section covers how to read from the ingested dataframes directly into either a Pandas dataframe or an Arrow Table. Throughout the section, a dense dataframe is a dataframe modeled as a 1D dense array, and a sparse dataframe is a dataframe modeled as an ND sparse array.
Since a dataframe is an array, you can read its underlying schema in the same way as for any other array:
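A minimal sketch of inspecting the schema with TileDB-Py, assuming the array lives at a URI passed in as `uri`. The helper name `describe_schema` is illustrative and not part of TileDB. It relies on `tiledb.ArraySchema.load`; opening the array with `tiledb.open(uri)` and reading its `schema` property gives the same object.

```python
def describe_schema(uri):
    """Summarize the schema of the dataframe array stored at `uri`."""
    import tiledb  # TileDB-Py, imported lazily so the helper can be defined without it

    schema = tiledb.ArraySchema.load(uri)
    domain = schema.domain
    return {
        # True for a sparse dataframe, False for a dense one
        "sparse": schema.sparse,
        # Dimension names, in schema order
        "dimensions": [domain.dim(i).name for i in range(domain.ndim)],
        # Attribute (column) names, in schema order
        "attributes": [schema.attr(i).name for i in range(schema.nattr)],
    }
```

Printing the schema object itself (`print(schema)`) shows the full definition, including types and filters.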
Suppose you have ingested a CSV file into a 1D dense array.
To find out how many rows were ingested, you can take a look at the array non-empty domain:
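The following sketch derives a row count from the non-empty domain, assuming a dense dataframe whose rows were written contiguously along its single dimension. `count_rows` is a hypothetical helper name, and `uri` stands for wherever the array was ingested.

```python
def count_rows(uri):
    """Return the number of rows covered by the non-empty domain of a 1D dense array."""
    import tiledb  # TileDB-Py, imported lazily so the helper can be defined without it

    with tiledb.open(uri) as A:
        # One (first, last) pair per dimension; None if nothing has been written yet
        ned = A.nonempty_domain()
        if ned is None:
            return 0
        first, last = ned[0]
        # Both bounds are inclusive; cast to int to avoid unsigned NumPy arithmetic
        return int(last) - int(first) + 1
```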
To read data from an array into a Pandas dataframe, you can use the df operator:
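As a minimal sketch, the `df` indexer of an open TileDB-Py array can be used like this. The helper name `read_dataframe` and the `uri` argument are illustrative.

```python
def read_dataframe(uri):
    """Read the whole dataframe array at `uri` into a Pandas dataframe."""
    import tiledb  # TileDB-Py, imported lazily so the helper can be defined without it

    with tiledb.open(uri) as A:
        # `[:]` selects everything that has been written to the array
        return A.df[:]
```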
For dense arrays, this operator allows you to efficiently slice any subset of rows:
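A sketch of row slicing on a dense dataframe, under the assumption that rows are indexed by integer position. Note that the `df` indexer treats both ends of the range as inclusive, unlike ordinary Python slices. `read_rows` is a hypothetical helper name.

```python
def read_rows(uri, first, last):
    """Read rows `first` through `last` (both inclusive) from a dense dataframe array."""
    import tiledb  # TileDB-Py, imported lazily so the helper can be defined without it

    with tiledb.open(uri) as A:
        # Only the requested row range is fetched from storage
        return A.df[first:last]
```

For example, `read_rows(uri, 0, 9)` would return the first ten rows.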
TileDB is a columnar format and, therefore, allows you to efficiently subselect on columns / attributes as follows:
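Column subselection can be sketched with the `attrs` argument of `Array.query`, combined with the `df` indexer. The helper name `read_columns` and the column names passed to it are placeholders for whatever attributes your array defines.

```python
def read_columns(uri, columns, first, last):
    """Read only the given attributes for rows `first` through `last` (inclusive)."""
    import tiledb  # TileDB-Py, imported lazily so the helper can be defined without it

    with tiledb.open(uri) as A:
        # Attributes not listed in `attrs` are never read from storage
        return A.query(attrs=list(columns)).df[first:last]
```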
Suppose you have ingested a CSV file into a 2D sparse array.
This array allows for efficient slicing on the two dimensions as follows:
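A sketch of slicing a 2D sparse dataframe on both dimensions. The ranges are given as `(low, high)` pairs in the units of each dimension and are inclusive on both ends. `read_region` is a hypothetical helper, and the dimension order is assumed to match the array schema.

```python
def read_region(uri, first_dim_range, second_dim_range):
    """Read the cells of a 2D sparse dataframe that fall inside both inclusive ranges."""
    import tiledb  # TileDB-Py, imported lazily so the helper can be defined without it

    lo0, hi0 = first_dim_range
    lo1, hi1 = second_dim_range
    with tiledb.open(uri) as A:
        # One slice per dimension, in schema order
        return A.df[lo0:hi0, lo1:hi1]
```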
You can prevent the Pandas dataframe from materializing the index columns (which improves read performance) as follows:
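One way to sketch this is with the `index_col` argument of `Array.query`, on the assumption that passing `None` skips building a Pandas index from the dimensions, so the result carries a default `RangeIndex`. `read_without_index` is an illustrative helper name.

```python
def read_without_index(uri):
    """Read a dataframe array without setting the dimensions as the Pandas index."""
    import tiledb  # TileDB-Py, imported lazily so the helper can be defined without it

    with tiledb.open(uri) as A:
        # index_col=None overrides the index information saved at ingestion time
        return A.query(index_col=None).df[:]
```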
You can check the non-empty domain on the two dimensions as follows:
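A small sketch that pairs each dimension name with its non-empty range, so the result is easier to read than the raw tuple. The helper name `nonempty_domain_by_dim` is hypothetical.

```python
def nonempty_domain_by_dim(uri):
    """Map each dimension name to its inclusive (low, high) non-empty range."""
    import tiledb  # TileDB-Py, imported lazily so the helper can be defined without it

    with tiledb.open(uri) as A:
        # One (low, high) pair per dimension; None if the array is empty
        ned = A.nonempty_domain()
        if ned is None:
            return {}
        domain = A.schema.domain
        return {domain.dim(i).name: tuple(ned[i]) for i in range(domain.ndim)}
```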
Being a columnar format, TileDB allows you to efficiently subselect on attributes and dimensions as follows:
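Subselecting both dimensions and attributes can be sketched with the `dims` and `attrs` arguments of `Array.query`. The helper name `read_dims_and_attrs` is illustrative, and the names passed in must exist in your array schema.

```python
def read_dims_and_attrs(uri, dims, attrs):
    """Read only the named dimensions and attributes of a sparse dataframe array."""
    import tiledb  # TileDB-Py, imported lazily so the helper can be defined without it

    with tiledb.open(uri) as A:
        # Dimensions and attributes left out of the query are not returned
        return A.query(dims=list(dims), attrs=list(attrs)).df[:]
```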
If you are using Apache Arrow, TileDB can return dataframe results directly as Arrow Tables with zero-copy as follows:
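A sketch of requesting an Arrow Table instead of a Pandas dataframe, assuming `pyarrow` is installed alongside TileDB-Py. It uses the `return_arrow` argument of `Array.query`; `read_arrow_table` is a hypothetical helper name.

```python
def read_arrow_table(uri):
    """Read the dataframe array at `uri` and return the result as a pyarrow.Table."""
    import tiledb  # TileDB-Py, imported lazily so the helper can be defined without it

    with tiledb.open(uri) as A:
        # With return_arrow=True the df indexer hands back the Arrow Table directly
        return A.query(return_arrow=True).df[:]
```

If a Pandas dataframe is needed later, calling `to_pandas()` on the returned table performs the conversion.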
TileDB supports SQL via its integration with MariaDB. A simple example is shown below; for more details, see section Embedded SQL.
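The sketch below shows one way such a query might be issued from Python. It makes two assumptions: that the array is addressed in SQL by its URI as a backtick-quoted table name, and that `con` is a DB-API connection to a MariaDB instance with the TileDB storage engine (how to obtain one is covered in Embedded SQL). Both helper names are hypothetical.

```python
def build_select(array_uri, columns=None, limit=None):
    """Build a SELECT statement that addresses a TileDB array by its URI."""
    # Identifiers are interpolated as-is, so only pass trusted names
    cols = ", ".join(f"`{c}`" for c in columns) if columns else "*"
    sql = f"SELECT {cols} FROM `{array_uri}`"
    if limit is not None:
        sql += f" LIMIT {int(limit)}"
    return sql


def query_to_dataframe(con, array_uri, columns=None, limit=None):
    """Run the SELECT over an existing DB-API connection and return a Pandas dataframe."""
    import pandas as pd  # imported lazily; only needed when a query is actually run

    return pd.read_sql(build_select(array_uri, columns, limit), con=con)
```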