Handling Dataframes

Dense Array vs. Dataframe

The figure below shows the relationship between a dense array and a dataframe. Each array attribute can be viewed as a dataframe column. The important difference is that a dataframe has no concept of a global order, i.e., a mapping from the multi-dimensional space to a 1-dimensional space. Therefore, unlike dense arrays, dataframes cannot be sliced multi-dimensionally unless we materialize the cell coordinates in separate dataframe columns and build additional spatial indexes on top. TileDB requires neither materialized coordinates nor extra spatial indexing to slice multi-dimensional dense arrays efficiently (see Reading for more details).

Dense 2D array with 2 attributes (a1, a2) vs. dataframe with 2 columns
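
The difference can be made concrete with a small sketch in plain Python. It is a conceptual illustration only, not TileDB code: the 4x4 shape, the row-major order, and every name in it (cell_position, slice_dense, as_dataframe_with_coords) are assumptions chosen for the example. The dense version locates cells purely through the global order, while the dataframe version has to carry the coordinates as extra columns before it can be sliced on them.

```python
from itertools import product

# Hypothetical 4x4 dense array with two attributes, each stored as a flat
# list in row-major order (the shape and the order are assumptions).
ROWS, COLS = 4, 4
a1 = list(range(ROWS * COLS))                          # 0..15
a2 = [chr(ord("a") + i) for i in range(ROWS * COLS)]   # 'a'..'p'


def cell_position(row, col, cols=COLS):
    """Row-major global order: map 2D coordinates to a 1D position."""
    return row * cols + col


def slice_dense(row_range, col_range):
    """Slice [r0, r1] x [c0, c1] (inclusive) using only the global order."""
    (r0, r1), (c0, c1) = row_range, col_range
    return [
        (a1[cell_position(r, c)], a2[cell_position(r, c)])
        for r, c in product(range(r0, r1 + 1), range(c0, c1 + 1))
    ]


def as_dataframe_with_coords():
    """Dataframe-style records: coordinates are materialized as columns."""
    return [
        {
            "row": r,
            "col": c,
            "a1": a1[cell_position(r, c)],
            "a2": a2[cell_position(r, c)],
        }
        for r, c in product(range(ROWS), range(COLS))
    ]


# The dense slice touches only the 4 requested cells.
dense_result = slice_dense((1, 2), (1, 2))

# Without an extra index, the dataframe slice filters all 16 records.
df_result = [
    (rec["a1"], rec["a2"])
    for rec in as_dataframe_with_coords()
    if 1 <= rec["row"] <= 2 and 1 <= rec["col"] <= 2
]
```

Both approaches return the same four cells, but the dataframe stores two additional columns and, without a spatial index, scans every record to answer the query.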

Sparse Array vs. Dataframe

The figure below shows the relationship between a sparse array and a dataframe. Each array attribute and dimension can be viewed as a dataframe column. To enable efficient multi-dimensional (i.e., multi-column) slicing, one must build spatial indexes on top of a dataframe. TileDB has fast, lightweight spatial indexing already built into its library. It also supports dimensions with different datatypes, including strings. Therefore, TileDB can capture a dataframe in its full generality.

Sparse 2D array with 2 attributes (a1, a2) vs. dataframe with 4 columns
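
As a rough illustration of why a spatial index helps multi-column slicing, the sketch below groups sparse cells into small tiles and records a bounding box per tile, so a range query can skip tiles that cannot contain results. This is a simplified stand-in written in plain Python, not TileDB's actual indexing; the tile capacity, the sort order, and all names (build_tiles, slice_sparse) are assumptions made for the example.

```python
# Hypothetical sparse 2D cells: d1 and d2 are dimensions, a1 and a2 are
# attributes. As a dataframe, this is simply a table with 4 columns.
cells = [
    {"d1": 1, "d2": 1, "a1": 10, "a2": "a"},
    {"d1": 1, "d2": 2, "a1": 11, "a2": "b"},
    {"d1": 2, "d2": 4, "a1": 12, "a2": "c"},
    {"d1": 3, "d2": 3, "a1": 13, "a2": "d"},
    {"d1": 4, "d2": 1, "a1": 14, "a2": "e"},
    {"d1": 4, "d2": 4, "a1": 15, "a2": "f"},
]

TILE_CAPACITY = 2  # assumed number of cells per tile


def build_tiles(rows, capacity=TILE_CAPACITY):
    """Sort cells by (d1, d2), group them into tiles, and record each
    tile's bounding box as (min_d1, max_d1, min_d2, max_d2)."""
    ordered = sorted(rows, key=lambda r: (r["d1"], r["d2"]))
    tiles = []
    for i in range(0, len(ordered), capacity):
        chunk = ordered[i:i + capacity]
        box = (
            min(r["d1"] for r in chunk), max(r["d1"] for r in chunk),
            min(r["d2"] for r in chunk), max(r["d2"] for r in chunk),
        )
        tiles.append((box, chunk))
    return tiles


def slice_sparse(tiles, d1_range, d2_range):
    """Return the matching cells and the number of tiles actually scanned."""
    (lo1, hi1), (lo2, hi2) = d1_range, d2_range
    hits, scanned = [], 0
    for (min1, max1, min2, max2), chunk in tiles:
        # Prune tiles whose bounding box does not intersect the query.
        if max1 < lo1 or min1 > hi1 or max2 < lo2 or min2 > hi2:
            continue
        scanned += 1
        hits.extend(
            r for r in chunk
            if lo1 <= r["d1"] <= hi1 and lo2 <= r["d2"] <= hi2
        )
    return hits, scanned


tiles = build_tiles(cells)
hits, scanned = slice_sparse(tiles, (3, 4), (3, 4))
```

Here the query on d1 in [3, 4] and d2 in [3, 4] inspects two of the three tiles; the first tile is discarded from its bounding box alone, without reading any of its cells.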

Why Store Dataframes in TileDB?

TileDB adopts an efficient columnar layout similar to that of formats like Parquet, and it integrates efficiently with the data science ecosystem. However, it also introduces several important features that cannot be found in legacy columnar formats, such as:

  • Multi-column slicing: By defining any subset of the columns as dimensions, and thanks to TileDB's tiling flexibility, you can prune more data during multi-column slicing, which leads to better overall read performance. Essentially, a TileDB array acts as a primary multi-dimensional index on the columns selected as the array dimensions.

  • Data updates and versioning: TileDB offers rapid, parallel, cloud-optimized updates. All the update logic is pushed into the storage engine and is completely transparent to the user. TileDB also exposes useful time-traveling functionality, such as reading arrays at time snapshots, which effectively provides data versioning built into a single embeddable library.

  • Partitioning: TileDB enables balanced partitioning, without limiting each partition to single column values. Moreover, we will soon expose API functions for dynamically selecting different partitioning schemes (e.g., on different subsets of columns with different orders), without the need for reorganizing/rewriting the array.

  • Sorting: TileDB handles all sorting internally, using multi-threading.
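
The time-traveling idea mentioned under "Data updates and versioning" can be pictured with a minimal sketch in plain Python. It is a conceptual model only and does not use the TileDB API: it assumes that every write is kept as an immutable, timestamped batch and that a read merges all batches up to a chosen timestamp. The names fragments, write, and read, as well as the integer timestamps, are hypothetical.

```python
# Each write is kept as an immutable, timestamped batch (assumed model).
fragments = []


def write(timestamp, rows):
    """Append a batch of rows keyed by (d1, d2); nothing is changed in place."""
    fragments.append((timestamp, dict(rows)))


def read(as_of=None):
    """Merge batches in timestamp order, keeping the newest value per key.
    Batches written after `as_of` are ignored, giving a snapshot in time."""
    snapshot = {}
    for timestamp, rows in sorted(fragments, key=lambda f: f[0]):
        if as_of is not None and timestamp > as_of:
            break
        snapshot.update(rows)
    return snapshot


write(1, {(1, 1): "a", (1, 2): "b"})
write(2, {(1, 2): "B", (2, 2): "c"})  # updates one cell and adds another

latest = read()        # reflects both writes
at_t1 = read(as_of=1)  # reflects only the first write
```

Because earlier batches are never modified, reading as of timestamp 1 still returns the original value of cell (1, 2), while the default read returns the updated one.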