Dataframes do not comprise a single use case, but rather a category of use cases. They are the first class citizens of relational databases, and are frequently used in Data Science applications in Pandas, R, Spark and Dask. There is a natural mapping between a dataframe and a multi-dimensional sparse array. Therefore, TileDB is ideal for managing dataframes with fast N-dimensional slicing and built-in update and partitioning support.
A dataframe models tabular data that are potentially heterogeneous. Specifically, it consists of a set of rows and columns, where each column has a name and a datatype, and all the values of all rows across a column have the same datatype. Relational databases and Spark use SQL to query dataframes, whereas R, Pandas and Dask offer their own APIs.
Apache Parquet (now part of Apache Arrow) is arguably the most popular dataframe format. It adopts a columnar layout, in the sense that values of each column are grouped together in order to improve locality and compressibility. It is widely adopted and efficiently integrated with a lot of software, including Dask and Spark.
The main problem with Parquet is the fact that it is a format specification, rather than a full-fledged and stand-alone storage engine. As a result, several important storage features are delegated to the higher-level applications. This means that every software that wishes to integrate with Parquet must implement those missing features.
We summarize the main issues with Parquet below:
Data updates: Parquet does not support updates. Most applications create separate timestamped Parquet files for batches of updates. Writing the new files as well as querying the dataframe in the presence of updates is delegated to the higher-level applications.
Partitioning: Parallel computing software like Spark and Dask achieve scalability and performance by working on dataframe partitions in parallel. A typical way to partition Parquet dataframes is to create Partitioned Datasets, i.e., collections of Parquet files grouped into directory hierarchies in the storage backend. Each subdirectory is a partition that may contain other partitions. One problem is that each partition is created on a single column value, not a range of values. Another problem is that once the partitioning has been created, it cannot be altered, as that would mean rearranging and rewriting the entire dataset. Finally, all the partition creation and read logic is pushed to the application.
Sorting: Ordering the dataframe values on one or more columns can significantly improve compression on those columns when stored on disk, due to Parquet's columnar layout and the fact that adjacent values will be "similar" and, thus, better compressible. It also leads to improved read performance, because Parquet chunks the data into pages and stores the min/max values in each page, thus enabling skipping pages that do not satisfy the query. Unfortunately, the higher-level application is responsible for sorting the dataframes.
Multi-column slicing: When using Parquet from a higher-level application (e.g., Spark), you typically sort a dataframe only on one column. This is quite limiting when you need to slice on more than one columns, as a single column may not be selective enough. As a result, you end up fetching and decompressing more Parquet pages than necessary.
A dataframe can be naturally represented as a ND sparse array, where N is the number of columns you would like your data to be sorted in. More specifically, you can define the columns you most frequently slice on as dimensions, and the the rest of the columns as attributes. Defining the space tiling and tile/cell order appropriately, you can optimize the effectiveness of multi-dimensional slicing and partitioning in your application.
Storing dataframes as ND sparse arrays in TileDB you can enjoy the benefits of Parquet without its limitations. Specifically, TileDB adopts a similar effective columnar layout to Parquet and integrates efficiently with the Data Science ecosystem, but it also solves Parquet's problems as follows:
Data updates: TileDB offers rapid, parallel, cloud-optimized updates. All the update logic is pushed into the storage engine and is completely transparent to the user. TileDB also exposes useful time traveling functionality, such as reading arrays at time snapshots.
Partitioning: TileDB enables balanced partitioning, without limiting each partition to single column values. Moreover, we will soon expose API functions for dynamically selecting different partitioning schemes (e.g., on different subsets of columns with different orders), without the need for reorganizing/rewriting the array.
Sorting: All sorting is taken care of by TileDB internally with multi-threading.
Multi-column slicing: By defining a subset of the columns as dimensions and due to TileDB's tiling flexibility, you can increase the pruning effectiveness of multi-column slicing, thus leading to better overall read performance.