TileDB stores data as dense or sparse multi-dimensional arrays. The figure below demonstrates the data model.
An array (either dense or sparse) consists of:
Dimensions: The dimensions (e.g., d2 in the figure), along with their domains, orient a multi-dimensional space of cells. A tuple of dimension values, e.g., (4,4), is called the cell coordinates. There can be any number of dimensions in an array.
Attributes: In each cell in the logical layout, TileDB stores a tuple comprised of any number of attributes, each of any data type (fixed- or variable-sized). All cells must store tuples with the same set and type of attributes. In the figure, cell (4,4) stores an integer value for attribute a1 and a string value for a2, and similarly all other cells may have values for these attributes.
Array metadata: This is (typically small) key-value data associated with an array.
Axes labels: These are practically other (dense or sparse) arrays attached to each dimension, which facilitate slicing multi-dimensional ranges on conditions other than array positional indices.
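The components listed above can be pictured with a small, library-independent sketch. The class names (Dimension, Attribute, ArraySchema), the validate_cell helper, and the [1, 4] domains are hypothetical illustrations chosen to mirror cell (4,4) from the text; none of this is the TileDB API.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Dimension:
    name: str
    domain: tuple  # inclusive (low, high) bounds; assumed values below


@dataclass(frozen=True)
class Attribute:
    name: str
    dtype: type


@dataclass
class ArraySchema:
    dimensions: list
    attributes: list
    sparse: bool = False
    metadata: dict = field(default_factory=dict)  # small key-value data

    def validate_cell(self, coords, values):
        """Check that coords lie in the domains and values match the attributes."""
        if len(coords) != len(self.dimensions):
            return False
        for c, dim in zip(coords, self.dimensions):
            lo, hi = dim.domain
            if not (lo <= c <= hi):
                return False
        # Every cell must carry the same set and type of attributes.
        if set(values) != {a.name for a in self.attributes}:
            return False
        return all(isinstance(values[a.name], a.dtype) for a in self.attributes)


schema = ArraySchema(
    dimensions=[Dimension("d1", (1, 4)), Dimension("d2", (1, 4))],
    attributes=[Attribute("a1", int), Attribute("a2", str)],
    metadata={"description": "toy example"},
)
ok = schema.validate_cell((4, 4), {"a1": 7, "a2": "abc"})
bad = schema.validate_cell((5, 4), {"a1": 7, "a2": "abc"})
```

Here ok is true because the coordinates fall inside both domains and the tuple has an integer a1 and a string a2, while bad is false because 5 lies outside the assumed domain of the first dimension.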
TileDB handles both dense and sparse arrays in a unified way, but there are a few differences between the two to be aware of:
Cells in sparse arrays may be empty, whereas in dense arrays all cells must have a value (in dense arrays, “empty” cells must store a zero, a special fill value, or be marked as null). Typically, empty cells in sparse arrays are not materialized in the persistent format.
Dimensions in dense arrays must be homogeneous (i.e., must have the same data type) and support only integer data types. Dimensions in sparse arrays may be heterogeneous (i.e., may have different data types), and they support any data type, even real or string.
In sparse arrays, there may be multiplicities of cells (i.e., there may be more than one cell with the same coordinates), whereas this is not possible in dense arrays.
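The first and last differences above can be illustrated with plain Python containers. This is only a conceptual sketch: the fill value of 0, the 4x4 shape, and the stored values are assumptions, and the layout says nothing about how TileDB actually persists data.

```python
FILL = 0  # assumed fill value for "empty" dense cells

# Dense 4x4 array: every cell is materialized; empty ones hold FILL.
dense = [[FILL] * 4 for _ in range(4)]
dense[3][3] = 7  # cell (4,4) when coordinates are 1-based

# Sparse array: only non-empty cells are kept, as (coords, value) records.
# A list (rather than a dict) permits several cells with the same coordinates.
sparse = [((1, 1), 5), ((4, 4), 7), ((4, 4), 9)]

dense_cells_stored = sum(len(row) for row in dense)
sparse_cells_stored = len(sparse)
values_at_4_4 = [value for coords, value in sparse if coords == (4, 4)]
```

The dense version stores all 16 cells regardless of content, whereas the sparse version stores three records, two of which share the coordinates (4,4).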
Any system that implements any variation of the above array model strives primarily to do one thing: slice across the dimensions very (very!) fast. A slice is essentially a range (or a multi-range) query across each of the dimensions, which defines a hyper-subspace of the array (noting that equality queries are unary range queries) and retrieves the cell values therein. Slicing is used as a means of filtering the data, but also to implement out-of-core algorithms, which operate on data that may not fit in main memory and therefore need to be fetched in blocks (for example, a lot of Linear Algebra algorithms work in such a block-based fashion, such as SUMMA). No matter what the application, slicing data from some storage backend into main memory is arguably the predominant operation. The figure below shows some slicing examples. The bottom line is that if your workloads involve frequent slicing on one or more “fields” of your dataset, those fields qualify as dimensions and the rest as attributes.
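To make the notion of a slice concrete, here is a minimal sketch of a range and multi-range query over a handful of cells. The function name slice_cells and the sample cells are hypothetical, and the linear scan is for illustration only; a real array engine would rely on its storage layout and indexing rather than testing every cell.

```python
def slice_cells(cells, ranges):
    """Return the cells whose coordinates fall in at least one range per dimension.

    cells  : iterable of (coords, value) pairs
    ranges : one list of inclusive (low, high) pairs per dimension
    """
    result = []
    for coords, value in cells:
        if all(
            any(lo <= c <= hi for lo, hi in dim_ranges)
            for c, dim_ranges in zip(coords, ranges)
        ):
            result.append((coords, value))
    return result


cells = [((1, 1), "a"), ((2, 3), "b"), ((3, 2), "c"), ((4, 4), "d")]

# Single range per dimension: rows 2-4 and columns 2-4.
box = slice_cells(cells, [[(2, 4)], [(2, 4)]])

# Multi-range on the first dimension, equality (a unary range) on the second.
multi = slice_cells(cells, [[(1, 1), (4, 4)], [(4, 4)]])
```

The first query selects the cells at (2,3), (3,2), and (4,4); the second selects only (4,4), since (1,1) matches the first dimension but not the equality condition on the second.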
Some datasets are inherently very sparse, i.e., contain a lot of empty (or zero) cells. For instance, genomic population variant datasets can be modeled as a 2D array where one dimension is the genomic samples (practically hundreds of thousands or millions, but theoretically unbounded as any number of samples can get added) and the other is the genomic positions (~3 billion for human genomes). Such datasets are extremely sparse (e.g., only 0.1% of the cells are non-empty) and it would be overkill to fill the empty cells of this enormous matrix with some special value. In addition to avoiding the enormous storage overhead (even with compression), running Linear Algebra operations on natively sparse arrays (which do not store empty cells) may be significantly faster than on dense arrays. As an example, see this recent work on integrating TileDB with Bioconductor’s DelayedArray package, where “TileDB's fast reads and writes of sparse data are culminating in a PCA step that is nearly 3x as fast as compared to using HDF5”.
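A quick back-of-the-envelope calculation shows the scale involved. The position count and the 0.1% density come from the paragraph above; the figure of 100,000 samples is an assumption taken from the low end of "hundreds of thousands".

```python
samples = 100_000           # assumption: low end of "hundreds of thousands"
positions = 3_000_000_000   # ~3 billion genomic positions (from the text)
non_empty_per_thousand = 1  # 0.1% of cells are non-empty (from the text)

total_cells = samples * positions
non_empty_cells = total_cells * non_empty_per_thousand // 1000
dense_to_sparse_ratio = total_cells // non_empty_cells
```

Under these assumptions a dense layout would have to account for 3 x 10^14 cells, while only 3 x 10^11 of them carry data. A sparse format also has to store coordinates for each cell, so the real storage saving is smaller than this raw cell-count ratio of 1000.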
Another benefit of sparse arrays is the fact that they can support dimensions with real and string values. The reason is that any dataset with real or string values is practically sparse as it can only have values from a tiny portion of the entire real or string domain. This enables sparse arrays to capture applications with point data (e.g., AIS or LiDAR), i.e., 2D or 3D points/cells with real coordinates representing physical space. It also enables them to model key-value stores, where one or more key strings define the coordinates of the cell storing one or more values. Such applications can enjoy the rapid slicing performance that a sparse array system is designed to natively offer.
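The key-value case can be sketched as a 1D sparse array whose single dimension is a string. The class below is a hypothetical toy built on Python's bisect module, not TileDB code; it keeps keys sorted so that both point lookups and range slices reduce to binary searches.

```python
import bisect


class StringKeyedSparseArray:
    """Toy 1D sparse array with a string dimension, kept sorted by key."""

    def __init__(self):
        self._keys = []
        self._values = []

    def write(self, key, value):
        # Insert while preserving sort order; duplicate keys are allowed.
        i = bisect.bisect_right(self._keys, key)
        self._keys.insert(i, key)
        self._values.insert(i, value)

    def slice(self, low, high):
        """Return (key, value) pairs with low <= key <= high."""
        start = bisect.bisect_left(self._keys, low)
        stop = bisect.bisect_right(self._keys, high)
        return list(zip(self._keys[start:stop], self._values[start:stop]))


kv = StringKeyedSparseArray()
for k, v in [("pear", 3), ("apple", 1), ("fig", 2), ("kiwi", 4)]:
    kv.write(k, v)

point_lookup = kv.slice("fig", "fig")  # equality as a unary range
range_lookup = kv.slice("b", "l")      # all keys between "b" and "l"
```

Only the keys that were actually written occupy space, which is why a string (or real) domain is workable for sparse arrays even though the domain itself is effectively unbounded.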
Dense or sparse arrays can efficiently and flexibly model any tabular data as dataframes. A dataframe is a collection of columns, where each column stores values of the same data type and the columns may have different data types. Depending on how you wish to efficiently slice a dataframe, you can choose to model it with either a 1D dense or an ND sparse array. For instance, if you’d like to slice based on row positions (e.g., for out-of-core operations), then you need to define an array with a single (non-materialized) dimension having a large integer domain (e.g., [0, MAX_UINT64]), and one attribute per column. On the other hand, if you’d like to efficiently slice based on ranges on a subset of columns, then you should define an ND sparse array, where the N dimensions are the N columns you need fast slicing on, specifying the rest of the columns as attributes. This is depicted in the figure below. Generic dataframe modeling with sparse arrays was made possible in TileDB with the recent 2.0 release, which introduced support for heterogeneous and string dimensions.
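The two modeling choices can be contrasted on a tiny table. The column names (city, year, sales) and the values are made up for illustration, and plain Python containers stand in for the arrays; this is a conceptual sketch rather than TileDB usage.

```python
# A small dataframe as a list of rows (hypothetical columns and values).
rows = [
    {"city": "Athens", "year": 2019, "sales": 10.5},
    {"city": "Boston", "year": 2019, "sales": 7.0},
    {"city": "Athens", "year": 2020, "sales": 12.0},
]

# Option 1: 1D dense array. The row position is the single dimension and
# every column becomes an attribute, so slicing is by row range.
dense_1d = {col: [r[col] for r in rows] for col in rows[0]}
rows_0_to_1 = {col: vals[0:2] for col, vals in dense_1d.items()}

# Option 2: 2D sparse array. "city" and "year" become (heterogeneous)
# dimensions and "sales" remains an attribute, so slicing is by value.
sparse_2d = [((r["city"], r["year"]), {"sales": r["sales"]}) for r in rows]
athens_2020 = [
    attrs
    for (city, year), attrs in sparse_2d
    if city == "Athens" and 2020 <= year <= 2020
]
```

The dense form suits block-wise scans over row positions, while the sparse form suits queries that filter on the chosen columns; which columns to promote to dimensions depends on the slicing the workload needs.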
The applications that can benefit from multi-dimensional arrays are truly endless:
Any kind of tabular data as 1D dense vectors or ND sparse arrays
Any time series data as 1D or ND (dense or sparse) arrays
Any 2D (e.g., AIS) or 3D (e.g., LiDAR) point data
Population genomic variants (VCF collections) as 2D sparse arrays
Single-cell transcriptomics data as 2D dense or sparse arrays
Any large dense or sparse matrices used in Linear Algebra applications
Weather data (e.g., coming in NetCDF form) as labeled ND dense arrays
Satellite imaging (e.g., SAR) as 2D dense arrays or 3D temporal image stacks
Biomedical imaging as ND dense arrays
Audio (1D) or video (3D) applications
Oceanic data observations
Graphs as sparse adjacency matrices