Multi-dimensional arrays are first-class citizens in TileDB. An array is defined by specifying any number of dimensions, each with its own datatype and domain. The figure below shows an example of a 2D dense and a 2D sparse array, where the dimensions are d1 and d2, each with domain [1,4]. A combination of domain values, one per dimension, identifies an array cell; these values are called the coordinates of the cell. Each cell can store any number of attribute values, each with its own datatype, but all cells store values for the same attributes. Together, the dimensions and attributes of the array define its schema, similar to the way columns define the schema of a database table or dataframe.
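The array model described above can be sketched in a few lines of plain Python. This is only a conceptual illustration, not the TileDB API: the classes `Dim`, `Attr`, and `Schema`, the helper `write_cell`, the first dimension name `d1`, and the attribute names `a1` and `a2` are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dim:
    name: str
    domain: tuple  # inclusive (low, high) bounds


@dataclass(frozen=True)
class Attr:
    name: str
    dtype: type


@dataclass(frozen=True)
class Schema:
    dims: tuple
    attrs: tuple

    def validate(self, coords, values):
        """Raise ValueError unless the cell fits the schema."""
        if len(coords) != len(self.dims):
            raise ValueError("need exactly one coordinate per dimension")
        for coord, dim in zip(coords, self.dims):
            low, high = dim.domain
            if not low <= coord <= high:
                raise ValueError(f"{dim.name}={coord} is outside {dim.domain}")
        if set(values) != {attr.name for attr in self.attrs}:
            raise ValueError("a cell must store a value for every attribute")
        for attr in self.attrs:
            if not isinstance(values[attr.name], attr.dtype):
                raise ValueError(f"{attr.name} must be of type {attr.dtype.__name__}")


# A 2D schema matching the text: two dimensions, each with domain [1,4].
# The attribute names "a1" and "a2" are made up for this sketch.
schema = Schema(
    dims=(Dim("d1", (1, 4)), Dim("d2", (1, 4))),
    attrs=(Attr("a1", int), Attr("a2", str)),
)

cells = {}  # coordinates -> attribute values; only non-empty cells are kept


def write_cell(coords, values):
    """Store one cell after checking it against the schema."""
    schema.validate(coords, values)
    cells[coords] = values


write_cell((1, 2), {"a1": 7, "a2": "x"})
write_cell((4, 4), {"a1": 9, "a2": "y"})
```

Keeping only the non-empty cells in a dictionary mirrors the sparse case; a dense array would instead hold a value for every coordinate in the domain.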
There are three main differences between a dense and a sparse array:
A dense array is used when the majority of the cells are non-empty, whereas a sparse array is used when the majority of the cells are empty.
The dimensions of a dense array must have the same datatype, whereas the dimensions of a sparse array may have different datatypes.
Every cell in a dense array is uniquely identified by its coordinates, whereas a sparse array can permit multiplicities, i.e., cells with the same coordinates but potentially different attribute values, as well as real (
Each dimension in the array can also come with axis labels, as shown in the figure above, allowing for useful annotations. Moreover, each array may contain arbitrary metadata, which are typically key-value pairs carrying extra information about the array.
Care must be taken when deciding what qualifies as a dimension and what as an attribute. Typically, you want to define as a dimension data that you want to slice very efficiently with range queries. This is because array storage engines are architected specifically to make such queries very fast.
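The trade-off can be illustrated with a toy example in plain Python. It assumes a hypothetical 1D sparse array whose dimension is a timestamp and whose single attribute is a temperature; the data and the function names are invented, and the binary search merely stands in for whatever indexing a real array engine uses.

```python
from bisect import bisect_left, bisect_right

# Hypothetical cells, kept sorted by the dimension value (timestamp).
coords = [3, 8, 15, 16, 23, 42, 57, 61]
temps = [20.5, 21.0, 19.5, 22.0, 18.0, 25.5, 21.0, 19.0]


def slice_by_dimension(low, high):
    """Cells with low <= timestamp <= high, located with binary search."""
    start = bisect_left(coords, low)
    stop = bisect_right(coords, high)
    return list(zip(coords[start:stop], temps[start:stop]))


def filter_by_attribute(low, high):
    """Cells with low <= temperature <= high; every cell must be inspected."""
    return [(c, t) for c, t in zip(coords, temps) if low <= t <= high]
```

Slicing on the dimension touches one contiguous run of cells, whereas filtering on the attribute scans the whole array. If range queries on temperature were the common case, temperature would be the better candidate for a dimension.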
Although arrays can be multi-dimensional, all storage backends are one-dimensional. In other words, in order to store any data on any filesystem, you need to convert it into a serial byte stream. This is where all the sophistication of TileDB lies: serializing the data so that its multi-dimensional locality is preserved when querying hyper-rectangular subarrays. Such queries are faster when the requested data lie close to each other on the storage medium, because the data can then be retrieved with efficient IO (e.g., with fast scans instead of slow random accesses).
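To make the locality problem concrete, the following sketch serializes a 4x4 array in plain row-major order and counts how many contiguous runs a subarray query has to read. Row-major order is used here only as the simplest possible serialization, not as a description of how TileDB lays out data, and all function names are hypothetical.

```python
def row_major_offset(coords, domain):
    """Map multi-dimensional coordinates to a position in the 1D cell stream."""
    offset = 0
    for coord, (low, high) in zip(coords, domain):
        offset = offset * (high - low + 1) + (coord - low)
    return offset


def subarray_offsets(ranges, domain):
    """Sorted stream positions of all cells in a 2D subarray (inclusive ranges)."""
    (r0, r1), (c0, c1) = ranges
    return sorted(
        row_major_offset((r, c), domain)
        for r in range(r0, r1 + 1)
        for c in range(c0, c1 + 1)
    )


def contiguous_runs(offsets):
    """Group sorted offsets into (start, length) runs; each run is one sequential read."""
    runs = []
    for off in offsets:
        if runs and off == runs[-1][0] + runs[-1][1]:
            runs[-1] = (runs[-1][0], runs[-1][1] + 1)
        else:
            runs.append((off, 1))
    return runs


domain = ((1, 4), (1, 4))

# Reading one full row is a single scan of four adjacent cells...
row_slice = contiguous_runs(subarray_offsets(((2, 2), (1, 4)), domain))
# ...but reading one full column needs four separate single-cell reads.
col_slice = contiguous_runs(subarray_offsets(((1, 4), (2, 2)), domain))
```

The same number of cells costs one read in the first query and four in the second, which is the kind of imbalance a locality-preserving serialization tries to reduce.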
TileDB achieves efficient IO for subarray queries in two ways:
Tiling: It intelligently and flexibly groups cells into hyper-rectangular tiles. Each tile is the atomic unit of IO and compression.
Columnar layout: When writing the array data to files/objects, it stores the values of each attribute separately. This allows for more effective compression and selection of attributes.
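Both ideas can be sketched together in plain Python. The example below assumes a dense 4x4 array split into 2x2 tiles and two made-up attributes, `a1` and `a2`; the tile extents, the in-memory dictionaries, and the function names are illustrative assumptions and do not reflect TileDB's actual file format or API.

```python
def tile_of(coords, domain, tile_extents):
    """Return the tile index (one entry per dimension) that contains a cell."""
    return tuple(
        (coord - low) // extent
        for coord, (low, _high), extent in zip(coords, domain, tile_extents)
    )


def tiles_for_subarray(ranges, domain, tile_extents):
    """List the tiles overlapped by a 2D subarray given as inclusive ranges."""
    first = tile_of([low for low, _ in ranges], domain, tile_extents)
    last = tile_of([high for _, high in ranges], domain, tile_extents)
    return [
        (i, j)
        for i in range(first[0], last[0] + 1)
        for j in range(first[1], last[1] + 1)
    ]


domain = ((1, 4), (1, 4))
tile_extents = (2, 2)  # assumed tile shape for this sketch

# Dense 4x4 array with two hypothetical attributes per cell.
cells = {
    (r, c): {"a1": 10 * r + c, "a2": f"v{r}{c}"}
    for r in range(1, 5)
    for c in range(1, 5)
}

# Columnar layout: each attribute gets its own store, organized tile by tile.
columns = {"a1": {}, "a2": {}}
for coords in sorted(cells):
    tile = tile_of(coords, domain, tile_extents)
    for name, value in cells[coords].items():
        columns[name].setdefault(tile, []).append(value)


def read_tiles(ranges, attr):
    """Fetch only the tiles of one attribute that overlap the subarray."""
    return {
        tile: columns[attr][tile]
        for tile in tiles_for_subarray(ranges, domain, tile_extents)
    }
```

A query for the subarray [1,2] x [1,2] on `a1` touches a single tile and never reads any `a2` values. A query that straddles tile boundaries fetches every overlapping tile whole, and a real reader would then discard the cells that fall outside the requested ranges.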
Multi-dimensional arrays are able to represent the vast majority of data in a natural way. Examples include, but are not limited to, the following:
Image: 2D dense array, where each cell is a pixel.
Video: 3D dense array; 2D frames across time (third dimension).
LiDAR: 3D sparse array where each point is a non-empty cell.
Genomics (gVCF): 2D sparse array, storing variants of samples at genomics positions.
Graph: 2D adjacency matrix.
Dataframe: ND sparse array, see more detailed discussion in Dataframes.
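As one illustration of these mappings, a graph can be modeled as a 2D sparse array whose coordinates are (source, target) pairs and whose attribute is the edge weight. The sketch below uses plain Python with invented node ids and weights; it shows the modeling idea only, not how TileDB stores or queries such an array.

```python
# Hypothetical directed graph with 4 nodes; ids and weights are invented.
NUM_NODES = 4
edges = [(1, 2, 0.5), (1, 3, 1.0), (3, 1, 2.0), (4, 2, 0.25)]

# Sparse adjacency matrix: only existing edges become non-empty cells.
adjacency = {(src, dst): weight for src, dst, weight in edges}


def out_neighbors(node):
    """Slice one row of the matrix: the subarray [node, node] x [1, NUM_NODES]."""
    return {dst: w for (src, dst), w in sorted(adjacency.items()) if src == node}


def in_neighbors(node):
    """Slice one column of the matrix: the subarray [1, NUM_NODES] x [node, node]."""
    return {src: w for (src, dst), w in sorted(adjacency.items()) if dst == node}


# Fraction of non-empty cells; a low value is what makes the sparse model a good fit.
density = len(adjacency) / (NUM_NODES * NUM_NODES)
```

Neighborhood lookups become row and column slices, which is the kind of range query the array model is designed around.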
Adopting a universal format like arrays and a storage engine like TileDB has multiple benefits:
Arrays and dataframes are a natural data representation for most popular Data Science tools (e.g., Python NumPy/pandas, R/Spark/Dask dataframes, etc.). Therefore, natively representing data as arrays in persistent storage can greatly improve performance when fetching the data into such tools, since expensive data conversion can be avoided.
You can take advantage of all TileDB features (e.g., encryption, efficient integration with Data Science tools, etc.) and performance improvements (e.g., parallelism, cloud-optimized IO, compute push-down, etc.), without having to reinvent the wheel specifically for your custom format.
You can define access policies and log activity universally with array semantics at the storage level, without having to rebuild such functionality in all the different computation tools you use and for every application domain.
Some of the TileDB features include:
Both dense and sparse: TileDB introduces a novel multi-dimensional array format that effectively handles both dense and sparse data, exposing a unified array API.
Columnar layout: TileDB is columnar, enabling compressibility and efficient attribute subselection.
Cloud-optimized: TileDB is architected with the challenges of cloud storage in mind, such as object immutability, eventual consistency, IO request limitations per object prefix, etc.
Efficient data updates: TileDB writes updates as immutable, append-only fragments, an approach built around the concept of LSM-Trees, which allows for rapid updates while maintaining read performance.
Compression: TileDB offers high compression ratios while allowing efficient slicing via its tile-based approach. TileDB can compress array data with a growing number of compressors, such as GZIP, BZIP2, LZ4, ZStandard, double-delta and run-length encoding.
Encryption: TileDB allows you to encrypt your data at rest using AES256 encryption, adding a strong layer of security to your storage stack.
Parallelism: All the TileDB internals (e.g., compression/decompression, encryption/decryption, file I/O, slicing, etc.) are fully parallelized with Intel TBB. TileDB’s thread-/process-safe and asynchronous writes and reads enable users to build powerful parallel analytics on top of the TileDB storage engine.
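Two of the compressors named in the compression entry above, run-length encoding and double-delta, are simple enough to sketch in plain Python. The versions below are textbook simplifications written for illustration: they are not TileDB's implementations, they work on Python lists rather than bit-packed buffers, and the function names are hypothetical.

```python
def rle_encode(values):
    """Collapse consecutive repeats into (value, run_length) pairs."""
    runs = []
    for value in values:
        if runs and runs[-1][0] == value:
            runs[-1][1] += 1
        else:
            runs.append([value, 1])
    return [tuple(run) for run in runs]


def rle_decode(runs):
    """Expand (value, run_length) pairs back into the original sequence."""
    return [value for value, length in runs for _ in range(length)]


def double_delta_encode(values):
    """Keep the first value and first delta, then store changes between deltas."""
    if len(values) < 2:
        return list(values)
    deltas = [b - a for a, b in zip(values, values[1:])]
    return [values[0], deltas[0]] + [b - a for a, b in zip(deltas, deltas[1:])]


def double_delta_decode(encoded):
    """Rebuild the original sequence from a double-delta encoding."""
    if len(encoded) < 2:
        return list(encoded)
    values = [encoded[0], encoded[0] + encoded[1]]
    delta = encoded[1]
    for change in encoded[2:]:
        delta += change
        values.append(values[-1] + delta)
    return values


# Nearly regular timestamps turn into small numbers that are cheap to store.
timestamps = [100, 110, 120, 130, 141, 151]
encoded = double_delta_encode(timestamps)
```

Run-length encoding pays off on tiles with long stretches of identical values, while double-delta suits slowly varying sequences such as timestamps or sorted coordinates. Applying either one tile at a time is what lets a reader decompress only the tiles a query touches.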