The TileDB data format implements the array data model and has the following high-level goals:
Efficient storage with support for compressors and other filters (e.g., encryption, error correction codes, etc.)
Efficient ND slicing
Efficient access on cloud object stores (in addition to other backends)
Support for data versioning and time traveling
The TileDB data format is open-spec and can be found here (development branch), or in path
<version> is the TileDB Embedded version you are looking for (e.g.,
TileDB employs a multi-file data format, organized in folder hierarchies (or object prefix hierarchies on cloud object stores). A multi-file format is absolutely necessary for supporting fast concurrent writes, especially on cloud object stores (e.g., AWS S3) where all objects are immutable (i.e., to change a single byte in a TB-long object, you end up rewriting the entire TB-long object).
Each array is a folder that contains an array schema file (with information describing the array dimensions and attributes with their data types, along with other important metadata), and a set of timestamped subfolders called fragments. Each fragment corresponds to a batch write to the array, which can happen in any subdomain of the entire array domain (i.e., a fragment does not need to contain cells for the entire array domain). In addition, a fragment may correspond to a set of consolidated fragments, an important operation for optimizing performance in certain applications, described later. The fragment name consists of a random UUID, a timestamp range (in ms elapsed since epoch) and the format version of the fragment (for backwards compatibility). Moreover, each fragment is accompanied by an empty “ok” file, which is important for (eventual) consistency, to be discussed later. The figure below depicts an example of a 2D dense array with two attributes and 2 fragments, along with its file structure.
It is important to note that each fragment is self-contained, i.e., it includes all the information necessary to describe everything about that particular fragment. A dense array may be comprised of both dense and sparse fragments, whereas a sparse array may contain only sparse fragments. The next two sections explain in turn the contents of dense and sparse fragments.
The figure below shows the file organization within a dense fragment folder for the example above, where the two attributes are named
a2 (var-sized string). TileDB stores the values of each attribute in separate files, i.e., it is a “columnar” format (similar to Parquet). Every fixed-sized attribute generates one file, whereas every var-sized attribute generates two (one for the actual data and one for the starting positions of each var-sized value). Moreover, each fragment folder contains a fragment metadata file, which contains information about the non-empty domain (i.e., the bounding box where cell values are written into) and other important metadata that will be explained later.
To better understand the layout of the cell values in the data files inside each fragment, we must describe two important topics for performance: tiling and sorting. Most big data applications require compression, encryption, error correction codes and other filters on the data. For example, compression can lower the storage cost on cloud object stores and improve IO performance. The figure below shows the effect of tiling when slicing a subarray. If there is no tiling and we compress the data of the whole array, the whole data will need to be fetched into main memory, decompressed and processed to retrieve the results. By tiling and compressing each tile separately, we can fetch and decompress less data to get the results, which translates to better overall performance.
Now let’s turn to sorting, using the figure below that demonstrates its effects (the arrows specify the order of the cells in the file). Any storage medium is single-dimensional, i.e., addressable in a linear manner. However, a ND array is multi-dimensional. As such, there needs to be a way to map the multi-dimensional cells to a 1D order and use it to serialize the cell attribute values in their corresponding files, while still offering fast multi-dimensional slicing. The key is to define a 1D order that preserves the multi-dimensional locality of the cells for the typical shape of the slices. Ideally, each multi-dimensional slice should contain cells that appear contiguous in the file, as this will lead to the most efficient possible IO performance (e.g., minimizing the number of IO requests and hence the accumulated latency and cost).
It should already be evident that both tiling and the cell order must be specified depending on the application and slicing shapes. As such, TileDB flexibly allows the user to define any tiling and sorting via three parameters: (1) the space tile extents along each dimension, (2) the order of the cell in each tile (row-major means that the first dimension index runs the slowest and the last the fastest, whereas column-major means that the first dimension index runs the fastest and the last the slowest), and (3) the order of the tiles (row-major and column-major have similar definitions as above, but now on the tiles instead of the cells). These three parameters determine a unique global order for an array, which is essentially a space-filling curve (similar to Hilbert and z-order). The cell values are stored in the files in the same order as the visiting order of the space-filling curve. The figure below shows various global orders for different parameter settings.
Resuming our original 2D dense array example, the figure below shows the physical layout of the cell values in the data files, when the tile extents are 2x2 and both the tile and cell order are row-major. TileDB applies all the user-specified filters (which may include compression, encryption, error correction codes and more) on each tile. All the various filters and parameters are stored either in the array schema file (if they apply to all fragments, such as the tile extents, cell and tile order, filters, etc) or in the fragment metadata file (e.g., non-empty domain, offset and size of each compressed tile in the file, etc). This metadata is lightweight but suffices for quickly locating the relevant tiles given any ND slicing query during reading.
Sparse fragments are very similar to dense. One of the most important differences is that sparse fragments may contain empty cells, which are not materialized in the data files. As such, there should be a way to distinguish between empty and non-empty cells. TileDB achieves that by explicitly materializing the coordinates of the non-empty cells in extra data files, one per dimension. This is shown in the figure below.
Specifying a global order that determines the physical layout of the cell values in sparse fragments is identical to the dense case, i.e., a space-filling curve is defined by the space tile extents and the tile and cell order. This is shown in the figure below.
What changes in the sparse case is the physical tiling, i.e., the grouping of the cell values into tiles on file. Contrary to the dense case, in sparse arrays the user needs to specify a fourth parameter, called data tile capacity. In the dense case the data tile capacity is the same as the size of the space tile, but in the sparse case it may be different. The reason is that it is important to store the sparse values in data tiles of equal physical size, in order to address data skew and load-balance IO and compression / decompression upon writes and reads. The physical layout of the previous 2D sparse array example is shown below, assuming that the data tile capacity is set to 2 cells.
Even with materialized coordinates for the non-empty cells, it will be very challenging to achieve fast slicing without extra indexing information. TileDB employs the R-tree as its multi-dimensional index and incorporates it into the data format. Specifically, TileDB creates a minimum bounding rectangle (MBR) on the coordinates of each tile separately, and then hierarchically groups the MBRs into larger MBRs to create a tree hierarchy. The shape and size of the MBRs (and hence the effectiveness of the R-tree) depends on the global cell order and the data tile capacity, and thus must be tuned depending on the typical slicing workloads for each application. The R-tree is typically orders of magnitude smaller than the written fragment and, therefore, it can be easily fetched and stored in main-memory. The R-tree can be created very fast given the sorted order of the cells, and it is stored in the fragment metadata file inside the fragment folder.
As an additional note about R-trees, recall that each fragment in TileDB is immutable and that each new batch write creates a new fragment folder. This architectural decision allows TileDB to write concurrently from many writers, creating immutable R-trees in a bulk-loaded fashion (since it first sorts to create the MBRs), thus avoiding the expensive incremental updates of R-trees. This makes TileDB quite powerful in terms of both writes and reads.
TileDB supports storing small metadata in the form of key-value pairs in each array (noting that if you need support for a large number of key-value pairs, then you should consider storing them as a 1D sparse array as explained in the array model). This metadata is stored in a subfolder called
__meta inside the array folder, simply serialized in binary files. Those files are timestamped in the same manner as fragments for the same reasons (immutability, concurrent writes and time traveling). The metadata file organization is shown in the figure below.
Since the TileDB format is folder-based, we can take it one step further and organize multiple arrays in arbitrary folder hierarchies, called groups. A TileDB group is simply a folder with a special empty file and any number of nested groups and arrays. The figure below shows some examples.
The fact that the TileDB format is folder-based allows for very fast and easy copying between storage backends. For example, you can upload or download an entire array or group between your local storage and S3 using CLI command
aws s3 sync src_uri dest_uri.
TileDB allows the user to specify an ordered list of data transformations (such as compression, encryption, etc.) that can be applied to data tiles before they are written to disk. The user may define different filters for each attribute, the coordinates and the data tile offsets, all of which are specified in the array schema.
Before filtering each data tile of an attribute, TileDB internally divides the tile into disjoint chunks. These chunks are then filtered individually. Chunking allows for better cache behavior in terms of temporal locality, as the chunk size can be chosen to fit within the L1 cache of your processor cores. This helps especially with multi-stage filter lists, as the output from the previous filter is likely to still be in L1 when used as input for the current filter. Chunking also leads to better parallelism, as each chunk can be filtered in parallel, which can result in excellent CPU utilization when combined with the cache-friendly size of the chunks. The default chunk size used by TileDB is 64KB, which is the size of many common processor L1 caches.
TileDB supports a number of filters, described below, and more will continue to be added in the future.
There are several filters performing generic compression, which are the following:
GZIP: Compresses with Gzip
ZSTD: Compresses with Zstandard
LZ4: Compresses with LZ4
RLE: Compresses with run-length encoding
BZIP2: Compresses with Bzip2
DOUBLE_DELTA: Compresses with double-delta encoding
The byte shuffling process rearranges the bytes of the input attribute cell values in a deterministic and reversible manner designed to result in long runs of similar bytes that can be compressed more effectively by a generic compressor than the original unshuffled elements. Typically this filter is not used on its own, but rather immediately followed by a compression filter in a filter list.
For example, consider three 32-bit unsigned integer values
1, 2, 3, which have the following little-endian representation when stored adjacent in memory:
0x01 0x00 0x00 0x00 0x02 0x00 0x00 0x00 0x03 0x00 0x00 0x00
The byte shuffle operation will rearrange the bytes of these integer elements in memory such that the resulting array of bytes will contain each element’s first byte, followed by each element’s second byte, etc. After shuffling the bytes would therefore be:
0x01 0x02 0x03 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00
Observe the longer run of zero-valued bytes, which will compress more efficiently.
Bitshuffling is conceptually very similar to byteshuffling, but operates on the bit granularity rather than the byte granularity. Shuffling at the bit level can increase compression ratios even further than the byteshuffle filter, at the cost of increased computation to perform the shuffle.
Typically this filter is not used on its own, but rather it is immediately followed by a compression filter in a filter list.
This filter performs positive-delta encoding. Positive-delta encoding is a form of delta encoding that only works when the delta value is positive. Positive-delta encoding can result in better compression ratios on the encoded data. Typically this filter is not used on its own, but rather immediately followed by a compression filter in a filter list.
For example, if the data being filtered was the sequence of integers
100, 104, 108, 112, ..., then the resulting positive-encoded data would be
0, 4, 4, 4, .... This encoding is advantageous in that producing long runs of repeated values can result in better compression ratios, if a compression filter is added after positive-delta encoding.
The filter operates on a “window” of values at a time, which can help in some cases to produce longer runs of repeated delta values.
This filter performs bit-width reduction, which is a form of compression.
Bit-width reduction examines a window of attribute values, and determines if all of the values in the window can be represented by a datatype of smaller byte size. If so, the values in the window are rewritten as values of the smaller datatype, potentially saving several bytes per cell.
For example, consider an attribute with datatype
uint64. Initially, each cell of data for that attribute requires 8 bytes of storage. However, if you know that the actual value of the attribute is often 255 or less, those cells can be stored using just a single byte in the form of a
uint8, saving 7 bytes of storage per cell. The bit-width reduction filter performs this analysis and compression automatically over windows of attribute data.
Additionally, each cell value in a window is treated relative to the minimum value in that window. For example, if the window size was 3 cells, which had the values
300, 350, 400, the bit-width reduction filter would first determine that the minimum value in the window was
300, and the relative cell values were
0, 50, 100. These relative values are now less than 255 and can be represented by a
If possible, it can be a good idea to apply positive-delta encoding before bit-width reduction, as the positive-delta encoding may further increase the opportunities to store windows of data with a narrower datatype.
TileDB allows you to encrypt your arrays at rest. It currently supports a single type of encryption, AES-256 in the GCM mode, which is a symmetric, authenticated encryption algorithm. When creating, reading or writing arrays you must provide the same 256-bit encryption key. The authenticated nature of the encryption scheme means that a message authentication code (MAC) is stored together with the encrypted data, allowing verification that the persisted ciphertext was not modified.
TileDB also has checksum support for checking data integrity. It currently supports MD5 and SHA256.
Crypto libraries used:
The TileDB format enables an important optimization, which is especially pronounced in cloud object stores. Since an array may consist of numerous fragments, a slicing query should be able to discard irrelevant fragments as quickly as possible. One way to do this is by checking the overlap of the slice with the non-empty domain of each fragment. This information is stored in the metadata files in the fragment folders. If there are numerous fragments, then there will be numerous REST requests to the object store to retrieve this information.
To solve this issue, TileDB enables consolidating the footers of the fragment metadata files into a single file with format
__t1_t2_uuid_v.meta. The file contains the fragment URIs whose footers are included, along with the footers in serialized binary form. The footers contain only very small information about the fragments, such as the non-empty domain and other light metadata.
t1 is the first timestamp of the first fragment whose metadata is being consolidated, and
t2 is the second timestamp of the last fragment. Upon opening an array, regardless of the number of fragments, TileDB can fetch this single small file in main memory with a single REST request. In other words, the TileDB format has mechanisms for making the communication protocol with the object store more lightweight.
The concept of immutable fragments allows you to write concurrently to a single array, but also to time travel (i.e., see versions of the array in between updates). However, numerous fragments may take a toll on performance, as cell values across fragments lose spatial locality in the underlying files (leading to more expensive IO). To mitigate this problem, TileDB supports consolidation, a process that merges a subset of fragments into a single one. The new fragment incorporates in its name the time range between the first and the last fragment it encompasses, e.g., the name may look like
t1 is the first timestamp of the first fragment being consolidated, and
t2 is the second timestamp of the last fragment.
Along with the consolidated fragment, consolidation produces also a vacuum file
__t1_t2_uuid_v.vac, i.e., with the same name as the consolidated fragment with added suffix
.vac. This file contains the URIs of the fragments that were consolidated. The user can choose to retain the consolidated fragments (for time traveling purposes) or vacuum them by deleting them. The
.vac files are used in the vacuum process so that only consolidated fragments get deleted.
The array metadata consolidation is identical to fragment consolidation. This is because metadata is stored in the same underlying format with timestamped fragments, so the same consolidation algorithm applies.