Key Concepts & Data Format
To take full advantage of the power of the TileDB Open Source engine, it is important to understand the key concepts of TileDB and the way they are reflected on persistent storage.
The TileDB data format implements the data model and has the following high-level goals:
Efficient storage with support for compressors, encryption, etc.
Efficient multi-dimensional slicing (i.e., fast reads)
Efficient access on cloud object stores (in addition to other backends)
Support for data versioning and time traveling (built into the format)
The TileDB data format is open-spec. The specification can be found at https://github.com/TileDB-Inc/TileDB/blob/<version>/format_spec/FORMAT_SPEC.md, where <version> is the TileDB Open Source version (e.g., 2.6.3) or branch (e.g., the development branch) you are looking for.
Arrays
TileDB employs a multi-file data format, organized in directory hierarchies (or object prefix hierarchies on cloud object stores). A multi-file format is absolutely necessary for supporting fast concurrent writes, especially on cloud object stores (e.g., AWS S3) where all objects are immutable (i.e., to change a single byte in a TB-long object, you end up rewriting the entire TB-long object).
The general file structure for arrays looks as follows:
The main array components are the following:
Array schema (__schema)
Fragments (__fragments)
Consolidated fragment metadata (__fragment_meta)
Commits (__commits)
Array metadata (__meta)
The fact that the TileDB format is directory-based allows for very fast and easy copying between storage backends. For example, you can transfer an entire array or group between your local storage and AWS S3 using the CLI command aws s3 sync src_uri dest_uri.
Arrays can be hierarchically organized into groups.
Array schema
The array schema contains important information about the array definition, such as the number, types, and names of the dimensions and attributes, the domain, and more.
The file structure within the array schema directory is as follows:
<timestamped_name> has format __timestamp_timestamp_uuid, where:
timestamp is a timestamp in milliseconds elapsed since 1970-01-01 00:00:00 +0000 (UTC)
uuid is a unique identifier
The timestamped files allow for versioning and time traveling. Each fragment (explained below) is associated with one of these array schema files.
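As a concrete illustration, here is a minimal sketch using the TileDB-Py API that defines and creates a simple 2D dense array schema; the URI my_array and the dimension/attribute names are arbitrary choices for this example, not part of the format itself.

```python
import numpy as np
import tiledb

# Define a 2D domain with two int32 dimensions and 2x2 space tiles.
dom = tiledb.Domain(
    tiledb.Dim(name="d1", domain=(0, 3), tile=2, dtype=np.int32),
    tiledb.Dim(name="d2", domain=(0, 3), tile=2, dtype=np.int32),
)

# Define two fixed-sized attributes. A var-sized attribute could be
# declared with, e.g., dtype=str.
schema = tiledb.ArraySchema(
    domain=dom,
    attrs=[
        tiledb.Attr(name="a1", dtype=np.int32),
        tiledb.Attr(name="a2", dtype=np.float64),
    ],
    sparse=False,
)

# Creating the array persists a timestamped schema file under __schema.
tiledb.Array.create("my_array", schema)
```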
Fragments
A fragment represents a timestamped write to a TileDB array. The figure below shows writes performed in a dense and a sparse array at two different timestamps t1 and t2 > t1. The logical array view when the user reads the array at any timestamp t3 >= t2 contains all the cells written in the array before t3, with the more recent cells overwriting the older cells. In the special case of sparse arrays that accept duplicates (which can be specified in the array schema), if a cell was written more than once, all cell replicas are maintained and returned to the user upon reads.
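To make this concrete, here is a hedged TileDB-Py sketch that builds on the dense array created earlier (the URI and the values are illustrative). Each write creates a separate fragment, and opening the array at an earlier timestamp time travels to the state after the first write only.

```python
import time
import numpy as np
import tiledb

# First write: creates an immutable fragment under __fragments at timestamp t1.
with tiledb.open("my_array", mode="w") as A:
    A[0:2, 0:2] = {"a1": np.ones((2, 2), dtype=np.int32),
                   "a2": np.full((2, 2), 1.5)}

# Capture a wall-clock timestamp (ms since the epoch) between the two writes.
t_between = int(time.time() * 1000)

# Second write: creates another fragment at timestamp t2 > t1.
with tiledb.open("my_array", mode="w") as A:
    A[2:4, 2:4] = {"a1": np.full((2, 2), 2, dtype=np.int32),
                   "a2": np.full((2, 2), 2.5)}

# Time traveling: open the array as of t_between to see only the first fragment.
with tiledb.open("my_array", mode="r", timestamp=t_between) as A:
    print(A[:]["a1"])  # reflects only the first write
```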
A fragment is stored as a directory with the following structure:
<timestamped_name> has format __t1_t2_uuid_v, where:
t1 and t2 are timestamps in milliseconds elapsed since 1970-01-01 00:00:00 +0000 (UTC)
uuid is a unique identifier
v is the format version
The fragment metadata file (__fragment_metadata.tdb) stores important data about the fragment, such as the name of its array schema, its non-empty domain, indexes, and other information that facilitates fast reads.
The array cell data (attribute values and dimension coordinates) are stored in files inside the fragment directory. There are the following types of files:
fixed-sized attribute data files, named a1.tdb, a2.tdb, ...
var-sized attribute data files, which are pairs of the form (a1.tdb, a1_var.tdb), (a2.tdb, a2_var.tdb), ... . The second *_var.tdb file of the pair contains the var-sized values, whereas the first contains the starting byte offsets of each value in the second file (see the sketch after this list).
fixed-sized dimension data files, named d1.tdb, d2.tdb, ...
var-sized dimension data files, which are pairs of the form (d1.tdb, d1_var.tdb), (d2.tdb, d2_var.tdb), ... . The second *_var.tdb file of the pair contains the var-sized values, whereas the first contains the starting byte offsets of each value in the second file.
validity data files for nullable attributes, named a1_validity.tdb, a2_validity.tdb, ..., associated with attributes a1, a2, ..., respectively. The validity files are simple byte vectors that indicate whether each cell value is null or not. The validity files are applicable to both fixed- and var-sized attributes, but not to dimensions. They are also optional; the user may or may not specify an attribute as nullable.
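To illustrate how the offsets file relates to the var-sized data file, here is a small self-contained sketch (plain NumPy, not the TileDB API) that computes the starting byte offsets for a few hypothetical variable-length values:

```python
import numpy as np

# Hypothetical var-sized values of an attribute (e.g., strings of varying length).
values = [b"red", b"green", b"blue"]

# Contents conceptually stored in a1_var.tdb: the concatenated bytes.
var_bytes = b"".join(values)                                   # b"redgreenblue"

# Contents conceptually stored in a1.tdb: the starting byte offset of each value.
offsets = np.cumsum([0] + [len(v) for v in values[:-1]], dtype=np.uint64)
print(var_bytes, offsets)                                      # b'redgreenblue' [0 3 8]
```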
A dense array does not materialize the dimension coordinates, whereas a sparse array must. The figure below shows a dense and a sparse array example. Observe that each cell value across all dimensions and attributes appears in the same absolute position across the corresponding files.
This layout where values of the same type are grouped together is ideal for compression, vectorization and subsetting on attributes/dimensions during queries (similar to other columnar databases).
The layout of the cell attribute values and dimension coordinates in the described files is very important for maximizing read performance, and is explained in detail in section Data layout along with the concept of tiles.
Consolidated Fragment Metadata
Each fragment metadata file (__fragment_metadata.tdb) contains a small footer with lightweight indexing information, such as the non-empty domain of the fragment. This information can serve as another layer of indexing when issuing a slicing (read) query.
When opening the array to issue a read query, TileDB brings this lightweight indexing information into main memory to facilitate rapid result retrieval. If there are numerous fragments in the array, retrieving the footer from each fragment metadata file may be time-consuming.
To mitigate this issue, TileDB offers consolidation of fragment metadata. This operation groups the footers from all fragment metadata files into a single file, stored in the __fragment_meta folder as shown below.
<timestamped_name>.meta has format __t1_t2_uuid_v, where:
t1 and t2 are the timestamps in milliseconds elapsed since 1970-01-01 00:00:00 +0000 (UTC) of the oldest and most recent fragment whose fragment metadata footer was consolidated in this file
uuid is a unique identifier
v is the format version
Upon opening the array, TileDB first reads the __fragment_meta/*.meta files to retrieve the footers, eliminating the need to get the individual footers from each fragment metadata file and resulting in a considerable boost in array opening performance.
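As a hedged sketch with TileDB-Py (the array URI is illustrative; sm.consolidation.mode is the standard configuration parameter for choosing what gets consolidated), fragment metadata consolidation can be triggered as follows:

```python
import tiledb

# Consolidate only the fragment metadata footers into a __fragment_meta/*.meta file.
cfg = tiledb.Config({"sm.consolidation.mode": "fragment_meta"})
tiledb.consolidate("my_array", config=cfg)
```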
Commits
When a new fragment is successfully created, a file <timestamped_name>.wrt is created, where <timestamped_name> is the same as the name of the corresponding fragment folder created in __fragments. Since there may be numerous fragments created in an array, TileDB allows for consolidating the commit files into a single file <timestamped_name>.con, which contains the names of the fragments whose commits are being consolidated. The name of the consolidated file contains the timestamps of the first and last commit file it consolidated. The consolidated commit file helps reduce the number of __commits/*.wrt files, which further boosts the performance of opening the array for reading.
Array metadata
TileDB supports storing array metadata in the form of key-value pairs in each array. This metadata is stored in a directory called __meta inside the array directory, with all key and value binary data items serialized into a generic tile with GZIP compression. Those files are timestamped in the same manner as fragments, for the same reasons (immutability, concurrent writes, and time traveling). The metadata file organization is shown below.
<timestamped_name> has format __t1_t2_uuid_v, where:
t1 and t2 are timestamps in milliseconds elapsed since 1970-01-01 00:00:00 +0000 (UTC)
uuid is a unique identifier
v is the format version
The vacuum files (*.vac) are explained in section Consolidation.
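For reference, here is a minimal TileDB-Py sketch that writes and reads array metadata key-value pairs (the keys and values are arbitrary examples); each metadata write produces a new timestamped file under __meta.

```python
import tiledb

# Writing key-value metadata creates a timestamped file in __meta.
with tiledb.open("my_array", mode="w") as A:
    A.meta["units"] = "meters"
    A.meta["version"] = 3

# Reading the metadata back.
with tiledb.open("my_array", mode="r") as A:
    print(A.meta["units"], A.meta["version"])
```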
Groups
The general file structure for groups looks as follows:
There are three components:
__tiledb_group.tdb is an empty file indicating that my_group is a TileDB group.
__meta stores key-value metadata associated with the group. This functionality is identical to that of array metadata.
__group contains timestamped files (with a similar structure to those described above for all other components, such as fragments, consolidated fragment metadata, etc.), which store the absolute paths of other arrays and groups.
Note that a group can logically contain other nested groups and arrays. Their paths are included in the files in __group. However, the physical location of the actual groups and arrays may be in paths other than inside the group location on storage. This provides a lot of flexibility in dynamically grouping various arrays and groups, especially on (cloud) object storage, without physically moving enormous quantities of data from one physical path to another.
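A minimal TileDB-Py sketch for creating a group and registering members follows; the URIs and the member name are illustrative, and the Group API shown here assumes a recent TileDB-Py version.

```python
import tiledb

# Create an empty group (writes __tiledb_group.tdb under my_group).
tiledb.group_create("my_group")

# Register an existing array as a member; member URIs may live anywhere,
# including object-store prefixes outside the group location.
with tiledb.Group("my_group", mode="w") as G:
    G.add("my_array", name="array1")

# Inspect the group members.
with tiledb.Group("my_group", mode="r") as G:
    print(len(G))     # number of members
    print(G[0].uri)   # URI of the first member
```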
Data layout
Typical queries slice subsets of the array values (i.e., they select subarrays). Therefore, TileDB does not compress the whole attribute / dimension files as single blobs. Instead, it tiles (i.e., chunks) them into smaller blocks. The cells in each tile appear contiguously in the corresponding files, as shown in the figure above.
A tile is the atomic unit of IO and compression (as well as of other potential filters, such as encryption, checksums, etc.).
A question arises: how do we order the values of a multi-dimensional space inside files that are single-dimensional? This order dictates the position of each cell within its corresponding tile, and the position of the tiles in the file. We call this order the global order of the cells in the array. The global order and tiling for dense and sparse arrays are explained separately below.
In a dense array, the global order and tiling are determined by three parameters:
The space tile extent per dimension
The cell order inside each tile
The tile order
The figure below depicts three different orders as we vary the above parameters. Observe that the shape and size of each tile is dictated solely by the space tile extents.
The sparse array case is slightly different, since tiling based only on space tiles may lead to highly imbalanced numbers of non-empty cells per tile, which can in turn impact compressibility and slicing performance.
In sparse arrays, the global order could be determined as follows:
By specifying the same three parameters as in the dense case, or
By using a Hilbert space-filling curve
Once the global order is determined, the tiling is specified by an extra parameter called the tile capacity, i.e., the number of non-empty cells in each tile. The figure below depicts different global orders for different choices of the above-mentioned parameters for sparse arrays (a non-empty cell is depicted in dark blue).
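To make these parameters concrete, the hedged sketch below defines a dense schema with explicit space tile extents and cell/tile orders, and a sparse schema that uses the Hilbert order together with a tile capacity (the names, extents, and values are illustrative).

```python
import numpy as np
import tiledb

# Dense: global order determined by space tile extents (tile=2), cell order and tile order.
dense_dom = tiledb.Domain(
    tiledb.Dim(name="d1", domain=(1, 8), tile=2, dtype=np.int32),
    tiledb.Dim(name="d2", domain=(1, 8), tile=2, dtype=np.int32),
)
dense_schema = tiledb.ArraySchema(
    domain=dense_dom,
    attrs=[tiledb.Attr(name="a", dtype=np.float64)],
    cell_order="row-major",
    tile_order="row-major",
    sparse=False,
)

# Sparse: global order via the Hilbert space-filling curve, tiling via the tile capacity.
sparse_dom = tiledb.Domain(
    tiledb.Dim(name="d1", domain=(1, 8), dtype=np.int32),
    tiledb.Dim(name="d2", domain=(1, 8), dtype=np.int32),
)
sparse_schema = tiledb.ArraySchema(
    domain=sparse_dom,
    attrs=[tiledb.Attr(name="a", dtype=np.float64)],
    cell_order="hilbert",   # Hilbert order (sparse arrays only)
    capacity=1000,          # non-empty cells per data tile
    sparse=True,
)
```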
Why are the global order and tiling such a big deal? The global order should retain, as much as possible, the co-locality of your query results for the majority of your typical slicing shapes. Remember, the array is multi-dimensional, whereas the file storing the array data is single-dimensional. You have a single chance (unless you want to pay for redundancy) to sort your data in a single 1D order, and that order absolutely dictates the performance of your multi-dimensional queries. The reason is that the closer your results appear in the file, the faster the IO operations that retrieve them. The size of the tile also affects performance, since tiles are fetched from storage to memory in their entirety. The examples below demonstrate some good and bad global orders and tilings for a given slice, focusing on a dense array (similar arguments can be made for the sparse case).
Indexing
Now that we have explained the on-disk format, how do we efficiently slice an array and what indices does TileDB build to facilitate the query? First we focus on a dense array and use the example in the figure below. In addition to the slicing query, we know the following from the array schema: the number of dimensions, the global order, the tiling, and the fact that there are no empty cells in the array. Using solely the array schema and with simple arithmetic, we can calculate the number, size and location of cell slabs (i.e., sets of contiguous cells on disk) that comprise the query result, without any additional index. TileDB implements an efficient multi-threaded algorithm that can fetch the relevant tiles from disk, decompress, and copy the cell slabs into the result buffers, all in parallel.
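For instance, slicing a dense array in TileDB-Py reduces to a simple subarray read; the engine computes the relevant tiles and cell slabs internally (the URI below is the illustrative array from the earlier sketches):

```python
import tiledb

# Read a 2x2 subarray; TileDB computes which tiles and cell slabs to fetch
# directly from the array schema, with no extra index needed for dense arrays.
with tiledb.open("my_array", mode="r") as A:
    result = A[0:2, 0:2]      # dict of NumPy arrays, one per attribute
    print(result["a1"])
```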
Slicing in sparse arrays is more difficult because we do not know the location of empty cells until the array is written. Therefore, unlike dense arrays, we need to explicitly store the coordinates of the non-empty cells, and build an index on top of them. The index must be small in size so that it can be quickly loaded in main memory when the query is submitted. In TileDB we use an R-tree as an index. The R-tree groups the coordinates of the non-empty cells into minimum bounding rectangles (MBRs), one per tile, and then recursively groups those MBRs into a tree structure. The figure below shows an example. The slicing algorithm then traverses the R-tree to find which tile MBRs overlap the query, fetches in parallel those tiles from storage, and decompresses them. Then for every partially overlapping tile, the algorithm needs to further check the coordinates one by one to determine whether they fall inside the query slice. TileDB implements this simple algorithm with multi-threading and vectorization, leading to extremely efficient multi-dimensional slicing in sparse arrays.
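The following is a purely illustrative sketch (plain NumPy, not the actual TileDB implementation) of how the leaf level of such an R-tree could be formed, by grouping the sparse coordinates of each data tile into a minimum bounding rectangle:

```python
import numpy as np

# Hypothetical coordinates of non-empty cells, sorted in the global order.
coords = np.array([[1, 1], [1, 2], [2, 7], [3, 8], [6, 2], [7, 1]])
capacity = 2  # non-empty cells per data tile

# Leaf level of the R-tree: one minimum bounding rectangle (MBR) per tile.
mbrs = []
for start in range(0, len(coords), capacity):
    tile = coords[start:start + capacity]
    mbrs.append((tile.min(axis=0), tile.max(axis=0)))  # (lower corner, upper corner)

# A read query first checks which MBRs overlap the query slice,
# then inspects only the coordinates of partially overlapping tiles.
print(mbrs)
```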
The above indexing information, along with other auxiliary data (e.g., the byte offsets of tiles in the files on disk), is stored in the fragment metadata file of each fragment.
Consolidation
Fragment metadata
The TileDB format enables an important optimization, which is especially pronounced on cloud object stores. Since an array may consist of numerous fragments, a slicing query should be able to discard irrelevant fragments as quickly as possible. One way to do this is by checking the overlap of the slice with the non-empty domain of each fragment. This information is stored in the __fragment_metadata.tdb files in the fragment folders. If there are numerous fragments, there will be numerous REST requests to the object store to retrieve this information.
To solve this issue, TileDB enables consolidating the footers of the fragment metadata files into a single file with format __t1_t2_uuid_v.meta. The file contains the fragment URIs whose footers are included, along with the footers in serialized binary form. The footers contain only very lightweight information about the fragments, such as the non-empty domain and other light metadata. t1 is the first timestamp of the first fragment whose metadata is being consolidated, and t2 is the second timestamp of the last fragment. Upon opening an array, regardless of the number of fragments, TileDB can fetch this single small file into main memory with a single REST request. In other words, the TileDB format has mechanisms for making the communication protocol with the object store more lightweight.
Fragments
The concept of immutable fragments allows you to write concurrently to a single array, but also to time travel (i.e., see versions of the array in between updates). However, numerous fragments may take a toll on performance, as cell values across fragments lose spatial locality in the underlying files (leading to more expensive IO). To mitigate this problem, TileDB supports fragment consolidation, a process that merges a subset of fragments into a single one. The new fragment incorporates in its name the time range between the first and the last fragment it encompasses, e.g., the name may look like __t1_t2_uuid3_v, where t1 is the first timestamp of the first fragment being consolidated, and t2 is the second timestamp of the last fragment.
Along with the consolidated fragment, consolidation also produces a vacuum file __t1_t2_uuid3_v.vac, i.e., a file with the same name as the consolidated fragment plus the suffix .vac. This file contains the URIs of the fragments that were consolidated into fragment __t1_t2_uuid3_v, and it is stored in the __commits folder. The user can then vacuum them, i.e., permanently delete the consolidated fragments. The __commits/*.vac files are used in the vacuum process so that only consolidated fragments get deleted.
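A hedged TileDB-Py sketch for fragment consolidation follows (the URI is illustrative; sm.consolidation.mode is the standard configuration parameter, and the other modes such as "commits", "fragment_meta", and "array_meta" are selected with the same key):

```python
import tiledb

# Merge existing fragments into a single new fragment; this also produces
# a __commits/*.vac file listing the fragments that can later be vacuumed.
cfg = tiledb.Config({"sm.consolidation.mode": "fragments"})
tiledb.consolidate("my_array", config=cfg)
```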
Commits
The commit files can also be consolidated. Upon commit consolidation, TileDB produces a __t1_t2_uuid_v.con file in the __commits folder, which stores the URIs of the fragments whose commits were consolidated in this file. The consolidated commits can then be vacuumed, leaving a single commit file in the __commits folder. This leads to a significant boost when opening the array for reads in the presence of a large number of fragments.
Array metadata
Array metadata consolidation is very similar to fragment consolidation. An example is shown below. The produced __t1_t2_uuid3.vac file is used in vacuuming to delete the array metadata files that participated in the consolidation.
Vacuuming
The vacuuming process permanently deletes consolidated fragments, fragment metadata, commits and array metadata, in order to clean up space. TileDB separates consolidation from vacuuming, in order to make consolidation process-safe in the presence of concurrent reads and writes. On the other hand, vacuuming is not process-safe and the user should take extra care when invoking it.
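A corresponding hedged TileDB-Py sketch for vacuuming is shown below (the URI is illustrative; sm.vacuum.mode accepts the same mode names as consolidation):

```python
import tiledb

# Permanently delete the fragments that were previously consolidated.
cfg = tiledb.Config({"sm.vacuum.mode": "fragments"})
tiledb.vacuum("my_array", config=cfg)
```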
Fragment metadata
Fragment metadata consolidation can be run multiple times, resulting in multiple __fragment_meta/*.meta files. TileDB allows vacuuming those fragment metadata files by keeping the latest *.meta file and deleting the rest. An example is shown below.
Fragments
Vacuuming for fragments deletes the fragments that took part in a consolidation process. We distinguish two cases.
In the first case, suppose that no commits have been consolidated. Vacuuming will delete the two fragments that were consolidated from the __fragments folder, the two corresponding commit *.wrt files, and the *.vac file from the __commits folder. An example is shown below.
In the second case, suppose (for simplicity) that the commits have been consolidated and vacuumed, after writing two fragments and running consolidation on the fragments. Vacuuming will delete the two fragments that were consolidated from the __fragments folder, and the *.vac file from the __commits folder, as in the first case. But this time, it will also create a new file __commits/__t1_t2_uuid5_v.ign. This file indicates that the vacuumed fragment URIs should be ignored in the consolidated commit file __t1_t2_uuid4_v.con upon opening the array for reading.
Commits
Vacuuming commits deletes the commit files of the fragments that participated in a commit consolidation process (i.e., those included in __commits/*.con files). An example is shown below:
Array metadata
Vacuuming consolidated array metadata deletes the array metadata files that participated in the consolidation, plus the corresponding *.vac files. An example is shown below.
Tile filtering
TileDB allows the user to specify an ordered list of data transformations (such as compression, encryption, etc.) that can be applied to data tiles before they are written to disk. The user may define different filters for each attribute, the coordinates and the data tile offsets, all of which are specified in the array schema.
TileDB supports a number of filters, described below, and more will continue to be added in the future.
Compression filters
There are several filters performing generic compression, which are the following:
GZIP: Compresses with Gzip
ZSTD: Compresses with Zstandard
LZ4: Compresses with LZ4
RLE: Compresses with run-length encoding
BZIP2: Compresses with Bzip2
DOUBLE_DELTA: Compresses with double-delta encoding
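For example, a hedged TileDB-Py sketch attaching a compression filter list to an attribute (the filter choice, level, and attribute name are arbitrary):

```python
import numpy as np
import tiledb

# Apply Zstandard compression (level 5) to all data tiles of attribute "a".
filters = tiledb.FilterList([tiledb.ZstdFilter(level=5)])
attr = tiledb.Attr(name="a", dtype=np.float64, filters=filters)
```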
Byteshuffle
This filter performs byte shuffling of data as a way to improve compression ratios. The byte shuffle implementation used by TileDB comes from the Blosc project.
The byte shuffling process rearranges the bytes of the input attribute cell values in a deterministic and reversible manner designed to result in long runs of similar bytes that can be compressed more effectively by a generic compressor than the original unshuffled elements. Typically this filter is not used on its own, but rather immediately followed by a compression filter in a filter list.
For example, consider three 32-bit unsigned integer values 1, 2, 3, which have the following little-endian representation when stored adjacent in memory: 0x01 0x00 0x00 0x00 0x02 0x00 0x00 0x00 0x03 0x00 0x00 0x00.
The byte shuffle operation will rearrange the bytes of these integer elements in memory such that the resulting array of bytes will contain each element's first byte, followed by each element's second byte, and so on. After shuffling, the bytes would therefore be: 0x01 0x02 0x03 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00.
Observe the longer run of zero-valued bytes, which will compress more efficiently.
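The shuffle can be reproduced with plain NumPy; this is only an illustration of the transform, not the TileDB implementation:

```python
import numpy as np

values = np.array([1, 2, 3], dtype="<u4")          # three 32-bit little-endian integers
raw = values.view(np.uint8)                        # [1 0 0 0 2 0 0 0 3 0 0 0]
shuffled = raw.reshape(3, 4).T.ravel()             # [1 2 3 0 0 0 0 0 0 0 0 0]
print(raw, shuffled)
```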
Bitshuffle
This filter performs bit shuffling of data as a way to improve compression ratios. The bitshuffle implementation used in TileDB comes from https://github.com/kiyo-masui/bitshuffle.
Bitshuffling is conceptually very similar to byteshuffling, but operates on the bit granularity rather than the byte granularity. Shuffling at the bit level can increase compression ratios even further than the byteshuffle filter, at the cost of increased computation to perform the shuffle.
Typically this filter is not used on its own, but rather it is immediately followed by a compression filter in a filter list.
Positive-delta encoding
This filter performs positive-delta encoding. Positive-delta encoding is a form of delta encoding that only works when the delta value is positive. Positive-delta encoding can result in better compression ratios on the encoded data. Typically this filter is not used on its own, but rather immediately followed by a compression filter in a filter list.
For example, if the data being filtered was the sequence of integers 100, 104, 108, 112, ..., then the resulting positive-delta-encoded data would be 0, 4, 4, 4, .... This encoding is advantageous in that producing long runs of repeated values can result in better compression ratios, if a compression filter is added after positive-delta encoding.
The filter operates on a “window” of values at a time, which can help in some cases to produce longer runs of repeated delta values.
Positive-delta encoding is particularly useful for the offsets of variable-length attribute data, which by definition always have positive deltas. The above example of the form 100, 104, 108, 112 can easily arise in the offsets if, for example, you have a variable-length attribute of 4-byte values with mostly single values per cell instead of a variable number.
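A hedged TileDB-Py sketch that applies positive-delta encoding followed by compression to the offsets tiles of var-sized attribute data (the window size and filter choices are arbitrary):

```python
import tiledb

# Filters for the offsets of var-sized attribute data: positive-delta
# encoding followed by Zstandard compression.
offsets_filters = tiledb.FilterList(
    [tiledb.PositiveDeltaFilter(window=1000), tiledb.ZstdFilter()]
)

# The offsets filters are specified at the schema level, e.g.:
# schema = tiledb.ArraySchema(..., offsets_filters=offsets_filters)
```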
Bit width reduction
This filter performs bit-width reduction, which is a form of compression.
Bit-width reduction examines a window of attribute values, and determines if all of the values in the window can be represented by a datatype of smaller byte size. If so, the values in the window are rewritten as values of the smaller datatype, potentially saving several bytes per cell.
For example, consider an attribute with datatype uint64. Initially, each cell of data for that attribute requires 8 bytes of storage. However, if you know that the actual value of the attribute is often 255 or less, those cells can be stored using just a single byte in the form of a uint8, saving 7 bytes of storage per cell. The bit-width reduction filter performs this analysis and compression automatically over windows of attribute data.
Additionally, each cell value in a window is treated relative to the minimum value in that window. For example, if the window size was 3 cells, which had the values 300, 350, 400, the bit-width reduction filter would first determine that the minimum value in the window was 300, and the relative cell values were 0, 50, 100. These relative values are now less than 255 and can be represented by a uint8 value.
If possible, it can be a good idea to apply positive-delta encoding before bit-width reduction, as the positive-delta encoding may further increase the opportunities to store windows of data with a narrower datatype.
Bit-width reduction only works on integral datatypes.
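A hedged sketch combining positive-delta encoding with bit-width reduction on an integral attribute, as suggested above (the window sizes, attribute name, and trailing compressor are arbitrary choices for this example):

```python
import numpy as np
import tiledb

# Positive-delta encoding first, then bit-width reduction, then compression.
filters = tiledb.FilterList([
    tiledb.PositiveDeltaFilter(window=256),
    tiledb.BitWidthReductionFilter(window=256),
    tiledb.ZstdFilter(),
])
attr = tiledb.Attr(name="counts", dtype=np.uint64, filters=filters)
```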
Encryption and checksums
TileDB allows you to encrypt your arrays at rest. It currently supports a single type of encryption, AES-256 in the GCM mode, which is a symmetric, authenticated encryption algorithm. When creating, reading or writing arrays you must provide the same 256-bit encryption key. The authenticated nature of the encryption scheme means that a message authentication code (MAC) is stored together with the encrypted data, allowing verification that the persisted ciphertext was not modified.
TileDB also has checksum support for checking data integrity. It currently supports MD5 and SHA256.
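A hedged TileDB-Py sketch for working with an encrypted array and a checksum filter follows; the configuration keys shown are the standard encryption parameters, and the key value is a placeholder, not a real secret.

```python
import numpy as np
import tiledb

# AES-256-GCM encryption at rest: the same 256-bit (32-byte) key must be
# supplied when creating, writing, and reading the array.
cfg = tiledb.Config({
    "sm.encryption_type": "AES_256_GCM",
    "sm.encryption_key": "0123456789abcdeF0123456789abcdeF",  # placeholder 32-byte key
})
ctx = tiledb.Ctx(cfg)

# Optionally attach a SHA256 checksum filter for data integrity.
attr = tiledb.Attr(
    name="a",
    dtype=np.float64,
    filters=tiledb.FilterList([tiledb.ChecksumSHA256Filter()]),
    ctx=ctx,
)
```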
Crypto libraries used:
macOS and Linux: OpenSSL
Windows: Next generation cryptography (CNG)