TileDB supports fast and parallel subarray reads, with the option to time travel, i.e., to read the array at a particular time in the past. The read algorithm is architected to handle multiple fragments efficiently and completely transparently to the user. To read an array, TileDB first "opens" the array and brings some lightweight fragment metadata into main memory. Using this metadata, TileDB knows which fragments to ignore and which to focus on, e.g., based on whether their non-empty domain overlaps with the query subarray, or whether the fragment was created at or before the time of interest. Moreover, in case consolidation has occurred, TileDB is smart enough to ignore fragments that have been consolidated, by considering only the merged fragment that encompasses them.
When reading an array, the user provides:
a (single- or multi-range) subarray
the attributes to slice on (any subset of the attributes, including the coordinates)
the layout (with respect to the subarray) in which to return the result cells
The read algorithm is quite involved. It leverages spatial indexing to locate only the data tiles relevant to the slice, ensures that no data tile is fetched twice in the case of multi-range queries, performs selective decompression of tile chunks after a tile has been fetched from the backend, and employs parallelism pretty much everywhere (IO, decompression, sorting, etc.).
The figure below shows how to read the values of a single attribute from a dense array. The ideas extend to multi-attribute arrays and slicing on any subset of the attributes, including even retrieving the explicit coordinates of the results. The figure shows retrieving the results in 3 different layouts, all with respect to the subarray query. This means that you can ask TileDB to return the results in an order that is different than the actual physical order (which, recall, is always the global order), depending on the needs of your application.
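For illustration, here is a minimal TileDB-Py sketch of such a read, assuming a hypothetical 2D dense array `my_dense_array` with an int32 attribute `a`; the `order` flag controls the result layout with respect to the subarray.

```python
import tiledb

with tiledb.open("my_dense_array", mode="r") as A:
    # rows 1-2 and columns 2-4 (standard slicing uses exclusive upper bounds)
    row_major = A.query(attrs=("a",), order="C")[1:3, 2:5]["a"]
    col_major = A.query(attrs=("a",), order="F")[1:3, 2:5]["a"]
    global_order = A.query(attrs=("a",), order="G")[1:3, 2:5]["a"]
```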
You can also submit multi-range subarrays, as shown in the figure below. The supported orders here are row-major, column-major and unordered. The latter gives no guarantees about the order; TileDB will attempt to process the query in the fastest possible way and return the results in an arbitrary order. This layout is recommended if you target performance and do not care about the order of the results. You can also ask TileDB to return the explicit coordinates of the returned values if you wish to know which value corresponds to which cell.
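A multi-range read on the same hypothetical array might look as follows; `multi_index` uses inclusive ranges and accepts a list of points/ranges per dimension.

```python
import tiledb

with tiledb.open("my_dense_array", mode="r") as A:
    # rows {1} and {3..4} (two ranges), columns 1-4 (inclusive bounds)
    result = A.multi_index[[1, slice(3, 4)], 1:4]
    print(result["a"])
```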
Note that reading dense arrays always returns dense results. This means that, if your subarray overlaps with empty (non-materialized) cells in the dense array, TileDB will return default or user-defined fill values for those cells. The figure below shows an example.
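If the default fill value is not appropriate, an attribute-level fill value can be set at array creation time. A sketch follows; the array name, domain and the value -1 are placeholders.

```python
import numpy as np
import tiledb

dom = tiledb.Domain(tiledb.Dim(name="d", domain=(1, 8), tile=4, dtype=np.int64))
att = tiledb.Attr(name="a", dtype=np.int32, fill=-1)  # unwritten cells read back as -1
schema = tiledb.ArraySchema(domain=dom, attrs=[att], sparse=False)
tiledb.Array.create("array_with_custom_fill", schema)
```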
Recall that all cells in a dense fragment must have a value, which TileDB materializes on disk. This characteristic of dense fragments is important as it considerably simplifies spatial indexing, which becomes almost implicit. Consider the example in the figure below. Knowing the space tile extent along each dimension and the tile order, we can easily identify which space tiles intersect with a subarray query without maintaining any complicated index. Then, using lightweight bookkeeping (such as offsets of data tiles on disk, compressed data tile size, etc.), TileDB can fetch the tiles containing results from storage to main memory. Finally, knowing the cell order, it can locate each slab of contiguous cell results in constant time (again without extra indexing) and minimize the number of memory copy operations.
Note that the above ideas apply also to dense fragments that populate only a subset of the array domain; knowing the non-empty domain, TileDB can use similar arithmetic calculations to locate the overlapping tiles and cell results.
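The arithmetic is simple enough to sketch in a few lines of Python; the helper below is purely illustrative (not part of any TileDB API) and maps a subarray to the range of space-tile indices it intersects along each dimension.

```python
def intersecting_space_tiles(subarray, tile_extents, domain_start):
    """Inclusive ranges of space-tile indices intersected by `subarray`, per dimension."""
    return [((lo - start) // ext, (hi - start) // ext)
            for (lo, hi), ext, start in zip(subarray, tile_extents, domain_start)]

# a 4x4 array with domain starting at 1 and 2x2 space tiles:
print(intersecting_space_tiles([(2, 3), (1, 2)], [2, 2], [1, 1]))
# -> [(0, 1), (0, 0)]: the subarray crosses both tile rows but only the first tile column
```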
The figure below shows an example subarray query on a sparse array with a single attribute, where the query requests also the coordinates of the result cells. Similar to the case of dense arrays, the user can request the results in layouts that may be different from the physical layout of the cells in the array (global order).
Sparse arrays accept multi-range subarray queries as well. Similar to the dense case, global order is not applicable here, but instead an unordered layout is supported that returns the results in an arbitrary order (again, TileDB will try its best to return the results as fast as possible in this read mode).
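A sparse multi-range read in TileDB-Py returns the coordinates along with the attribute values; the array and dimension names below (`my_sparse_array`, `rows`, `cols`) are hypothetical.

```python
import tiledb

with tiledb.open("my_sparse_array", mode="r") as A:
    result = A.multi_index[[1, 3], 1:4]   # rows {1, 3}, columns 1-4 (inclusive)
    print(result["rows"], result["cols"], result["a"])   # coordinates plus attribute values
```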
A sparse fragment differs from a dense fragment in the following aspects:
A sparse fragment stores only non-empty cells that might appear in any position in the domain (i.e., they may not be concentrated in dense hyper-rectangles)
In sparse fragments there is no correspondence between space and data tiles. The data tiles are created by first sorting the cells in the global order, and then grouping adjacent cell values based on the tile capacity.
There is no way to know a priori the position of the non-empty cells, unless we maintain extra indexing information.
A sparse fragment materializes the coordinates of the non-empty cells in data tiles.
TileDB indexes sparse non-empty cells with R-Trees. Specifically, for every coordinate data tile it constructs the minimum bounding rectangle (MBR) using the coordinates in the tile. Then, it uses the MBRs of the data tiles as leaves and constructs an R-Tree bottom up by recursively grouping MBRs into larger MBRs using a fanout parameter. The figure below shows an example of a sparse fragment and its corresponding R-Tree.
Given a subarray query, the R-Tree (which is small enough to fit in main memory) is used to identify the intersecting data tile MBRs. Then, the qualifying coordinate data tiles are fetched and the materialized coordinates therein are used to determine the actual results.
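The MBR computation and bottom-up grouping can be sketched as follows; this is a conceptual illustration with NumPy, not TileDB's actual implementation.

```python
import numpy as np

def leaf_mbrs(coords, capacity):
    """MBR (per-dimension min/max) of each coordinate data tile; `coords` is an
    (n, d) array of non-empty cell coordinates already sorted in global order."""
    return [(coords[i:i + capacity].min(axis=0), coords[i:i + capacity].max(axis=0))
            for i in range(0, len(coords), capacity)]

def next_level(mbrs, fanout):
    """Group `fanout` MBRs at a time into larger MBRs (one R-Tree level up)."""
    return [(np.min([lo for lo, _ in mbrs[i:i + fanout]], axis=0),
             np.max([hi for _, hi in mbrs[i:i + fanout]], axis=0))
            for i in range(0, len(mbrs), fanout)]

coords = np.array([[1, 1], [1, 3], [2, 2], [4, 4]])
print(next_level(leaf_mbrs(coords, capacity=2), fanout=2))
```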
Recall that writing to TileDB arrays produces a number of timestamped fragments. TileDB supports reading an array at an arbitrary instance in time, by providing a timestamp upon opening the array for reading. Any fragment created after that timestamp will be ignored and the read will produce results as if only the fragments created at or before the given timestamp existed in the array. Time traveling applies to both dense and sparse arrays. The figure below shows an example of a dense array with 3 fragments, along with the results of a subarray depending on the timestamp the array gets opened with.
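In TileDB-Py, time traveling is a matter of passing a timestamp when opening the array. A sketch follows; the URI and timestamp are placeholders, and the timestamp is expressed in milliseconds since the UNIX epoch.

```python
import tiledb

# read the array as it was at (or before) the given timestamp;
# fragments created after that point are ignored
with tiledb.open("my_dense_array", mode="r", timestamp=1577836800000) as A:
    past_view = A[:]
```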
In the case of consolidation, time traveling works as follows:
If the user opens the array at a timestamp that is larger than or equal to the second timestamp of a fragment name, then that fragment will be considered in the read.
If the user opens the array at a timestamp that is smaller than the second timestamp of a fragment, then that fragment will be ignored.
If a fragment that qualifies for reading has been consolidated into another fragment that is considered for reading, then it will be ignored.
There are situations where the memory allocated by the user to hold the result is not enough for a given query. Instead of erroring out, TileDB gracefully handles these cases by attempting to serve a portion of the query and reporting back with an "incomplete" query status. The user should then consume the returned result and resubmit the query, until the query returns a "complete" status (see the sketch below). TileDB maintains all the necessary internal state inside the query object.
But what portion of the query is served in each iteration? TileDB implements the incomplete query functionality via result estimation and subarray partitioning. Specifically, if TileDB assesses (via estimation heuristics) that the query subarray leads to a larger result size than the allocated buffers can hold, it splits (i.e., partitions) the subarray appropriately, such that a smaller subarray (single- or multi-range) can be served. The challenge is in partitioning the subarray in a way that the result cell order (defined by the user) is respected across the incomplete query iterations. TileDB efficiently and correctly performs this partitioning process transparently to the user.
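A sketch of the resubmission loop using TileDB-Py, assuming a recent version that exposes `return_incomplete` (the array name and buffer budget are placeholders); each iteration yields the portion of the result that fit in the allocated buffers.

```python
import tiledb

cfg = tiledb.Config({"py.init_buffer_bytes": str(4 * 1024 ** 2)})  # deliberately small buffers
with tiledb.open("my_sparse_array", mode="r", ctx=tiledb.Ctx(cfg)) as A:
    # the query is transparently resubmitted until its status is "complete"
    for partial in A.query(return_incomplete=True).df[:]:
        print(len(partial))  # consume each partial result
```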
TileDB caches data and metadata upon read queries. More specifically, it caches:
Fragment metadata for those fragments that are relevant to a submitted query. There are no restrictions on how large that cache space is. The user can flush the fragment metadata cache by simply closing all open instances of an array.
Data tiles that overlap a subarray (across fragments). This cache space is configurable (see Configuration Parameters). The data tiles are currently cached in their raw "disk" form (i.e., with all potential filters applied as they are stored in the data files).
The use of caching can be quite beneficial especially if the data resides on cloud object stores like AWS S3.
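The cache budget is set through the configuration; a sketch with an illustrative value (the exact parameter name, assumed here to be `sm.tile_cache_size`, may vary across TileDB versions).

```python
import tiledb

cfg = tiledb.Config({"sm.tile_cache_size": str(100 * 1024 ** 2)})  # ~100MB tile cache
ctx = tiledb.Ctx(cfg)
with tiledb.open("my_dense_array", mode="r", ctx=ctx) as A:
    data = A[:]   # repeated reads of the same tiles can now be served from the cache
```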
TileDB implements additional optimizations that improve decompression times and the overall memory footprint of a read query. Recall that each data tile is further decomposed into chunks. After fetching a data tile that contains result candidates from storage, the TileDB read algorithm knows exactly which chunks of the tile are relevant to the query and decompresses (unfilters) only those chunks.
TileDB is architected to support parallel batch writes, i.e., writing collections of cells with multiple processes or threads. Each write operation creates one or more dense or sparse fragments. Updating an array is equivalent to initiating a new write operation, which could either insert cells in unpopulated areas of the domain or overwrite existing cells (or a combination of the two). TileDB handles each write separately and without any locking. Each fragment is immutable, i.e., write operations always create new fragments, without altering any other fragment.
A dense write is applicable to dense arrays and creates one or more dense fragments. In a dense write, the user provides:
The subarray to write into (it must be single-range).
The buffers that contain the attribute values of the cells that are being written.
The cell order within the subarray (which must be common across all attributes), so that TileDB knows which values correspond to which cells in the array domain. The cell order may be row-major, column-major, or global.
The example below illustrates writing into a subarray of an array with a single attribute. The figure depicts the order of the attribute values in the user buffers for the case of row- and column-major cell order. TileDB knows how to appropriately re-organize the user-provided values so that they obey the global cell order before storing them to disk. Moreover, note that TileDB always writes integral space tiles to disk. Therefore, it will inject special empty values (depicted in grey below) into the user data to create full data tiles for each space tile.
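A minimal dense-write sketch in TileDB-Py (array name, subarray and values are placeholders); the row-major layout of the NumPy block corresponds to the row-major cell order described above.

```python
import numpy as np
import tiledb

with tiledb.open("my_dense_array", mode="w") as A:
    # write a 2x3 block into rows 1-2, columns 2-4 (exclusive upper bounds in standard slicing)
    A[1:3, 2:5] = np.array([[1, 2, 3], [4, 5, 6]], dtype=np.int32)
```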
Writing in the array global order needs a little bit more care. The subarray must be specified such that it coincides with space tile boundaries, even if the user wishes to write in a smaller area within that subarray. The user is responsible for manually adding any necessary empty cell values in her buffers. This is illustrated in the figure below, where the user wishes to write in the blue cells, but has to expand the subarray to coincide with the two space tiles and provide the empty values for the grey cells as well. The user must provide all cell values in the global order, i.e., following the tile order of the space tiles and the cell order within each space tile.
Writing in global order requires knowledge of the space tiling and cell/tile order, and is rather cumbersome to use. However, this write mode leads to the best performance, because TileDB does not need to internally re-organize the cells along the global order. It is recommended for use cases where the data arrive already grouped according to the space tiling and global order (e.g., in geospatial applications).
TileDB uses the following default fill values for empty cells in dense writes, noting that the user can specify any other fill value upon array creation:

Datatype | Default fill value |
---|---|
`TILEDB_CHAR` and signed integer types (`TILEDB_INT8/16/32/64`) | Minimum value of the datatype |
Unsigned integer types (`TILEDB_UINT8/16/32/64`) | Maximum value of the datatype |
`TILEDB_FLOAT32`, `TILEDB_FLOAT64` | NaN |
String types (e.g., `TILEDB_STRING_ASCII`, `TILEDB_STRING_UTF8`) | Empty value |
Datetime types (`TILEDB_DATETIME_*`) | Minimum int64 value |

In case a fixed-sized attribute stores more than one value per cell, all the cell values are assigned the corresponding default fill value shown above.
Sparse writes are applicable to sparse arrays and create one or more sparse fragments. The user must provide:
The attribute values to be written.
The coordinates of the cells to be written.
The cell layout of the attribute and coordinate values to be written (must be the same across attributes and dimensions). The cell layout may be unordered or global.
Note that sparse writes do not need to be constrained in a subarray, since they contain the explicit coordinates of the cells to write into. The figure below shows a sparse write example with the two cell orders. The unordered layout is the easiest and most typical; TileDB knows how to appropriately re-organize the cells along the global order internally before writing the values to disk, as sketched below. The global layout is once again more efficient but also more cumbersome, since the user must know the space tiling and the tile/cell order of the array, and manually sort the values before providing them to TileDB.
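A minimal unordered sparse-write sketch in TileDB-Py (array, dimension names and values are placeholders):

```python
import numpy as np
import tiledb

with tiledb.open("my_sparse_array", mode="w") as A:
    rows = np.array([1, 2, 4], dtype=np.int64)
    cols = np.array([1, 3, 2], dtype=np.int64)
    # unordered write: TileDB sorts the cells into global order internally
    A[rows, cols] = np.array([10, 20, 30], dtype=np.int32)
```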
In TileDB, reads, writes, consolidation and vacuuming are all atomic and will never lead to array corruption.
A read operation is the process of (i) creating a read query object and (ii) submitting the query (potentially multiple times in the case of incomplete queries) until the query is completed. Each such read operation is atomic and can never corrupt the state of an array.
A write operation is the process of (i) creating a write query object, (ii) submitting the query (potentially multiple times in the case of global writes) and (iii) finalizing the query object (important only in global writes). Each such write operation is atomic, i.e., a set of functions (which depends on the API) that must be treated atomically by each thread. For example, multiple threads should not submit the query for the same query object. Instead, you can have multiple threads create separate query objects for the same array (even sharing the same context or array object), and prepare and submit them in parallel with each thread.
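For example, the following sketch issues two concurrent writes from separate threads, each with its own array handle and query, so each produces its own fragment (the array name and data are placeholders).

```python
from concurrent.futures import ThreadPoolExecutor
import numpy as np
import tiledb

def write_rows(uri, first_row):
    # each thread opens its own handle; the write creates a new, independent fragment
    with tiledb.open(uri, mode="w") as A:
        A[first_row:first_row + 2, 1:5] = np.full((2, 4), first_row, dtype=np.int32)

with ThreadPoolExecutor(max_workers=2) as pool:
    for r in (1, 3):
        pool.submit(write_rows, "my_dense_array", r)
```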
A write operation either succeeds and creates a fragment that is visible to future reads, or it fails and any folder and file relevant to the failed fragment is entirely ignored by future reads. A fragment creation is successful if a file `<fragment_name>.ok` appears in the array folder for the created fragment `<fragment_name>`. There will never be the case that a fragment is partially written and still accessible by the reader. The user just needs to eventually delete the partially written folder (i.e., a fragment folder without an associated `.ok` file) to save space. Furthermore, each fragment is immutable, so there is no way for a write operation to corrupt another fragment created by another operation.
Consolidation entails a read and a write and, therefore, it is atomic in the same sense as for writing. There is no way for consolidation to lead to a corrupted array state.
Vacuuming simply deletes fragment folders and array/fragment metadata files. Vacuuming always deletes the `.ok` files before proceeding to erase the corresponding folders. It is atomic in the sense that it cannot lead to array corruption, and if the vacuuming process is interrupted, it can be restarted without issues.
The presence of numerous fragments may impact the TileDB read performance. This is because many fragments would lead to fragment metadata being loaded to main memory from numerous different files in storage. Moreover, the locality of the result cells of a subarray query may be destroyed in case those cells appear in multiple different fragment files, instead of concentrated byte regions within the same fragment files.
To mitigate this problem, TileDB has a consolidation feature, which allows you to merge
Lightweight fragment metadata footers into a single file.
A subset of fragments into a single fragment.
A subset of array metadata files into a single one.
Consolidation is thread-/process-safe and can be done in the background while you continue reading from the array without being blocked. Moreover, consolidation does not hinder the ability to do time traveling at a fine granularity, as it does not delete fragments that participated in consolidation (and, therefore, they are still queryable). The user is responsible for vacuuming fragments, fragment metadata and array metadata that got consolidated to save space, at the cost of not being able to time travel across the old (finer) fragments.
Each fragment metadata file (located in a fragment folder) contains some lightweight information in its footer. This is mostly the non-empty domain and offsets for other metadata included in other parts of the file. If there are numerous fragments, reading the array may be slow on cloud object stores due to the numerous REST requests to fetch the fragment metadata footers. TileDB offers a consolidation process (with mode `fragment_meta`), which merges the fragment metadata footers of a subset of fragments into a single file with suffix `.meta`, stored in the array folder. This file is named similarly to fragments, i.e., it carries a timestamp range that helps with time traveling. It also contains all the URIs of the fragments whose metadata footers are consolidated in that file. Upon reading an array, only this file is efficiently fetched from the backend, since it is typically very small in size (even for hundreds of thousands of fragments).
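Fragment metadata consolidation can be triggered as follows (a sketch; the array URI is a placeholder):

```python
import tiledb

cfg = tiledb.Config({"sm.consolidation.mode": "fragment_meta"})
tiledb.consolidate("my_dense_array", config=cfg)   # produces a .meta file in the array folder
```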
If mode `fragments` is passed to the consolidation function, the fragment consolidation algorithm is executed, which is explained in detail below.
There are two important points to stress regarding fragment consolidation:
Consolidating dense fragments produces a dense fragment, and may induce fill values.
Consolidating fragments where all fragments are sparse produces a sparse fragment.
The figure below shows consolidation of two dense fragments, the first containing only full tiles, and the second containing two tiles with a single cell written to each. Note that this can occur only in dense arrays, since sparse arrays can have only sparse fragments. The array in the figure has a 2x2 space tiling. Recall that a dense fragment consists of a dense hyper-rectangle and that it stores only integral tiles. Due to the partial cell in the second fragment that is located in the lower left space tile, the dense hyper-rectangle of the produced consolidated dense fragment must cover all four space tiles. Therefore, TileDB must fill the empty cells in this hyper-rectangle with empty values, illustrated in grey color in the figure below.
Consolidating only sparse fragments is simpler. The figure below illustrates consolidation of two sparse fragments, where the resulting consolidated fragment is also sparse and there is no injection of empty values.
Before the consolidation algorithm begins, TileDB applies a simple optimization in a pre-processing step, which may lead to great performance benefits depending on the “shape” of the existing fragments. Specifically, TileDB identifies dense fragments whose non-empty domain completely covers older adjacent (dense or sparse) fragments, and directly deletes the old fragment directories without performing any actual consolidation.
This clean-up process is illustrated with an example in the figure below. Suppose the first fragment is dense and covers the entire array, i.e., `[1,4], [1,4]`, the second is dense and covers `[1,2], [1,2]`, the third is sparse as shown in the figure, and the fourth one is dense covering `[1,2], [1,4]`. Observe that, if those four fragments were to be consolidated, the cells of the second and third fragments would be completely overwritten by the cells of the fourth fragment. Therefore, the existence of those two fragments makes no difference to the consolidation result. Deleting them altogether before the consolidation algorithm commences boosts the algorithm's performance (since fewer cells will be read and checked for overwrites).
The consolidation algorithm is performed in steps. In each step, a subset of adjacent (in the timeline) fragments is selected for consolidation. The algorithm proceeds until a configured number of steps has been executed, or until no further fragments are eligible for consolidation. The choice of the next fragment subset for consolidation is based on certain rules and user-defined parameters, explained below. The number of steps is controlled by `sm.consolidation.steps`.
Let us focus on a single step, during which the algorithm must select and consolidate a subset of fragments based on certain criteria:
The first criterion is whether a subset of fragments is "consolidatable", i.e., eligible for consolidation in a way that does not violate correctness. Any subset consisting solely of sparse fragments is always consolidatable. However, if a fragment subset contains one or more dense fragments, TileDB performs an important check: if the union of the non-empty domains of the fragments (which is equal to the non-empty domain of the resulting consolidated fragment) overlaps with any fragment created prior to this subset, then the subset is marked as non-consolidatable. Recall that the fragment that results from consolidating a subset of fragments containing at least one dense fragment is always a dense fragment. Therefore, empty regions in the non-empty domain of the consolidated fragment will be filled with special values. Those values may erroneously overwrite older valid cell values. Such a scenario is illustrated in the figure below. The second and third fragments are not consolidatable, since their non-empty domain contains empty regions that overlap with the first (older) fragment. Consequently, consolidating the second and third fragments results in a logical view that is not identical to the one before consolidation, violating correctness. This criterion detects and prevents such cases.
The second criterion is the comparative fragment size. Ideally, we must consolidate fragments of approximately equal size. Otherwise, we may end up in a situation where, for example, a 100GB fragment gets consolidated with a 1MB one, which would unnecessarily waste consolidation time. This is controlled by parameter `sm.consolidation.step_size_ratio`; if the size ratio of two adjacent fragments is smaller than this parameter, then no fragment subset that contains those two fragments will be considered for consolidation.
The third criterion is the fragment amplification factor, applicable to the case where the fragment subset to be consolidated contains at least one dense fragment. If the non-empty domain of the resulting fragment has too many empty cells, its size may become considerably larger than the sum of sizes of the original fragments to be consolidated. This is because the consolidated fragment is dense and inserts special fill values for all empty cells in its non-empty domain (see figure below). The amplification factor is the ratio between the consolidated fragment size and the sum of sizes of the original fragments. This is controlled by `sm.consolidation.amplification`, which should not be exceeded for a fragment subset to be eligible for consolidation. The default value `1.0` means that the fragments will be consolidated only if there is no amplification at all, i.e., if the size of the resulting consolidated fragment is smaller than or equal to the sum of sizes of the original fragments. As an example, this happens when the non-empty domain of the consolidated fragment does not contain any empty cells.
The fourth criterion is the collective fragment size. Among all eligible fragment subsets for consolidation, we must first select to consolidate the ones that have the smallest sum of fragment sizes. This will quickly reduce the number of fragments (hence boosting read performance), without resorting to costly consolidation of larger fragments.
The final criterion is the number of fragments to consolidate in each step. This is controlled by `sm.consolidation.step_min_frags` and `sm.consolidation.step_max_frags`; the algorithm will select the subset of fragments (complying with all the above criteria) that has the maximum cardinality smaller than or equal to `sm.consolidation.step_max_frags` and larger than or equal to `sm.consolidation.step_min_frags`. If no fragment subset is eligible with cardinality at least `sm.consolidation.step_min_frags`, then the consolidation algorithm terminates.
The algorithm is based on dynamic programming and runs in time `O(max_frags * total_frags)`, where `total_frags` is the total number of fragments considered in a given step, and `max_frags` is equal to the `sm.consolidation.step_max_frags` config parameter.
When computing the union of the non-empty domains of the fragments to be consolidated, in case there is at least one dense fragment, the union is always expanded to coincide with the space tile extents. This affects criterion 1 (since the expanded domain union may now overlap with some older fragments) and criterion 3 (since the expanded union may amplify the resulting consolidated fragment size). The configuration parameters governing these criteria are summarized in the sketch below.
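A sketch of setting the consolidation parameters on the consolidation call, with illustrative values:

```python
import tiledb

cfg = tiledb.Config({
    "sm.consolidation.mode": "fragments",
    "sm.consolidation.steps": "3",              # maximum number of consolidation steps
    "sm.consolidation.step_min_frags": "2",     # minimum subset cardinality per step
    "sm.consolidation.step_max_frags": "8",     # maximum subset cardinality per step
    "sm.consolidation.step_size_ratio": "0.5",  # adjacent fragments must be comparable in size
    "sm.consolidation.amplification": "1.0",    # reject subsets that would amplify the result
})
tiledb.consolidate("my_dense_array", config=cfg)
```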
Vacuuming applies to consolidated fragments, consolidated array metadata and consolidated fragment metadata as follows:
Fragments: During consolidation, a `.vac` file is produced with all the fragment URIs that participated in consolidation. When the vacuuming function is called with mode `fragments`, all the fragment folders whose URI is in the `.vac` file get deleted.
Array metadata: During consolidation, a `.vac` file is produced with all the array metadata URIs that participated in consolidation. When the vacuuming function is called with mode `array_meta`, all the array metadata files whose URI is in the `.vac` file get deleted.
Fragment metadata: Vacuuming simply deletes all `.meta` files except for the last one.
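Each vacuuming mode can be invoked separately; a sketch (the array URI is a placeholder):

```python
import tiledb

for mode in ("fragments", "fragment_meta", "array_meta"):
    tiledb.vacuum("my_dense_array", config=tiledb.Config({"sm.vacuum.mode": mode}))
```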
Recall that each fragment is associated with its creation timestamp upon writing. A consolidated fragment instead is associated with the timestamp range that includes the timestamps of the fragments that produced it. This is particularly important for time traveling, since opening an array at a timestamp will consider all the consolidated fragments whose end timestamp is at or before the query timestamp. In other words, although consolidation generally leads to better performance, it affects the granularity of time traveling.
Similar to array fragments, array metadata can also be consolidated (with mode `array_meta`). Since the array metadata is typically small and fits in main memory, consolidating it is rather simple. TileDB simply reads all the array metadata (from all the existing array metadata files) into main memory, creates an up-to-date view of the metadata, and then flushes it to a new array metadata file that carries in its name the timestamp range determined by the first timestamp of the first and the second timestamp of the last array metadata files that got consolidated.
TileDB is fully parallelized internally, i.e., it uses multiple threads to process in parallel the most heavyweight tasks.
We explain how TileDB parallelizes the read and write queries, outlining the configuration parameters that you can use to control the amount of parallelization. Note that here we cover only the most important areas, as TileDB parallelizes numerous other internal tasks. See Configuration Parameters and Configuration for a summary of the parameters and the way to set them respectively.
A read query mainly involves the following steps in this order:
Identifying the physical attribute data tiles that are relevant to the query (pruning the rest)
Performing parallel IO to retrieve those tiles from the storage backend.
Unfiltering the data tiles in parallel to get the raw cell values and coordinates.
Performing a refining step to get the actual results and organize them in the query layout.
TileDB parallelizes all steps, but here we discuss mainly steps (2) and (3) that are the most heavyweight.
TileDB reads the relevant tiles from all attributes to be read in parallel as follows:
TileDB computes the byte ranges required to be fetched from each attribute file. Those byte ranges might be disconnected and could be numerous, especially in the case of multi-range subarrays. In order to reduce the latency of the IO requests (especially on S3), TileDB attempts to merge byte ranges that are close to each other and dispatch fewer larger IO requests instead of numerous smaller ones. More specifically, TileDB merges two byte ranges if their gap size is not bigger than `vfs.min_batch_gap` and their resulting size is not bigger than `vfs.min_batch_size`. Then, each byte range (always corresponding to the same attribute file) becomes an IO task. These IO tasks are dispatched for concurrent execution, where the maximum level of concurrency is controlled by the `sm.io_concurrency_level` parameter.
TileDB may further partition each byte range to be fetched based on parameters `vfs.file.max_parallel_ops` (for POSIX and Windows), `vfs.s3.max_parallel_ops` (for S3) and `vfs.min_parallel_size`. Those partitions are then read in parallel. Currently, the maximum parallel operations for HDFS is set to 1, i.e., this task parallelization step does not apply to HDFS.
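These knobs are plain configuration parameters; a sketch with illustrative values (the defaults differ):

```python
import tiledb

cfg = tiledb.Config({
    "vfs.min_batch_gap": str(512 * 1024),        # merge byte ranges separated by < 512KB
    "vfs.min_batch_size": str(20 * 1024 ** 2),   # cap on the size of a merged IO request
    "sm.io_concurrency_level": "8",              # maximum concurrent IO tasks
})
with tiledb.open("my_dense_array", mode="r", ctx=tiledb.Ctx(cfg)) as A:
    data = A[:]
```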
Once the relevant data tiles are in main memory, TileDB "unfilters" them (i.e., runs the filters applied during writes in reverse) in parallel in a nested manner as follows:
The “chunks” of a tile are controlled by a TileDB filter list parameter that defaults to 64KB.
The `sm.compute_concurrency_level` parameter impacts these nested parallel loops (sketched below), although it is not recommended to modify this configuration parameter from its default setting. The nested parallelism in reads allows for maximum utilization of the available cores for filtering (e.g., decompression), whether the query intersects a few large tiles or many small tiles.
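Conceptually, the nested parallelism looks like the following sketch (pure Python, with zlib standing in for the attribute's filter pipeline; this is an illustration, not TileDB's implementation):

```python
import zlib
from concurrent.futures import ThreadPoolExecutor

def unfilter_tile(chunks):
    # inner parallel-for: reverse the filters on each ~64KB chunk of one data tile
    with ThreadPoolExecutor() as pool:
        return b"".join(pool.map(zlib.decompress, chunks))

def unfilter_tiles(tiles):
    # outer parallel-for: one task per relevant data tile
    with ThreadPoolExecutor() as pool:
        return list(pool.map(unfilter_tile, tiles))

tiles = [[zlib.compress(b"x" * 65536)] * 4 for _ in range(2)]
print(len(unfilter_tiles(tiles)))   # -> 2 unfiltered tiles
```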
A write query mainly involves the following steps in this order:
Re-organizing the cells in the global cell order and into attribute data tiles.
Filtering the attribute data tiles to be written.
Performing parallel IO to write those tiles to the storage backend.
TileDB parallelizes all steps, but here we discuss mainly steps (2) and (3) that are the most heavyweight.
For writes TileDB uses a similar strategy as for reads:
Similar to reads, the `sm.compute_concurrency_level` parameter impacts the nested parallel filtering loops, although it is not recommended to modify this configuration parameter from its default setting.
Similar to reads, IO tasks are created for each tile of every attribute. These IO tasks are dispatched for concurrent execution, where the maximum level of concurrency is controlled by the `sm.io_concurrency_level` parameter. For HDFS, this is the only parallelization TileDB provides for writes. For the other backends, TileDB parallelizes the writes further.
For POSIX and Windows, if a data tile is large enough, the VFS layer partitions the tile based on configuration parameters `vfs.file.max_parallel_ops` and `vfs.min_parallel_size`. Those partitions are then written in parallel using the VFS thread pool, whose size is controlled by `vfs.io_concurrency`.
For S3, TileDB buffers potentially several tiles and issues parallel multipart upload requests to S3. The size of the buffer is equal to `vfs.s3.max_parallel_ops * vfs.s3.multipart_part_size`. When the buffer is filled, TileDB issues `vfs.s3.max_parallel_ops` parallel multipart upload requests to S3.
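For example, with the (illustrative) settings below, TileDB buffers 4 x 5MB = 20MB before issuing 4 parallel multipart upload requests:

```python
import tiledb

cfg = tiledb.Config({
    "vfs.s3.max_parallel_ops": "4",
    "vfs.s3.multipart_part_size": str(5 * 1024 ** 2),   # 5MB parts (S3's minimum part size)
})
ctx = tiledb.Ctx(cfg)   # use this context for the write queries
```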
TileDB allows you to encrypt your arrays at rest. It currently supports a single type of encryption, AES-256 in the GCM mode, which is a symmetric, authenticated encryption algorithm. When creating, reading or writing arrays you must provide the same 256-bit encryption key. The authenticated nature of the encryption scheme means that a message authentication code (MAC) is stored together with the encrypted data, allowing verification that the persisted ciphertext was not modified.
Encryption libraries used:
macOS and Linux: OpenSSL
Windows: Next generation cryptography (CNG)
By default, TileDB caches array data and metadata in main memory after opening and reading from arrays. These caches will store decrypted (plaintext) array data in the case of encrypted arrays. For a bit of extra in-flight security (at the cost of performance), you can disable the TileDB caches (see Configuration Parameters and Configuration).
TileDB never persists the encryption key, but TileDB does store a copy of the encryption key in main memory while an encrypted array is open. When the array is closed, TileDB will zero out the memory used to store its copy of the key, and free the associated memory.
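A sketch of reading an encrypted array by passing the key through the configuration; the key string is a placeholder, and the exact mechanism (config parameters `sm.encryption_type`/`sm.encryption_key` vs. a dedicated key argument) depends on the TileDB version.

```python
import tiledb

cfg = tiledb.Config({
    "sm.encryption_type": "AES_256_GCM",
    "sm.encryption_key": "0123456789abcdeF0123456789abcdeF",   # 32-byte key, placeholder only
})
ctx = tiledb.Ctx(cfg)
with tiledb.open("my_encrypted_array", mode="r", ctx=ctx) as A:
    data = A[:]
```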
Due to the extra processing required to encrypt and decrypt array metadata and attribute data, you may experience lower performance on opening, reading and writing for encrypted arrays.
To mitigate this, TileDB internally parallelizes encryption and decryption using a chunking strategy. Additionally, when compression or other filtering is configured on array metadata or attribute data, encryption occurs last, meaning that the compressed (or otherwise filtered) data is what gets encrypted.
Finally, newer generations of some Intel and AMD processors offer instructions for hardware acceleration of encryption and decryption. The encryption libraries that TileDB employs are configured to use hardware acceleration if it is available.
In addition to using parallelism internally, TileDB is designed having parallel programming in mind. Specifically, scientific computing users may be accustomed to using multi-processing (e.g., via MPI or Dask or Spark), or writing multi-threaded programs to speed up performance. TileDB enables concurrency using a multiple writer / multiple reader model that is entirely lock-free.
Concurrent writes are achieved by having each thread or process create one or more separate fragments for each write operation. No synchronization is needed across processes and no internal state is shared across threads among the write operations and, thus, no locking is necessary. Regarding the concurrent creation of the fragments, thread- and process-safety is achieved because each thread/process creates a fragment with a unique name (as it incorporates a UUID). Therefore, there are no conflicts even at the storage backend level.
TileDB supports lock-free concurrent writes of array metadata as well. Each write creates a separate array metadata file with a unique name (also incorporating a UUID), and thus name collisions are prevented.
When opening the array, TileDB loads the array schema and fragment metadata into main memory once, and shares them across all array objects referring to the same array. Therefore, for the multi-threading case, it is highly recommended that you open the array once outside the atomic block and have all threads create their queries on the same array object. This prevents the scenario where a thread opens the array, then closes it before another thread opens the array again, and so on. TileDB internally employs a reference-count system, discarding the array schema and fragment metadata each time the array is closed and the reference count reaches zero (the schema and metadata are typically cached, but they still need to be deserialized in the above scenario). Having all concurrent queries use the same array object eliminates the above problem.
Reads in the multi-processing setting are completely independent and no locking is required. In the multi-threading scenario, locking is employed (through mutexes) only when the queries access the tile cache, which incurs a very small overhead.
Concurrent reads and writes can be arbitrarily mixed. Fragments are not visible unless the write query has been completed (and the `.ok` file appeared). Fragment-based writes make it so that reads simply see the logical view of the array without the new (incomplete) fragment. This multiple writers / multiple readers concurrency model of TileDB is more powerful than competing approaches, such as HDF5's single writer / multiple readers (SWMR) model. This feature comes with a more relaxed consistency model, which is described in the Consistency section.
Consolidation can be performed in the background in parallel with and independently of other reads and writes. The new fragment that is being created is not visible to reads before consolidation is completed.
Vacuuming deletes fragments that have been consolidated. Although it can never lead to a corrupted array state, it may lead to issues if there is a read operation that accesses a fragment that is being vacuumed. This is possible when the array is opened at a timestamp before some consolidation operation took place, therefore considering the fragment to be vacuumed. Most likely, that will lead to a segfault or some unexpected behavior.
TileDB locks the array upon vacuuming to prevent the above. This is achieved via mutexes in multi-threading, and file locking in multi-processing (for those storage backends that support it).
All POSIX-compliant filesystems and Windows filesystems support file locking. Note that Lustre supports POSIX file locking semantics and exposes local-level (mount with `-o localflock`) and cluster-level (mount with `-o flock`) locking. Currently, TileDB does not use file locking on HDFS and S3 (these storage backends do not provide such functionality; resource locking must instead be implemented as an external feature). For filesystems that do not support file locking, the multi-processing programs are responsible for synchronizing the concurrent writes.
Particular care must be taken when vacuuming arrays on AWS S3 and HDFS. Without file locking, TileDB has no way to prevent vacuuming from deleting the old consolidated fragments. If another process is reading those fragments while vacuuming is deleting them, the reading process is likely to error out or crash.
In general, avoid vacuuming while reading at past timestamps (time traveling) on cloud object stores. It is generally safe to vacuum if you are reading the array at the current timestamp.
Array creation (i.e., storing the array schema on persistent storage) is not thread-/process-safe. We do not expect a practical scenario where multiple threads/processes attempt to create the same array in parallel. We suggest that only one thread/process creates the array, before multiple threads/processes start working concurrently for writes and reads.
This group of pages describes the internal mechanics of the TileDB Open Source storage engine. It is meant for more advanced users that would like to better understand how we implement the array model and format to achieve excellent performance and provide features such as atomicity, concurrency, eventual consistency, data versioning, time traveling, and consolidation.
TileDB enables concurrent writes and reads that can be arbitrarily mixed, without affecting the normal execution of a parallel program. This comes with a more relaxed consistency model, called eventual consistency. Informally, this guarantees that, if no new updates are made to an array, eventually all accesses to the array will “see” the last collective global view of the array (i.e., one that incorporates all the updates). Everything discussed in this section about array fragments is also applicable to array metadata.
We illustrate the concept of eventual consistency in the figure below (which is the same for both dense and sparse arrays). Suppose we perform two writes in parallel (by different threads or processes), producing two separate fragments. Assume also that there is a read at some point in time, which is also performed by a third thread/process (potentially in parallel with the writes). There are five possible scenarios regarding the logical view of the array at the time of the read (i.e., five different possible read query results). First, no write may have completed yet, therefore the read sees an empty array. Second, only the first write got completed. Third, only the second write got completed. Fourth, both writes got completed, but the first write was the one to create a fragment with an earlier timestamp than the second. Fifth, both writes got completed, but the second write was the one to create a fragment with an earlier timestamp than the first.
The concept of eventual consistency essentially tells you that, eventually (i.e., after all writes have completed), you will see the view of the array with all updates in. The order of the fragment creation will determine which cells are overwritten by others and, hence, greatly affects the final logical view of the array.
Eventual consistency allows high availability and concurrency. This model is followed by the AWS S3 object store and, thus, TileDB is ideal for integrating with such distributed storage backends. If strict consistency is required for some application (e.g., similar to that in transactional databases), then an extra layer must be built on top of TileDB Open Source to enforce additional synchronization.
But how does TileDB deal internally with consistency? This is where opening an array becomes important. When you open an array (at the current time or a time in the past), TileDB takes a snapshot of the already completed fragments. This is the view of the array for all queries that will be using that opened array object. If writes happen (or get completed) after the array got opened, the queries will not see the new fragments. If you wish to see the new fragments, you will need to either open a new array object and use that one for the new queries, or reopen the array (reopening the array bypasses closing it first, permitting some performance optimizations).
We illustrate with the figure below. The first array depicts the logical view when opening the array. Next suppose a write occurs (after opening the array) that creates the fragment shown as the second array in the figure. If we attempt to read from the opened array, even after the new fragment creation, we will see the view of the third array in the figure. In other words, we will not see the updates that occurred between opening and reading from the array. If we'd like to read from the most up-to-date array view (fourth array in the figure), we will need to reopen the array after the creation of the fragment.
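In TileDB-Py this looks roughly as follows (the array name is a placeholder):

```python
import tiledb

A = tiledb.open("my_dense_array", mode="r")   # snapshot of the completed fragments
stale_view = A[:]        # a write that completes after this open is not visible here
A.reopen()               # refresh the snapshot without closing the array
fresh_view = A[:]        # now includes fragments completed since the original open
A.close()
```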
When you write to TileDB with multiple processes, if your application is the one to be synchronizing the writes across machines, make sure that the machine clocks are synchronized as well. This is because TileDB sorts the fragments based on the timestamp in their names, which is calculated based on the machine clock.
Here is how TileDB reads achieve eventual consistency on AWS S3:
Upon opening the array, list the fragments in the array folder
Consider only the fragments that have an associated `.ok` file (the ones that do not have one are either in progress or not yet visible due to S3's eventual consistency)
The `.ok` file is PUT after all the fragment data and metadata files have been PUT in the fragment folder.
Any access inside the fragment folder is performed with a byte range GET request, never with LIST. Due to S3’s read-after-write consistency model, those GET requests are guaranteed to succeed.
The above practically tells you that a read operation will always succeed and never be corrupted (i.e., it will never have results from partially written fragments), but it will consider only the fragments that S3 makes visible (in their entirety) at the timestamp of opening the array.
Default read/write queries in TileDB are synchronous or blocking. This means that the user function that is submitting the query has to block and wait until TileDB is done processing the query. There are scenarios in which you may want to submit the query in an asynchronous or non-blocking fashion. In other words, you may wish to submit the query but tell TileDB to process it in the background, while you proceed with the execution of your function and perform other tasks while TileDB is executing the query in parallel. TileDB supports asynchronous queries and enables you to check the query status (e.g., if it is still in progress). It also allows to pass a callback upon submission, i.e., specify a function that you wish TileDB to compute upon finishing processing the query. This applies to both dense and sparse arrays, as well as to both write and read queries.
The figure below shows the difference between synchronous and asynchronous query execution.
TileDB allocates a separate thread pool for asynchronous queries, whose size is controlled by configuration parameter `sm.num_async_threads` (defaulting to 1).
TileDB supports datetime values for attributes and array domains. The representation used by TileDB follows the design of NumPy's `np.datetime64` datatype.
Values for the datetime types are internally stored and manipulated as `int64` values. From the perspective of core TileDB, the datetime datatypes are simply aliases for `TILEDB_INT64`.
The meaning of an integral datetime value depends on three things:
A reference date. TileDB fixes this to the UNIX epoch time (1970-01-01 at 12:00 am). This is not currently configurable.
A unit of time. For example: day, month, hour, or nanosecond.
An integer value. This is the integer number of time units relative to the reference date.
For example, a value of `10` for the type `TILEDB_DATETIME_DAY` refers to 12:00 am on 1970-01-11. A value of `-18` for the type `TILEDB_DATETIME_HR` refers to 6:00 am on 1969-12-31, or `1969-12-31T06:00Z` in ISO 8601 format.
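The same arithmetic can be checked directly with NumPy, whose `datetime64` semantics TileDB follows:

```python
import numpy as np

print(np.datetime64(10, "D"))    # 1970-01-11 (10 days after the epoch)
print(np.datetime64(-18, "h"))   # 1969-12-31T06 (18 hours before the epoch)
print(np.datetime64("2000-01-01", "D").astype(np.int64))   # 10957 days since 1970-01-01
```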
This means that each date unit of datetime is capable of representing a different range of dates at different resolutions. The following table (values taken from the Numpy np.datetime64 documentation) summarizes each date unit’s relative and absolute range:
Datatype | Time span (relative) | Time span (absolute) |
---|---|---|
TILEDB_DATETIME_YEAR | +/- 9.2e18 years | [9.2e18 BC, 9.2e18 AD] |
TILEDB_DATETIME_MONTH | +/- 7.6e17 years | [7.6e17 BC, 7.6e17 AD] |
TILEDB_DATETIME_WEEK | +/- 1.7e17 years | [1.7e17 BC, 1.7e17 AD] |
TILEDB_DATETIME_DAY | +/- 2.5e16 years | [2.5e16 BC, 2.5e16 AD] |
TILEDB_DATETIME_HR | +/- 1.0e15 years | [1.0e15 BC, 1.0e15 AD] |
TILEDB_DATETIME_MIN | +/- 1.7e13 years | [1.7e13 BC, 1.7e13 AD] |
TILEDB_DATETIME_SEC | +/- 2.9e11 years | [2.9e11 BC, 2.9e11 AD] |
TILEDB_DATETIME_MS | +/- 2.9e8 years | [2.9e8 BC, 2.9e8 AD] |
TILEDB_DATETIME_US | +/- 2.9e5 years | [290301 BC, 294241 AD] |
TILEDB_DATETIME_NS | +/- 292 years | [1678 AD, 2262 AD] |
TILEDB_DATETIME_PS | +/- 106 days | [1969 AD, 1970 AD] |
TILEDB_DATETIME_FS | +/- 2.6 hours | [1969 AD, 1970 AD] |
TILEDB_DATETIME_AS | +/- 9.2 seconds | [1969 AD, 1970 AD] |
Datatype | Time span (relative) | Time span (absolute) |
---|---|---|
"datetime64[Y]" | +/- 9.2e18 years | [9.2e18 BC, 9.2e18 AD] |
"datetime64[M]" | +/- 7.6e17 years | [7.6e17 BC, 7.6e17 AD] |
"datetime64[W]" | +/- 1.7e17 years | [1.7e17 BC, 1.7e17 AD] |
"datetime64[D]" | +/- 2.5e16 years | [2.5e16 BC, 2.5e16 AD] |
"datetime64[h]" | +/- 1.0e15 years | [1.0e15 BC, 1.0e15 AD] |
"datetime64[m]" | +/- 1.7e13 years | [1.7e13 BC, 1.7e13 AD] |
"datetime64[s]" | +/- 2.9e11 years | [2.9e11 BC, 2.9e11 AD] |
"datetime64[ms]" | +/- 2.9e8 years | [2.9e8 BC, 2.9e8 AD] |
"datetime64[us]" | +/- 2.9e5 years | [290301 BC, 294241 AD] |
"datetime64[ns]" | +/- 292 years | [1678 AD, 2262 AD] |
"datetime64[ps]" | +/- 106 days | [1969 AD, 1970 AD] |
"datetime64[fs]" | +/- 2.6 hours | [1969 AD, 1970 AD] |
"datetime64[as]" | +/- 9.2 seconds | [1969 AD, 1970 AD] |
Datatype | Time span (relative) | Time span (absolute) |
---|---|---|
DATETIME_YEAR | +/- 9.2e18 years | [9.2e18 BC, 9.2e18 AD] |
DATETIME_MONTH | +/- 7.6e17 years | [7.6e17 BC, 7.6e17 AD] |
DATETIME_WEEK | +/- 1.7e17 years | [1.7e17 BC, 1.7e17 AD] |
DATETIME_DAY | +/- 2.5e16 years | [2.5e16 BC, 2.5e16 AD] |
DATETIME_HR | +/- 1.0e15 years | [1.0e15 BC, 1.0e15 AD] |
DATETIME_MIN | +/- 1.7e13 years | [1.7e13 BC, 1.7e13 AD] |
DATETIME_SEC | +/- 2.9e11 years | [2.9e11 BC, 2.9e11 AD] |
DATETIME_MS | +/- 2.9e8 years | [2.9e8 BC, 2.9e8 AD] |
DATETIME_US | +/- 2.9e5 years | [290301 BC, 294241 AD] |
DATETIME_NS | +/- 292 years | [1678 AD, 2262 AD] |
DATETIME_PS | +/- 106 days | [1969 AD, 1970 AD] |
DATETIME_FS | +/- 2.6 hours | [1969 AD, 1970 AD] |
DATETIME_AS | +/- 9.2 seconds | [1969 AD, 1970 AD] |
TileDB supports fast and parallel aggregation of results. Currently, the results can only be aggregated over the whole returned dataset, which this page will call the default channel. To add aggregates to a query, the first thing to do is to get the default channel. For `count` (a nullary aggregate), no per-column operation needs to be created; there is a constant operation that can be applied directly. For the other aggregates, an operation needs to be created on the desired column. That operation can then be applied to the default channel, while defining the output field name for the result. Finally, buffers to receive the aggregate result can be specified using the regular buffer APIs on the query (see Basic Reading).
Note that ranges and query conditions can still be used to limit the rows to aggregate. Also note that TileDB allows getting the data and computing aggregates simultaneously. To do so, it is only required to specify buffers for the desired columns at the same time as the aggregated results. Here, the result of the aggregation will be available once the query is in a completed state (see Incomplete Queries).
Finally, here is a list of supported operations and information about the supported input field data type and the output datatype.
Aggregate operation | Input field type | Output type |
---|---|---|
Count | N/A | UINT64 |
Sum | Numeric fields | Signed fields: INT64; unsigned fields: UINT64; floating point fields: FLOAT64 |
Min/Max | Numeric/string fields | Same as input type |
Null count | Nullable fields | UINT64 |
Mean | Numeric fields | FLOAT64 |

Aggregate operation | Operation name |
---|---|
Count | "count" |
Null count | "null_count" |
Sum | "sum" |
Min/Max | "min", "max" |
Mean | "mean" |