TileDB supports fast and parallel subarray reads, with the option to time travel, i.e., to read the array at a particular time in the past. The read algorithm is architected to handle multiple fragments efficiently and completely transparently to the user. To read an array, TileDB first "opens" the array and brings some lightweight fragment metadata in main memory. Using this metadata, TileDB knows which fragments to ignore and which to focus on, e.g., based on whether their non-empty domain overlaps with the query subarray, or whether the fragment was created at or before the time of interest. Moreover, in case consolidation has occurred, TileDB will be smart enough to ignore fragments that have been consolidated, by considering only the merged fragment that encompasses them.
When reading an array, the user provides:
a (single- or multi-range) subarray
the attributes to slice on (it can be any subset of the attributes, including the coordinates)
the layout with respect to the subarray to return the result cells in
The read algorithm is quite involved. It leverages spatial indexing to locate only the relevant data tiles to the slice, it makes sure it does not fetch a data tile twice in the case of multi-range queries, it performs selective decompression of tile chunks after a tile has been fetched from the backend, and it employs parallelism pretty much everywhere (in IO, decompression, sorting, etc).
The figure below shows how to read the values of a single attribute from a dense array. The ideas extend to multi-attribute arrays and slicing on any subset of the attributes, including even retrieving the explicit coordinates of the results. The figure shows retrieving the results in 3 different layouts, all with respect to the subarray query. This means that you can ask TileDB to return the results in an order that is different than the actual physical order (which, recall, is always the global order), depending on the needs of your application.
You can also submit multi-range subarrays, as shows in the figure below. The supported orders here are row-major, column-major and unordered. The latter gives no guarantees about the order; TileDB will attempt to process the query in the fastest possible way and return the results in an arbitrary order. It is recommended to use this layout if you target at performance and you do not care about the order of the results. Also you can ask TileDB to return the explicit coordinates of the returned values if you wish to know which value corresponds to which cell.
Note that reading dense arrays always returns dense results. This means that, if your subarray overlaps with empty (non-materialized) cells in the dense array, TileDB will return default or user-defined fill values for those cells. The figure below shows an example.
Recall that all cells in a dense fragment must have a value, which TileDB materializes on disk. This characteristic of dense fragments is important as it considerably simplifies spatial indexing, which becomes almost implicit. Consider the example in the figure below. Knowing the space tile extent along each dimension and the tile order, we can easily identify which space tiles intersect with a subarray query without maintaining any complicated index. Then, using lightweight bookkeeping (such as offsets of data tiles on disk, compressed data tile size, etc.), TileDB can fetch the tiles containing results from storage to main memory. Finally, knowing the cell order, it can locate each slab of contiguous cell results in constant time (again without extra indexing) and minimize the number of memory copy operations.
Note that the above ideas apply also to dense fragments that populate only a subset of the array domain; knowing the non-empty domain, TileDB can use similar arithmetic calculations to locate the overlapping tiles and cell results.
The figure below shows an example subarray query on a sparse array with a single attribute, where the query requests also the coordinates of the result cells. Similar to the case of dense arrays, the user can request the results in layouts that may be different from the physical layout of the cells in the array (global order).
Sparse arrays accept multi-range subarray queries as well. Similar to the dense case, global order is not applicable here, but instead an unordered layout is supported that returns the results in an arbitrary order (again, TileDB will try its best to return the results as fast as possible in this read mode).
A sparse fragment differs from a dense fragment in the following aspects:
A sparse fragment stores only non-empty cells that might appear in any position in the domain (i.e., they may not be concentrated in dense hyper-rectangles)
In sparse fragments there is no correspondence between space and data tiles. The data tiles are created by first sorting the cells on the global order, and then grouping adjacent cell values based on the tile capacity.
There is no way to know a priori the position of the non-empty cells, unless we maintain extra indexing information.
A sparse fragment materializes the coordinates of the non-empty cells in data tiles.
TileDB indexes sparse non-empty cells with R-Trees. Specifically, for every coordinate data tile it constructs the minimum bounding rectangle (MBR) using the coordinates in the tile. Then, it uses the MBRs of the data tiles as leaves and constructs an R-Tree bottom up by recursively grouping MBRs into larger MBRs using a fanout parameter. The figure below shows an example of a sparse fragment and its corresponding R-Tree.
Given a subarray query, the R-Tree (which is small enough to fit in main memory) is used to identify the intersecting data tile MBRs. Then, the qualifying coordinate data tiles are fetched and the materialized coordinates therein are used to determine the actual results.
Recall that writing to TileDB arrays produces a number of timestamped fragments. TileDB supports reading an array at an arbitrary instance in time, by providing a timestamp upon opening the array for reading. Any fragment created after that timestamp will be ignored and the read will produce results as if only the fragments created at or before the given timestamp existed in the array. Time traveling applies to both dense and sparse arrays. The figure below shows an example of a dense array with 3 fragments, along with the results of a subarray depending on the timestamp the array gets opened with.
In the case of consolidation, time traveling works as follows:
If the user opens the array at a timestamp that is larger than or equal to the second timestamp of a fragment name, then that fragment will be considered in the read.
If the user opens the array at a timestamp that is smaller than the second timestamp of a fragment, then that fragment will be ignored.
If a fragment that qualifies for reading has been consolidated into another fragment that is considered for reading, then it will be ignored.