TileDB supports fast and parallel subarray reads, with the option to time travel, i.e., to read the array at a particular time in the past. The read algorithm is architected to handle multiple fragments efficiently and completely transparently to the user. To read an array, TileDB first "opens" the array and brings some lightweight fragment metadata into main memory. Using this metadata, TileDB knows which fragments to ignore and which to focus on, e.g., based on whether their non-empty domain overlaps with the query subarray, or whether the fragment was created at or before the time of interest. Moreover, in case consolidation has occurred, TileDB is smart enough to ignore fragments that have been consolidated, by considering only the merged fragment that encompasses them.
When reading an array, the user provides:
a (single- or multi-range) subarray
the attributes to slice on (any subset of the attributes, including the coordinates)
the layout (with respect to the subarray) in which to return the result cells
The read algorithm is quite involved. It leverages spatial indexing to locate only the data tiles relevant to the slice, it makes sure it does not fetch a data tile twice in the case of multi-range queries, it performs selective decompression of tile chunks after a tile has been fetched from the backend, and it employs parallelism pretty much everywhere (in IO, decompression, sorting, etc.).
The figure below shows how to read the values of a single attribute from a dense array. The ideas extend to multi-attribute arrays and slicing on any subset of the attributes, including even retrieving the explicit coordinates of the results. The figure shows retrieving the results in 3 different layouts, all with respect to the subarray query. This means that you can ask TileDB to return the results in an order that is different than the actual physical order (which, recall, is always the global order), depending on the needs of your application.
You can also submit multi-range subarrays, as shown in the figure below. The supported orders here are row-major, column-major and unordered. The latter gives no guarantees about the order; TileDB will attempt to process the query in the fastest possible way and return the results in an arbitrary order. It is recommended to use this layout if you care about performance and you do not care about the order of the results. You can also ask TileDB to return the explicit coordinates of the returned values if you wish to know which value corresponds to which cell.
Note that reading dense arrays always returns dense results. This means that, if your subarray overlaps with empty (non-materialized) cells in the dense array, TileDB will return default or user-defined fill values for those cells. The figure below shows an example.
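The fill-value behavior can be sketched as follows. This is an illustrative stand-in, not TileDB's actual code: the `FILL` constant, function names and the dictionary-based storage are assumptions for the example.

```python
# Sketch: reading a dense array whose materialized (non-empty) domain
# covers only part of the subarray. Cells outside the non-empty domain
# receive a fill value instead of causing an error.

FILL = -2147483648  # hypothetical integer fill value

def read_dense(subarray, non_empty, data):
    """subarray/non_empty: ((row_lo, row_hi), (col_lo, col_hi)), inclusive.
    data[(r, c)] holds the materialized values. Returns a dense,
    row-major list covering the whole subarray."""
    (r0, r1), (c0, c1) = subarray
    (nr0, nr1), (nc0, nc1) = non_empty
    out = []
    for r in range(r0, r1 + 1):
        for c in range(c0, c1 + 1):
            if nr0 <= r <= nr1 and nc0 <= c <= nc1:
                out.append(data[(r, c)])      # materialized cell
            else:
                out.append(FILL)              # empty cell: fill value
    return out

# A 2x2 non-empty region read through a 3x3 subarray:
data = {(r, c): 10 * r + c for r in range(1, 3) for c in range(1, 3)}
result = read_dense(((1, 3), (1, 3)), ((1, 2), (1, 2)), data)
```

The returned buffer is always dense: five of the nine cells come back as the fill value here.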
Recall that all cells in a dense fragment must have a value, which TileDB materializes on disk. This characteristic of dense fragments is important as it considerably simplifies spatial indexing, which becomes almost implicit. Consider the example in the figure below. Knowing the space tile extent along each dimension and the tile order, we can easily identify which space tiles intersect with a subarray query without maintaining any complicated index. Then, using lightweight bookkeeping (such as offsets of data tiles on disk, compressed data tile size, etc.), TileDB can fetch the tiles containing results from storage to main memory. Finally, knowing the cell order, it can locate each slab of contiguous cell results in constant time (again without extra indexing) and minimize the number of memory copy operations.
Note that the above ideas apply also to dense fragments that populate only a subset of the array domain; knowing the non-empty domain, TileDB can use similar arithmetic calculations to locate the overlapping tiles and cell results.
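The "almost implicit" spatial indexing described above boils down to integer arithmetic. The following sketch (with assumed function and parameter names) shows how the intersecting space tiles follow directly from the tile extents and the domain start, with no index structure at all:

```python
# Sketch: identify which space tiles of a dense fragment intersect a
# subarray, using only the per-dimension tile extents.

def intersecting_tiles(subarray, tile_extents, domain_start):
    """subarray: list of (lo, hi) inclusive ranges, one per dimension.
    Returns the inclusive range of intersecting tile indices per
    dimension, computed purely arithmetically."""
    ranges = []
    for (lo, hi), ext, start in zip(subarray, tile_extents, domain_start):
        ranges.append(((lo - start) // ext, (hi - start) // ext))
    return ranges

# 2D array with domain starting at (1, 1) and 2x2 space tiles:
# subarray [2,3] x [1,4] touches tile rows 0-1 and tile columns 0-1.
tiles = intersecting_tiles([(2, 3), (1, 4)], [2, 2], [1, 1])
```

Once the tile index ranges are known, the lightweight bookkeeping (on-disk offsets and sizes) tells TileDB exactly which byte ranges to fetch.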
The figure below shows an example subarray query on a sparse array with a single attribute, where the query requests also the coordinates of the result cells. Similar to the case of dense arrays, the user can request the results in layouts that may be different from the physical layout of the cells in the array (global order).
Sparse arrays accept multi-range subarray queries as well. Similar to the dense case, global order is not applicable here, but instead an unordered layout is supported that returns the results in an arbitrary order (again, TileDB will try its best to return the results as fast as possible in this read mode).
A sparse fragment differs from a dense fragment in the following aspects:
A sparse fragment stores only non-empty cells that might appear in any position in the domain (i.e., they may not be concentrated in dense hyper-rectangles)
In sparse fragments there is no correspondence between space and data tiles. The data tiles are created by first sorting the cells in the global order, and then grouping adjacent cell values based on the tile capacity.
There is no way to know a priori the position of the non-empty cells, unless we maintain extra indexing information.
A sparse fragment materializes the coordinates of the non-empty cells in data tiles.
TileDB indexes sparse non-empty cells with R-Trees. Specifically, for every coordinate data tile it constructs the minimum bounding rectangle (MBR) using the coordinates in the tile. Then, it uses the MBRs of the data tiles as leaves and constructs an R-Tree bottom up by recursively grouping MBRs into larger MBRs using a fanout parameter. The figure below shows an example of a sparse fragment and its corresponding R-Tree.
Given a subarray query, the R-Tree (which is small enough to fit in main memory) is used to identify the intersecting data tile MBRs. Then, the qualifying coordinate data tiles are fetched and the materialized coordinates therein are used to determine the actual results.
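The construction and use of the R-Tree can be sketched as follows. The fanout value, the 2D tuple encoding of MBRs and all function names are illustrative assumptions, not TileDB internals:

```python
# Sketch: build an R-Tree bottom-up from the coordinates of each sparse
# data tile, then use it to find tiles intersecting a subarray query.

def mbr(points):
    """Minimum bounding rectangle of a tile's 2D coordinates."""
    xs, ys = zip(*points)
    return (min(xs), max(xs)), (min(ys), max(ys))

def merge(mbrs):
    """Enclosing MBR of a group of child MBRs."""
    return (min(m[0][0] for m in mbrs), max(m[0][1] for m in mbrs)), \
           (min(m[1][0] for m in mbrs), max(m[1][1] for m in mbrs))

def build_rtree(tile_coords, fanout=2):
    """tile_coords: one coordinate list per data tile.
    Returns the tree as a list of levels, leaves first, root last."""
    level = [mbr(t) for t in tile_coords]          # leaf MBRs
    levels = [level]
    while len(level) > 1:                          # group `fanout` children
        level = [merge(level[i:i + fanout])
                 for i in range(0, len(level), fanout)]
        levels.append(level)
    return levels

def intersects(m, query):
    """True if an MBR overlaps a query rectangle in every dimension."""
    return all(q_lo <= hi and lo <= q_hi
               for (lo, hi), (q_lo, q_hi) in zip(m, query))

tiles = [[(1, 1), (2, 3)], [(4, 1), (5, 2)], [(1, 6), (2, 7)], [(6, 6), (7, 7)]]
levels = build_rtree(tiles)
hits = [i for i, m in enumerate(levels[0]) if intersects(m, ((4, 5), (1, 2)))]
```

Only the data tiles whose leaf MBRs intersect the query (here, a single tile) need to be fetched; their materialized coordinates then yield the exact results.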
Recall that writing to TileDB arrays produces a number of timestamped fragments. TileDB supports reading an array at an arbitrary instance in time, by providing a timestamp upon opening the array for reading. Any fragment created after that timestamp will be ignored and the read will produce results as if only the fragments created at or before the given timestamp existed in the array. Time traveling applies to both dense and sparse arrays. The figure below shows an example of a dense array with 3 fragments, along with the results of a subarray depending on the timestamp the array gets opened with.
In the case of consolidation, time traveling works as follows:
If the user opens the array at a timestamp that is larger than or equal to the second timestamp of a fragment name, then that fragment will be considered in the read.
If the user opens the array at a timestamp that is smaller than the second timestamp of a fragment, then that fragment will be ignored.
If a fragment that qualifies for reading has been consolidated into another fragment that is considered for reading, then it will be ignored.
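The three rules above can be sketched as a filter over fragment timestamp ranges. The `(t1, t2)` tuple encoding and the helper name are illustrative; TileDB actually encodes these timestamps in the fragment name:

```python
# Sketch: decide which fragments participate in a read when the array
# is opened at a given timestamp, including the consolidation rules.

def fragments_to_read(fragments, open_ts):
    """fragments: list of (t1, t2) timestamp ranges, one per fragment.
    Returns the ranges that participate in the read at open_ts."""
    # Rules 1 & 2: keep a fragment only if open_ts >= its second timestamp.
    qualifying = [f for f in fragments if open_ts >= f[1]]
    # Rule 3: drop any fragment whose range is enclosed by another
    # qualifying (i.e., consolidated) fragment.
    result = []
    for f in qualifying:
        covered = any(g != f and g[0] <= f[0] and f[1] <= g[1]
                      for g in qualifying)
        if not covered:
            result.append(f)
    return result

# Two original fragments and the fragment that consolidated them:
frags = [(1, 1), (2, 2), (1, 2)]
```

Opening at timestamp 2 reads only the consolidated fragment `(1, 2)`; opening at timestamp 1 ignores it (and the fragment at timestamp 2) and reads only `(1, 1)`, reproducing the array as it existed then.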
There are situations where the buffers allocated by the user to hold the results are not large enough for a given query. Instead of erroring out, TileDB gracefully handles these cases by attempting to serve a portion of the query and reporting back with an "incomplete" query status. The user should then consume the returned results and resubmit the query, until the query returns a "complete" status. This is explained with code here. TileDB maintains all the necessary internal state inside the query object.
But what portion of the query is served in each iteration? TileDB implements the incomplete query functionality via result estimation and subarray partitioning. Specifically, if TileDB assesses (via estimation heuristics) that the query subarray leads to a larger result size than the allocated buffers, it splits (i.e., partitions) it appropriately, such that a smaller subarray (single- or multi-range) can be served. The challenge is in partitioning the subarray in a way that the result cell order (defined by the user) is respected across the incomplete query iterations. TileDB performs this partitioning process efficiently, correctly, and transparently to the user.
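The partition-and-retry loop can be sketched in one dimension. This is a deliberate simplification of TileDB's heuristics: the result-size "estimate" here is just the range length, and all names are assumptions made for the example:

```python
# Sketch: serve a 1D subarray in batches that never exceed the user's
# buffer, splitting (partitioning) any range that is estimated to be
# too large, while preserving the requested result order.

def read_range(lo, hi):
    """Stand-in for serving a subarray range: returns its cells."""
    return list(range(lo, hi + 1))

def incremental_read(lo, hi, buffer_cells):
    """Serve [lo, hi] in order, at most buffer_cells per batch."""
    batches = []
    pending = [(lo, hi)]                 # ranges still to serve, in order
    while pending:
        a, b = pending.pop(0)
        if b - a + 1 <= buffer_cells:
            batches.append(read_range(a, b))     # fits: "complete" batch
        else:
            mid = (a + b) // 2                   # too big: partition, retry
            pending[:0] = [(a, mid), (mid + 1, b)]
    return batches

batches = incremental_read(0, 9, 4)
```

Each batch corresponds to one "incomplete" submission; concatenating the batches reproduces the full result in the requested order.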
TileDB caches data and metadata upon read queries. More specifically, it caches:
Fragment metadata for those fragments that are relevant to a submitted query. There are no restrictions on how large that cache space is. The user can flush the fragment metadata cache by simply closing all open instances of an array.
Data tiles that overlap a subarray (across fragments). This cache space is configurable (see Configuration Parameters). The data tiles are currently cached in their raw "disk" form (i.e., with all potential filters applied as they are stored in the data files).
The use of caching can be quite beneficial especially if the data resides on cloud object stores like AWS S3.
TileDB implements additional optimizations that improve decompression times and the overall memory footprint of a read query. Recall that each data tile is further decomposed into chunks. After fetching a data tile that contains result candidates from storage, the TileDB read algorithm knows exactly which chunks of the tile are relevant to the query and decompresses (unfilters) only those chunks.
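Selecting which chunks to unfilter is again simple arithmetic over cell positions, sketched below. The chunk size, tile size and function name are illustrative assumptions:

```python
# Sketch: a data tile is decomposed into fixed-size chunks; only the
# chunks overlapping the requested cell range need to be decompressed
# (unfiltered), the rest stay in their raw filtered form.

def chunks_to_unfilter(cell_lo, cell_hi, cells_per_chunk, num_chunks):
    """Return indices of the tile chunks containing cells
    [cell_lo, cell_hi] (inclusive)."""
    first = cell_lo // cells_per_chunk
    last = min(cell_hi // cells_per_chunk, num_chunks - 1)
    return list(range(first, last + 1))

# A tile of 1000 cells in chunks of 100: a query touching cells 250-420
# needs only chunks 2, 3 and 4; the other seven stay compressed.
needed = chunks_to_unfilter(250, 420, 100, 10)
```

Decompressing three chunks instead of ten reduces both CPU time and the memory footprint of the read.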