Spatial indexing is vital for efficiently slicing subarrays along the dimensions, which is the most important operation in TileDB. Although TileDB offers a unified array API to the user for both dense and sparse arrays, it transparently employs different spatial indexing techniques for dense and sparse fragments.
Recall that all cells in a dense fragment must have a value, which TileDB materializes on disk. This characteristic of dense fragments is important as it considerably simplifies spatial indexing, which becomes almost implicit. Consider the example in the figure below. Knowing the space tile extent along each dimension and the tile order, we can easily identify which space tiles intersect with a subarray query without maintaining any complicated index. Then, using lightweight booking (such as offsets of data tiles on disk, compressed data tile size, etc.), we can fetch the tiles containing results from storage to main memory. Finally, knowing the cell order, we can locate each slab of contiguous cell results in constant time, again without extra indexing.
Note that the above ideas apply also to dense fragments that populate only a subset of the array domain; knowing the non-empty domain, TileDB can apply similar arithmetic calculations to locate the overlapping tiles and cell results.
Recall the differences of a sparse fragment from a dense fragment:
A sparse fragment stores only non-empty cells that might appear in any position in the domain (i.e., they may not be concentrated in dense hyper-rectangles)
In sparse fragments there is no correspondence between space and data tiles. The data tiles are created by first sorting the cells on the global order, and then grouping adjacent cell values based on the tile capacity.
There is no way to know a priori the position of the non-empty cells, unless we maintain extra indexing information.
A sparse fragment materializes the coordinates of the non-empty cells in data tiles.
TileDB indexes sparse non-empty cells with R-Trees. Specifically, for every coordinate data tile it constructs the minimum bounding rectangle (MBR) using the coordinates in the tile. Then, it uses the MBRs of the data tiles as leaves and constructs an R-Tree bottom up by recursively grouping MBRs into larger MBRs using a fanout parameter. The figure below shows an example of a sparse fragment and its corresponding R-Tree.
Given a subarray query, the R-Tree (which is small enough to fit in main memory) is used to identify the intersecting data tile MBRs. Then, the qualifying coordinate data tiles are fetched and the materialized coordinates therein are used to determine the actual results.