Parallelism

TileDB is fully parallelized internally, i.e., it uses multiple threads to process in parallel the most heavyweight tasks.

We explain how TileDB parallelizes the read and write queries, outlining the configuration parameters that you can use to control the amount of parallelization. Note that here we cover only the most important areas, as TileDB parallelizes numerous other internal tasks. See Configuration Parameters and Configuration for a summary of the parameters and the way to set them respectively.

Reading

A read query mainly involves the following steps in this order:

  1. Identifying the physical attribute data tiles that are relevant to the query (pruning the rest)

  2. Performing parallel IO to retrieve those tiles from the storage backend.

  3. Unfiltering the data tiles in parallel to get the raw cell values and coordinates.

  4. Performing a refining step to get the actual results and organize them in the query layout.

TileDB parallelizes all steps, but here we discuss mainly steps (2) and (3) that are the most heavyweight.

Parallel IO

TileDB reads the relevant tiles from all attributes to be read in parallel as follows:

parallel_for_each attribute being read:
parallel_for_each relevant tile of the attribute read:
prepare a byte range to be read (IO task)

TileDB computes the byte ranges required to be fetched from each attribute file. Those byte ranges might be disconnected and could be numerous especially in the case of multi-range subarrays. In order to reduce the latency of the IO requests (especially on S3), TileDB attempts to merge byte ranges that are close to each other and dispatch fewer larger IO requests instead of numerous smaller ones. More specifically, TileDB merges two byte ranges if their gap size is not bigger than vfs.min_batch_gap and their resulting size is not bigger than vfs.min_batch_size. Then, each byte range (always corresponding to the same attribute file) becomes an IO task. These IO tasks are dispatched for concurrent execution, where the maximum level of concurrency is controlled by the sm.io_concurrency_level parameter.

TileDB may further partition each byte range to be fetched based on parameters vfs.file.max_parallel_ops (for posix and Windows), vfs.s3.max_parallel_ops(for S3) and vfs.min_parallel_size. Those partitions are then read in parallel. Currently, the maximum parallel operations for HDFS is set to 1, i.e., this task parallelization step does not apply to HDFS.

Parallel Tile Unfiltering

Once the relevant data tiles are in main memory, TileDB "unfilters" them (i.e., runs the filters applied during writes in reverse) in parallel in a nested manner as follows:

for_each attribute being read:
parallel_for_each tile of the attribute read:
parallel_for_each chunk of the tile:
unfilter chunk

The “chunks” of a tile are controlled by a TileDB filter list parameter that defaults to 64KB.

The sm.compute_concurrency_level parameter impacts the for loops above, although it is not recommended to modify this configuration parameter from its default setting. The nested parallelism in reads allows for maximum utilization of the available cores for filtering (e.g. decompression), in either the case where the query intersects few large tiles or many small tiles.

Writing

A write query mainly involves the following steps in this order:

  1. Re-organizing the cells in the global cell order and into attribute data tiles.

  2. Filtering the attribute data tiles to be written.

  3. Performing parallel IO to write those tiles to the storage backend.

TileDB parallelizes all steps, but here we discuss mainly steps (2) and (3) that are the most heavyweight.

Parallel Tile Filtering

For writes TileDB uses a similar strategy as for reads:

parallel_for_each attribute being written:
for_each tile of the attribute being written:
parallel_for_each chunk of the tile:
filter chunk

Similar to reads, the sm.compute_concurrency_level parameter impacts the for loops above, although it is not recommended to modify this configuration parameter from its default setting.

Parallel IO

Similar to reads, IO tasks are created for each tile of every attribute. These IO tasks are dispatched for concurrent execution, where the maximum level of concurrency is controlled by the sm.io_concurrency_level parameter. For HDFS, this is the only parallelization TileDB provides for writes. For the other backends, TileDB parallelizes the writes further.

For POSIX and Windows, if a data tile is large enough, the VFS layer partitions the tile based on configuration parameters vfs.file.max_parallel_ops and vfs.min_parallel_size. Those partitions are then written in parallel using the VFS thread pool, whose size is controlled by vfs.io_concurrency.

For S3, TileDB buffers potentially several tiles and issues parallel multipart upload requests to S3. The size of the buffer is equal to vfs.s3.max_parallel_ops * vfs.s3.multipart_part_size. When the buffer is filled, TileDB issues vfs.s3.max_parallel_opsparallel multipart upload requests to S3.