Parallelism

TileDB is fully parallelized internally, i.e., it uses multiple threads to process the most heavyweight tasks in parallel.

We explain how TileDB parallelizes read and write queries, outlining the configuration parameters that you can use to control the degree of parallelism. Note that here we cover only the most important areas, as TileDB parallelizes numerous other internal tasks as well. See Configuration Parameters for a summary of the parameters and Configuration for how to set them.
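All of the parameters discussed below are passed through a TileDB config object. The following sketch uses the TileDB-Py API; the values are arbitrary examples, not recommended settings.

    import tiledb

    # The parameter names are the ones discussed in this section;
    # the values are arbitrary examples, not recommended settings.
    cfg = tiledb.Config({
        "sm.num_reader_threads": "4",  # reader IO thread pool size
        "sm.num_writer_threads": "4",  # writer IO thread pool size
        "vfs.num_threads": "8",        # VFS thread pool size
    })

    # All queries issued through this context use the above settings.
    ctx = tiledb.Ctx(cfg)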

Reading

A read query mainly involves the following steps in this order:

  1. Identifying the physical attribute data tiles that are relevant to the query (pruning the rest).

  2. Performing parallel IO to retrieve those tiles from the storage backend.

  3. Filtering the data tiles in parallel to get the raw cell values and coordinates.

  4. Performing a refining step to get the actual results and organize them in the query layout.

TileDB parallelizes all steps, but here we focus on steps (2) and (3), which are the most heavyweight.

Parallel IO

To retain flexibility in controlling the number of threads used for the parallel calls into the virtual filesystem (VFS) layer, which abstracts all storage backends, TileDB does not use TBB for these operations. Instead, a list of read tasks is accumulated during the query and dispatched to a thread pool with a work queue. The size of this thread pool is determined by the configuration parameter sm.num_reader_threads (defaulting to 1).
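The following Python sketch illustrates the dispatch pattern described above; it is not TileDB's implementation. The read_task function, the task tuples, and the file names are hypothetical placeholders, and the pool size stands in for sm.num_reader_threads.

    from concurrent.futures import ThreadPoolExecutor

    def read_task(attribute_file, offset, nbytes):
        # Placeholder for "fetch nbytes from attribute_file starting at offset".
        return (attribute_file, offset, nbytes)

    # Hypothetical list of read tasks accumulated during a query.
    tasks = [("a.tdb", 0, 65536), ("a.tdb", 131072, 65536), ("b.tdb", 0, 65536)]

    # A pool of 2 workers plays the role of sm.num_reader_threads = 2.
    with ThreadPoolExecutor(max_workers=2) as pool:
        results = list(pool.map(lambda t: read_task(*t), tasks))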

TileDB computes the byte ranges that need to be fetched from each attribute file. These byte ranges might be disconnected and numerous, especially in the case of multi-range subarrays. To reduce the latency of the IO requests (especially on S3), TileDB attempts to merge byte ranges that are close to each other and dispatch fewer, larger IO requests instead of numerous smaller ones. More specifically, TileDB merges two byte ranges if their gap is not larger than vfs.min_batch_gap and their combined size is not larger than vfs.min_batch_size. Then, each resulting byte range (always corresponding to a single attribute file) becomes an IO task and is added to the thread pool queue, to be performed in parallel by the next available thread.
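The Python sketch below mirrors the merging heuristic just described; it is a simplified illustration, not TileDB's actual code. The ranges are (start, end) byte offsets within a single attribute file, assumed sorted by start offset.

    def merge_byte_ranges(ranges, min_batch_gap, min_batch_size):
        # Greedily merge sorted (start, end) byte ranges: two adjacent ranges
        # are merged when the gap between them is at most min_batch_gap and
        # the merged range is at most min_batch_size (mirroring the
        # vfs.min_batch_gap / vfs.min_batch_size behavior described above).
        merged = []
        for start, end in ranges:
            if merged:
                prev_start, prev_end = merged[-1]
                gap = start - prev_end
                if gap <= min_batch_gap and end - prev_start <= min_batch_size:
                    merged[-1] = (prev_start, max(prev_end, end))
                    continue
            merged.append((start, end))
        return merged

    # Example: with a 500KB gap threshold and a 20MB batch limit, the first
    # two ranges merge into a single IO request; the third stays separate.
    print(merge_byte_ranges(
        [(0, 1_000_000), (1_100_000, 2_000_000), (50_000_000, 51_000_000)],
        min_batch_gap=500_000, min_batch_size=20_000_000))
    # [(0, 2000000), (50000000, 51000000)]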

Each of these IO tasks, depending on its size, can be further parallelized using another thread pool at the VFS layer. The size of this pool is controlled by vfs.num_threads (defaulting to 1). TileDB partitions the byte range of the task based on parameters vfs.file.max_parallel_ops (for POSIX and Windows), vfs.s3.max_parallel_ops (for S3), and vfs.min_parallel_size. The partitions are then read in parallel. Currently, the maximum number of parallel operations for HDFS is set to 1, i.e., this task parallelization step does not apply to HDFS.
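As a rough mental model of the partitioning (a sketch under assumptions, not TileDB's exact algorithm), an IO task of nbytes can be split into at most max_parallel_ops pieces of at least min_parallel_size bytes each:

    import math

    def partition_io_task(nbytes, max_parallel_ops, min_parallel_size):
        # Assumption about how the parameters interact: split the task into
        # at most max_parallel_ops pieces, each at least min_parallel_size
        # bytes; TileDB's exact partitioning logic may differ.
        num_ops = min(max_parallel_ops, max(1, nbytes // min_parallel_size))
        part = math.ceil(nbytes / num_ops)
        return [(i * part, min((i + 1) * part, nbytes)) for i in range(num_ops)]

    # Example: a 40MB task with max_parallel_ops=4 and min_parallel_size=10MB
    # is read as four parallel 10MB sub-reads.
    print(partition_io_task(40_000_000, 4, 10_000_000))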

Parallel Tile Filtering

Once the relevant data tiles are in main memory, TileDB filters them in parallel in a nested manner with TBB as follows:

parallel_for_each attribute being read:
    parallel_for_each tile of the attribute overlapping the subarray:
        parallel_for_each chunk of the tile:
            filter chunk

The size of the “chunks” of a tile is controlled by a TileDB filter list parameter, which defaults to 64KB.
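For instance, the chunk size can be raised when building a filter list for an attribute. The sketch below assumes the TileDB-Py FilterList chunksize keyword; consult the API docs for the exact name in your language binding.

    import numpy as np
    import tiledb

    # Sketch: raise the chunk size from the 64KB default to 128KB; each 128KB
    # chunk of a tile is then filtered (e.g., compressed) as an independent
    # unit by the nested parallel loops above.
    filters = tiledb.FilterList([tiledb.ZstdFilter(level=5)], chunksize=128 * 1024)
    attr = tiledb.Attr(name="a", dtype=np.float64, filters=filters)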

The sm.num_tbb_threads parameter affects the parallel loops above, although it is not recommended to change this configuration parameter from its default setting. The nested parallelism in reads allows maximum utilization of the available cores for filtering (e.g., decompression), whether the query intersects a few large tiles or many small ones.

Writing

A write query mainly involves the following steps in this order:

  1. Re-organizing the cells in the global cell order and grouping them into attribute data tiles.

  2. Filtering the attribute data tiles to be written.

  3. Performing parallel IO to write those tiles to the storage backend.

TileDB parallelizes all steps, but here we focus on steps (2) and (3), which are the most heavyweight.

Parallel Tile Filtering

TileDB uses the same strategy as for reads:

parallel_for_each attribute being written:
    parallel_for_each tile of the attribute being written:
        parallel_for_each chunk of the tile:
            filter chunk

As with reads, the sm.num_tbb_threads parameter affects the parallel loops above, although it is not recommended to change this configuration parameter from its default setting.

Parallel IO

As with reads, TileDB does not use TBB for these IO operations. Instead, it uses a writer thread pool with a work queue, whose size is controlled by the configuration parameter sm.num_writer_threads (defaulting to 1). TileDB adds every tile of every attribute to the work queue and performs all writes in parallel using the available threads in the writer pool. For HDFS, this is the only parallelization TileDB provides for writes. For the other backends, TileDB parallelizes the writes further.

For POSIX and Windows, if a data tile is large enough, the VFS layer partitions the tile based on configuration parameters vfs.file.max_parallel_ops and vfs.min_parallel_size. Those partitions are then written in parallel using the VFS thread pool, whose size is controlled by vfs.num_threads (defaulting to 1).

For S3, TileDB potentially buffers several tiles and issues parallel multipart upload requests to S3. The size of the buffer is equal to vfs.s3.max_parallel_ops * vfs.s3.multipart_part_size. When the buffer fills up, TileDB issues vfs.s3.max_parallel_ops parallel multipart upload requests to S3.
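As a concrete illustration (the numbers below are arbitrary examples, not documented defaults): with vfs.s3.max_parallel_ops set to 4 and vfs.s3.multipart_part_size set to 5MB, the write buffer is 20MB, and flushing it issues 4 parallel 5MB part uploads.

    # Illustrative values only; they are not documented defaults.
    max_parallel_ops = 4                   # vfs.s3.max_parallel_ops
    multipart_part_size = 5 * 1024 * 1024  # vfs.s3.multipart_part_size (5MB)

    buffer_size = max_parallel_ops * multipart_part_size
    print(f"S3 write buffer: {buffer_size // (1024 * 1024)} MB")  # 20 MB
    # When the buffer fills, TileDB issues max_parallel_ops (here, 4) parallel
    # multipart upload requests of multipart_part_size bytes each.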