Performance tips

Spark has a large number of configuration parameters that can affect the performance of both the TileDB driver and the user application. This page provides some performance tuning tips.

Multi-threaded Reads

TileDB-Spark uses TileDB-Java and the underlying C++ libtiledb for parallel I/O, compression, and encryption. As such, for optimized read performance you should limit Spark to one executor per machine and give that single executor all of the machine's resources.

Spark Configuration

Set the following Spark configuration parameters:

spark.executor.instances = # number of machines
spark.executor.memory = # 80% of total ram
spark.executor.cores = # number of cores each machine has
spark.task.cpus = # number of cores each machine has

It is important to set spark.task.cpus equal to spark.executor.cores. This prevents Spark from scheduling more than one read partition on each machine. The executor memory is set to 80% of the total RAM to leave headroom for overhead on the host itself.
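As a minimal sketch, the settings above can be applied programmatically when building the Spark session. The cluster size here is an illustrative assumption (4 machines, each with 16 cores and 64GB of RAM), not a recommendation:

```python
from pyspark.sql import SparkSession

# Illustrative cluster: 4 machines, 16 cores and 64GB of RAM each.
spark = (
    SparkSession.builder
    .appName("tiledb-read")
    .config("spark.executor.instances", "4")   # one executor per machine
    .config("spark.executor.memory", "51g")    # ~80% of 64GB
    .config("spark.executor.cores", "16")      # all cores of the machine
    .config("spark.task.cpus", "16")           # one task (read partition) per executor
    .getOrCreate()
)
```

In a managed deployment these same keys can instead go in spark-defaults.conf or on the spark-submit command line.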

If using Yarn, the above configuration parameters are likely not enough. You will need to also configure Yarn similarly.

TileDB Read Parameters

There are two main TileDB driver options to tweak for optimizing reads: partition_count and read_buffer_size. partition_count should be set to the number of executors, which is the number of machines in the cluster.

read_buffer_size should be set to at least 104857600 (100MB). Larger read buffers are critical for reducing the number of incomplete queries. The maximum read buffer size is limited by the available RAM. If you use Yarn, the maximum buffer size is also constrained by spark.yarn.executor.memoryOverhead, because TileDB read/write buffers are stored off-heap.
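A minimal PySpark read sketch with these two options. The format name io.tiledb.spark, the array URI, and the partition count of 4 are illustrative assumptions for a 4-machine cluster:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("tiledb-read").getOrCreate()

df = (
    spark.read
    .format("io.tiledb.spark")
    .option("uri", "s3://my-bucket/my_array")    # illustrative array URI
    .option("partition_count", "4")              # one partition per executor/machine
    .option("read_buffer_size", str(104857600))  # 100MB
    .load()
)
```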

Single-Threaded Reads

There are applications that rely on using multiple Spark tasks for parallelism, constraining each task to run on a single thread. This is common for most PySpark and SparkR applications. Below we describe how to configure Spark and the TileDB data source for optimized performance in this case.

Spark Configuration

Set the following Spark configuration parameters:

spark.executor.instances = # number of machines * cores on each machine
spark.executor.memory = # 80% of total ram / number of executors
spark.executor.cores = 1
spark.task.cpus = 1

If you use Yarn, the above configuration parameters are likely not enough. You will need to also configure Yarn similarly.
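The arithmetic behind these settings can be sketched in a few lines of Python; the machine count, core count, and RAM figures are illustrative assumptions:

```python
# Illustrative cluster: 4 machines, 16 cores and 64GB of RAM each.
machines = 4
cores_per_machine = 16
ram_gb_per_machine = 64

# One single-core executor per core on each machine.
executor_instances = machines * cores_per_machine
executors_per_machine = cores_per_machine

# 80% of the machine's RAM, split across its executors.
executor_memory_gb = 0.8 * ram_gb_per_machine / executors_per_machine

print(executor_instances)   # 64
print(executor_memory_gb)   # roughly 3.2GB per single-core executor
```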

TileDB Read Parameters

read_buffer_size should be set to the largest value possible given the executor's available memory. TileDB typically has a memory overhead of 3x, so 3 * read_buffer_size should be less than Spark's maximum off-heap memory. If you use Yarn, this value is defined by spark.yarn.executor.memoryOverhead. The default read_buffer_size of 10MB is usually sufficient.

The partition_count should be set to the data size being read divided by read_buffer_size. If the data size is not known, set the partition count to the number of executors. This might lead to over-partitioning, so you may want to try different values until you find an optimal setting for your dataset.
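For example, a quick back-of-the-envelope calculation; the 10GB dataset size is an illustrative assumption:

```python
data_size_bytes = 10 * 1024**3    # 10GB dataset (illustrative)
read_buffer_size = 100 * 1024**2  # 100MB read buffer

# Round up so the final partial buffer still gets its own partition.
partition_count = -(-data_size_bytes // read_buffer_size)
print(partition_count)  # 103
```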

Finally, it is important to set several of the TileDB parallelism configuration parameters via option() on the Spark DataFrame reader:

tiledb.sm.num_async_threads = 1,
tiledb.sm.num_reader_threads = 1,
tiledb.sm.num_tbb_threads = 1,
tiledb.sm.num_writer_threads = 1,
tiledb.vfs.num_threads = 1
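Putting this together, a single-threaded read sketch in PySpark. The format name io.tiledb.spark, the array URI, and the concrete partition count are illustrative assumptions:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("tiledb-read-single-threaded").getOrCreate()

df = (
    spark.read
    .format("io.tiledb.spark")
    .option("uri", "s3://my-bucket/my_array")       # illustrative array URI
    .option("read_buffer_size", str(10 * 1024**2))  # 10MB default
    .option("partition_count", "64")                # e.g. one per single-core executor
    # Pin TileDB internals to a single thread per task:
    .option("tiledb.sm.num_async_threads", "1")
    .option("tiledb.sm.num_reader_threads", "1")
    .option("tiledb.sm.num_tbb_threads", "1")
    .option("tiledb.sm.num_writer_threads", "1")
    .option("tiledb.vfs.num_threads", "1")
    .load()
)
```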