Spark has a large number of configuration parameters that can affect the performance of both the TileDB driver and the user application. On this page we provide some performance tuning tips.
TileDB-Spark uses TileDB-Java and the underlying C++ libtiledb for parallel IO, compression, and encryption. As such, for optimized read performance you should limit Spark to one executor per machine and give that executor all of the machine's resources.
Set the following Spark configuration parameters:
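For example, on machines with 16 cores and 64GB of RAM, the configuration might look like the following (a minimal sketch using the Scala `SparkConf` API; the core and memory values are illustrative and should match your hardware):

```scala
import org.apache.spark.SparkConf

// One executor per machine, owning all of its resources.
// Values assume hypothetical 16-core / 64GB machines; adjust for your cluster.
val conf = new SparkConf()
  .set("spark.executor.cores", "16")   // all cores on the machine
  .set("spark.task.cpus", "16")        // equal to executor cores: one task per executor
  .set("spark.executor.memory", "51g") // ~80% of 64GB, leaving overhead for the host
```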
It is important to set `spark.task.cpus` equal to `spark.executor.cores`. This prevents Spark from scheduling more than one read partition on each machine. The executor memory is set to roughly 80% of the available memory to allow for overhead on the host itself.
If you use Yarn, the above configuration parameters are likely not sufficient; you will also need to configure Yarn similarly.
There are two main TileDB driver options to tweak for optimizing reads: `partition_count` and `read_buffer_size`. The `partition_count` should be set to the number of executors, which is the number of machines in the cluster. The `read_buffer_size` should be set to at least 104857600 (100MB); larger read buffers are critical for reducing the number of incomplete queries. The maximum size of the read buffers is limited by the available RAM and, if you use Yarn, also by `spark.yarn.executor.memoryOverhead`, because TileDB read/write buffers are stored off-heap.
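For example, a read on a hypothetical four-machine cluster might look like the following sketch; the `io.tiledb.spark` format name is TileDB-Spark's data source, and the array URI is illustrative:

```scala
// Read a TileDB array with one partition per machine and 100MB read buffers.
val df = spark.read
  .format("io.tiledb.spark")
  .option("uri", "s3://my-bucket/my-array") // hypothetical array URI
  .option("partition_count", "4")           // one partition per executor/machine
  .option("read_buffer_size", "104857600")  // 100MB, the recommended minimum
  .load()
```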
Some applications rely on multiple Spark tasks for parallelism, constraining each task to run on a single thread. This is typical of PySpark and SparkR applications. Below we describe how to configure Spark and the TileDB data source for optimized performance in this case.
Set the following Spark configuration parameters:
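For example (a minimal sketch; the core count and overhead value are illustrative, and `spark.executor.memoryOverhead` availability depends on your Spark version):

```scala
import org.apache.spark.SparkConf

// Many single-threaded tasks per executor.
// Values assume hypothetical 16-core machines; adjust for your cluster.
val conf = new SparkConf()
  .set("spark.executor.cores", "16")          // all cores on the machine
  .set("spark.task.cpus", "1")                // one core per task: many tasks per executor
  .set("spark.executor.memoryOverhead", "4g") // off-heap headroom for TileDB buffers
```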
If you use Yarn, the above configuration parameters are likely not sufficient; you will also need to configure Yarn similarly.
The `read_buffer_size` should be set to the largest value possible given the executor's available memory. TileDB typically has a memory overhead of 3x, so `3 * read_buffer_size` should be less than Spark's maximum off-heap memory; if you use Yarn, this limit is defined by `spark.yarn.executor.memoryOverhead`. For example, a 3GB memory overhead allows a `read_buffer_size` of up to 1GB. The default value of 10MB is usually sufficient.
The `partition_count` should be set to `data size being read / read_buffer_size`. If the data size is not known, set the partition count to the number of executors. This might lead to over-partitioning, so you may want to try different values until you find an optimal partitioning for your dataset.
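For example (a back-of-the-envelope sketch with illustrative sizes):

```scala
// Sizing example: ~10GB of data read with 100MB read buffers.
val dataSizeBytes  = 10L * 1024 * 1024 * 1024 // data size being read (illustrative)
val readBufferSize = 104857600L               // read_buffer_size (100MB)
val partitionCount = math.ceil(dataSizeBytes.toDouble / readBufferSize).toInt // => 103
```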
Finally, it is important to set several of the TileDB parallelism configuration parameters via the Spark `option()` dataframe commands upon reading:
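The following sketch assumes TileDB configuration parameters can be passed to the data source by prefixing them with `tiledb.`, and uses libtiledb's `sm.compute_concurrency_level` and `sm.io_concurrency_level` parameters; verify the parameter names against your TileDB version:

```scala
// Constrain TileDB's internal parallelism to match single-threaded Spark tasks.
val df = spark.read
  .format("io.tiledb.spark")
  .option("uri", "s3://my-bucket/my-array")           // hypothetical array URI
  .option("partition_count", "100")
  .option("read_buffer_size", "10485760")             // 10MB default
  .option("tiledb.sm.compute_concurrency_level", "1") // one compute thread per task
  .option("tiledb.sm.io_concurrency_level", "1")      // one IO thread per task
  .load()
```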