The connector uses `libtiledb` for parallel IO, compression and encryption. As such, for optimized read performance you should limit Spark to one executor per machine and give that single executor all the resources of the machine.
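For instance, a one-executor-per-machine layout on a four-machine Yarn cluster could be requested as follows. All numbers, and the job file name, are illustrative rather than taken from this document:

```shell
# Hypothetical example: 4 machines, one executor each, with each executor
# given (roughly) the whole machine's cores and memory. Leave headroom for
# the OS and for the off-heap overhead discussed below.
spark-submit \
  --num-executors 4 \
  --executor-cores 16 \
  --executor-memory 100g \
  --conf spark.yarn.executor.memoryOverhead=8g \
  my_job.py
```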
`partition_count` should be set to the number of executors, which is the number of machines in the cluster.
`read_buffer_size` should be set to at least 104857600 bytes (100MB). Larger read buffers are critical to reducing the number of incomplete queries. The maximum read buffer size is limited by the available RAM. If you use Yarn, it is also constrained by `spark.yarn.executor.memoryOverhead`, since TileDB read/write buffers are stored off-heap.
`write_buffer_size` should be set to the largest value possible given the executor's available memory. TileDB typically has a memory overhead of 3x, and therefore `3 * write_buffer_size` should be less than Spark's maximum off-heap memory. If you use Yarn, this value is defined in `spark.yarn.executor.memoryOverhead`. A default value of 10MB is usually sufficient.
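As a back-of-the-envelope check, the 3x rule above can be sketched as follows. The concrete off-heap allowance used here is a hypothetical number, not one from this document:

```python
# Sketch of the 3x memory-overhead rule: the largest safe buffer size is
# one third of the off-heap allowance, so that 3 * buffer fits off-heap.
MB = 1024 * 1024

def max_buffer_size(off_heap_bytes: int, overhead_factor: int = 3) -> int:
    """Largest buffer size whose overhead_factor multiple fits off-heap."""
    return off_heap_bytes // overhead_factor

# Hypothetical: spark.yarn.executor.memoryOverhead = 3g
memory_overhead = 3072 * MB
buffer = max_buffer_size(memory_overhead)
print(buffer // MB)  # 1024, i.e. a 1GB buffer satisfies 3 * buffer <= off-heap
```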
`partition_count` should be set to `data size being read / read_buffer_size`. If the data size is not known, set the partition count to the number of executors. This might lead to over-partitioning, so you may want to try different values until you find an optimal setting for your dataset.
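The sizing rule above can be sketched as a small calculation. The dataset size used here is a hypothetical number:

```python
# partition_count = data size being read / read_buffer_size, rounded up
# so that the final partial buffer still gets its own partition.
import math

MB = 1024 * 1024

def partition_count(data_size_bytes: int, read_buffer_size: int) -> int:
    return math.ceil(data_size_bytes / read_buffer_size)

read_buffer_size = 100 * MB    # the 100MB minimum suggested above
data_size = 10_000 * MB        # hypothetical: ~10GB being read
print(partition_count(data_size, read_buffer_size))  # 100
```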
These parameters can be set with `option()` dataframe commands upon reading:
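A minimal PySpark sketch of passing these options. The format name `io.tiledb.spark` and the array URI are assumptions for illustration, not details stated in this section:

```python
# Hypothetical example of wiring the tuning parameters into a DataFrame read.
options = {
    "uri": "s3://my-bucket/my-array",    # hypothetical array URI
    "read_buffer_size": str(104857600),  # 100MB, the minimum suggested above
    "partition_count": str(4),           # e.g. the number of executors
}

# With a live SparkSession, the read would look like:
#   df = (spark.read.format("io.tiledb.spark")   # assumed format name
#         .options(**options)
#         .load())
print(options["read_buffer_size"])  # 104857600
```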