Partitioning

The TileDB-Spark data source lets you specify a partition count when reading a TileDB array into a distributed Spark dataframe, via the partition_count option. Examples are shown below. The reader creates evenly sized partitions across all array dimensions, based on the volume of the subarray, in order to balance the computational load across the workers.

Scala

```scala
val df = spark.read
  .format("io.tiledb.spark")
  .option("uri", "s3://my_bucket/my_array")
  .option("partition_count", 100)
  .load()
```

PySpark

```python
df = (
    spark.read
        .format("io.tiledb.spark")
        .option("uri", "s3://my_bucket/my_array")
        .option("partition_count", 100)
        .load()
)
```

SparkR

```r
df <- read.df(
  uri = "s3://my_bucket/my_array",
  source = "io.tiledb.spark",
  partition_count = 100)
```
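To make the "evenly sized partitions" idea concrete, the sketch below (plain Python, not TileDB-Spark internals; the function name `split_domain` is hypothetical) splits a single dimension domain into `partition_count` contiguous, roughly equal ranges, which is the spirit of how a subarray's volume is divided across workers.

```python
def split_domain(lo, hi, partition_count):
    """Split the inclusive integer domain [lo, hi] into `partition_count`
    contiguous ranges whose sizes differ by at most one element."""
    total = hi - lo + 1
    base, extra = divmod(total, partition_count)
    ranges, start = [], lo
    for i in range(partition_count):
        # The first `extra` partitions absorb the remainder, one element each.
        size = base + (1 if i < extra else 0)
        ranges.append((start, start + size - 1))
        start += size
    return ranges

print(split_domain(1, 100, 4))  # → [(1, 25), (26, 50), (51, 75), (76, 100)]
```

In the multi-dimensional case the same idea applies per dimension, so the number of resulting subarray partitions balances read work across Spark tasks.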
