Partitioning

The TileDB-Spark data source lets you specify a partition count via the partition_count option when reading a TileDB array into a distributed Spark DataFrame. The data source creates evenly sized partitions across all array dimensions, based on the volume of the subarray, so that the computational load is balanced across the workers. An example is shown below.

// Read the TileDB array into a DataFrame split into 100 partitions
val df = spark.read
              .format("io.tiledb.spark")
              .option("uri", "s3://my_bucket/my_array")
              .option("partition_count", 100)
              .load()
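
To check the resulting layout, you can inspect the DataFrame's partition count with Spark's standard getNumPartitions. A minimal sketch; it assumes (this is not stated above) that the effective count may differ from the requested value, depending on how the subarray splits:

// Inspect how many partitions Spark created for this read.
// Assumption: the effective count may not exactly match the
// requested partition_count of 100.
val numPartitions = df.rdd.getNumPartitions
println(s"Partitions: $numPartitions")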
