Quickstart
Apache Spark is a widely used analytics engine for large-scale data processing. It lets users create distributed arrays and dataframes, use machine learning libraries, run SQL queries, and more. TileDB-Spark is TileDB's datasource driver for Spark. It lets you create distributed Spark dataframes from TileDB arrays and so process TileDB data with familiar tooling at scale, with minimal code changes.

TileDB-Spark Installation

Install TileDB Spark Driver

Prebuilt Jar

TileDB offers a prebuilt uber jar that contains all dependencies. It can be used on most Spark clusters to enable the TileDB-Spark datasource driver. The latest jars can be downloaded from GitHub.

Build From Source

Compiling TileDB-Spark from source is simple:
git clone git@github.com:TileDB-Inc/TileDB-Spark.git
cd TileDB-Spark
./gradlew assemble
./gradlew shadowJar

This creates a jar file at /path/to/TileDB-Spark/build/libs/tiledb-spark-<version>.jar.
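Because the jar name includes a version, it can help to locate the built file programmatically rather than hard-coding the name. The following is a minimal sketch, assuming the build/libs layout and tiledb-spark-<version>.jar naming described above; the helper name find_tiledb_spark_jar is hypothetical and not part of TileDB-Spark.

```python
from pathlib import Path
from typing import Optional


def find_tiledb_spark_jar(repo_dir: str) -> Optional[Path]:
    """Return the most recently modified tiledb-spark jar under build/libs.

    Hypothetical helper: assumes the Gradle build wrote its output to
    <repo_dir>/build/libs/tiledb-spark-<version>.jar.
    Returns None if the directory or a matching jar does not exist.
    """
    libs_dir = Path(repo_dir) / "build" / "libs"
    if not libs_dir.is_dir():
        return None
    # Sort matching jars by modification time, oldest first.
    candidates = sorted(
        libs_dir.glob("tiledb-spark-*.jar"),
        key=lambda p: p.stat().st_mtime,
    )
    return candidates[-1] if candidates else None


if __name__ == "__main__":
    jar = find_tiledb_spark_jar("/path/to/TileDB-Spark")
    print(jar if jar is not None else "no jar found; run the Gradle build first")
```

Picking the newest file by modification time is one reasonable choice when several versions have been built in the same checkout; selecting by version string would be an alternative.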

Running Spark

To launch a Spark shell with TileDB-Spark enabled, point Spark to the jar you obtained:

Scala
spark-shell --jars /path/to/tiledb-spark-<version>.jar

SparkR
sparkR --jars /path/to/tiledb-spark-<version>.jar

PySpark
pyspark --jars /path/to/tiledb-spark-<version>.jar
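The three launch commands differ only in the launcher name, so a script that starts the shell can build the command from a small lookup table. This is a sketch under that assumption; the SPARK_SHELLS mapping, its keys, and the launch_command helper are illustrative names, not part of Spark or TileDB-Spark.

```python
import shlex
from typing import List

# Launchers named in the commands above, keyed by hypothetical language labels.
SPARK_SHELLS = {"scala": "spark-shell", "r": "sparkR", "python": "pyspark"}


def launch_command(language: str, jar_path: str) -> List[str]:
    """Build the argv list that starts a Spark shell with the given jar."""
    try:
        launcher = SPARK_SHELLS[language]
    except KeyError:
        raise ValueError(f"unsupported language: {language!r}") from None
    # --jars is the same option used in the commands above.
    return [launcher, "--jars", jar_path]


if __name__ == "__main__":
    cmd = launch_command("python", "/path/to/tiledb-spark-<version>.jar")
    # Print a shell-quoted form of the command for copy and paste.
    print(shlex.join(cmd))
```

The returned list can be passed directly to subprocess.run to start the shell, which avoids quoting problems when the jar path contains spaces.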