Spark is a widely used analytics engine for large-scale data processing. It lets users create distributed arrays and dataframes, use machine learning libraries, run SQL queries, and more. TileDB-Spark is TileDB's datasource driver for Spark. It lets you create distributed Spark dataframes from TileDB arrays and thus process TileDB data with familiar tooling at scale and with minimal code changes.
The TileDB-Java package from Maven Central includes a prebuilt libtiledb for Linux, so if you are running Linux you can skip this section. On any other platform, such as macOS, you first need to build and install TileDB-Java locally (see TileDB-Java installation for more details):
First clone the TileDB-Java repo:
$ git clone https://github.com/TileDB-Inc/TileDB-Java
Next, build and install it to the local Maven repository:
$ cd TileDB-Java
$ ./gradlew assemble
$ ./gradlew publishToMavenLocal
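The platform check described above can be scripted so that a setup script only attempts the local TileDB-Java build when it is needed. The following is a minimal sketch, not part of TileDB's tooling: the function name needs_local_tiledb_java is hypothetical, and it assumes that `uname -s` reports `Linux` on the platforms covered by the prebuilt libtiledb.

```shell
#!/bin/sh
# Hypothetical helper (not part of TileDB): succeeds when the platform is
# anything other than Linux, i.e. when TileDB-Java must be built locally.
# An OS name can be passed explicitly; otherwise `uname -s` is used.
needs_local_tiledb_java() {
    os_name="${1:-$(uname -s)}"
    [ "$os_name" != "Linux" ]
}

if needs_local_tiledb_java; then
    echo "Non-Linux platform: build and install TileDB-Java locally first."
else
    echo "Linux detected: the prebuilt libtiledb from Maven Central can be used."
fi
```

Passing the OS name as an argument, for example `needs_local_tiledb_java Darwin`, makes the check easy to exercise without changing machines.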
Compiling TileDB-Spark from source is simple:
git clone git@github.com:TileDB-Inc/TileDB-Spark.git
cd TileDB-Spark
./gradlew assemble
./gradlew shadowJar
This creates the TileDB-Spark jar file.
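To pick up the jar produced by the build without typing its versioned name, a small lookup helper can be used. This is a sketch under stated assumptions: it assumes the jar lands in Gradle's default output directory, build/libs, and that its name starts with tiledb-spark-. The function name find_tiledb_spark_jar is hypothetical, and the exact file name depends on the version and build configuration.

```shell
#!/bin/sh
# Hypothetical helper (not part of TileDB-Spark): print the most recently
# modified tiledb-spark jar under <repo>/build/libs, or nothing if none exists.
# Assumption: the build uses Gradle's default output directory, build/libs.
find_tiledb_spark_jar() {
    repo_dir="${1:-.}"
    # ls -t sorts by modification time, newest first; errors are silenced
    # so that a missing directory simply yields empty output.
    ls -t "$repo_dir"/build/libs/tiledb-spark-*.jar 2>/dev/null | head -n 1
}

jar_path="$(find_tiledb_spark_jar .)"
if [ -n "$jar_path" ]; then
    echo "Found jar: $jar_path"
else
    echo "No tiledb-spark jar found here; run ./gradlew shadowJar first."
fi
```

Selecting the newest match is a design choice for the case where several jars sit side by side in the output directory.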
TileDB offers a prebuilt uber jar that contains all dependencies. This can be used on most Spark clusters to enable the TileDB-Spark datasource driver.
The latest jars can be downloaded from GitHub.
There is a known issue with EMR clusters and the prebuilt jars: S3 access is broken because of a problem with libcurl. If you use EMR, you must either build from source or follow our EMR instructions.
To launch a Spark shell with TileDB-Spark enabled, point Spark to the jar you have obtained. The same --jars option works for the Scala, R, and Python shells:
spark-shell --jars /path/to/tiledb-spark-<version>.jar
sparkR --jars /path/to/tiledb-spark-<version>.jar
pyspark --jars /path/to/tiledb-spark-<version>.jar
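Because the three launch commands above differ only in the shell name, a thin wrapper can validate its inputs before starting Spark. The sketch below is illustrative only: the function name tiledb_spark_cmd is hypothetical, and it prints the command rather than running it, so the result can be inspected first.

```shell
#!/bin/sh
# Hypothetical wrapper (not part of TileDB-Spark): print the launch command
# for a given Spark shell after checking that the jar file exists.
# Usage: tiledb_spark_cmd <spark-shell|sparkR|pyspark> <path-to-jar>
tiledb_spark_cmd() {
    shell_name="$1"
    jar_path="$2"
    # Accept only the three shells shown in the commands above.
    case "$shell_name" in
        spark-shell|sparkR|pyspark) ;;
        *) echo "unknown shell: $shell_name" >&2; return 1 ;;
    esac
    # Fail early if the jar is missing, before any Spark shell is launched.
    if [ ! -f "$jar_path" ]; then
        echo "jar not found: $jar_path" >&2
        return 1
    fi
    printf '%s --jars %s\n' "$shell_name" "$jar_path"
}

# Example (substitute the real jar path):
#   tiledb_spark_cmd pyspark /path/to/tiledb-spark-<version>.jar
```

Checking the jar path up front turns a typo in the path into an immediate, explicit error.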