Spark

The Spark API is implemented as a DataSourceV2 and requires Spark 2.4.

Build From Source

First, clone the TileDB-VCF repo, change to the Spark API directory, and run the Gradle build script:

git clone git@github.com:TileDB-Inc/TileDB-VCF.git
cd TileDB-VCF/apis/spark
./gradlew assemble

This step may take up to 10 minutes, depending on your system, as it automatically builds the TileDB-VCF library and the Spark API.

You can optionally run the Spark unit tests as follows:

./gradlew test

To build a .jar, run:

./gradlew jar

This will place the .jar file in the build/libs/ directory. The .jar file also contains the bundled native libraries.
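One way to confirm that the native libraries were bundled is to list the jar's entries and filter for shared-library file extensions. The following is a minimal sketch, assuming the JDK's `jar` tool is on the PATH. The `native_libs` helper is hypothetical, and the exact library names and paths inside the jar are not specified in this guide.

```shell
# Hypothetical helper: filter a jar listing (one entry per line on stdin)
# down to entries that look like native shared libraries.
native_libs() {
  grep -E '\.(so|dylib|dll)(\.[0-9]+)*$'
}

# Inspect every jar produced by ./gradlew jar; skip quietly if none exist yet.
for jar_file in build/libs/*.jar; do
  [ -e "$jar_file" ] || continue
  echo "== $jar_file"
  jar tf "$jar_file" | native_libs || echo "no native libraries found"
done
```

If the loop prints nothing, no jar has been built yet in build/libs/.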

Set Up a Spark Cluster

Spark cluster management in general is well outside the scope of this guide. See these instructions for launching an EMR Spark cluster.

Creating the cluster will take 10-15 minutes. When it is ready, find the public DNS of the master node (it will be shown on the cluster Summary tab). SSH into the master node using username hadoop, and then perform the following setup steps:

sudo yum install -y epel-release gcc gcc-c++ git automake zlib-devel openssl-devel bzip2-devel libcurl-devel
wget https://cmake.org/files/v3.12/cmake-3.12.3-Linux-x86_64.sh
sudo sh cmake-3.12.3-Linux-x86_64.sh --skip-license --prefix=/usr/local/

Next, follow the steps above for building the Spark API using Gradle. Make sure to run the ./gradlew jar step to produce the TileDB-VCF jar.

Then launch the Spark shell, specifying the .jar and any other desired Spark configuration, for example:

spark-shell --jars build/libs/TileDB-VCF-Spark-0.1.0-SNAPSHOT.jar --driver-memory 16g --executor-memory 16g
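Because the jar file name includes a version string, it can change between builds. The sketch below locates the jar in build/libs/ instead of hard-coding its name, then prints the spark-shell command so it can be reviewed before running. The `find_jar` helper is hypothetical, and the memory settings are simply the ones from the example above.

```shell
# Hypothetical helper: print the first .jar (in glob order, i.e. alphabetical)
# found in the given directory; return non-zero if there is none.
find_jar() {
  search_dir="${1:-build/libs}"
  for candidate in "$search_dir"/*.jar; do
    if [ -e "$candidate" ]; then
      printf '%s\n' "$candidate"
      return 0
    fi
  done
  return 1
}

# Build the launch command from whatever jar the build produced.
if jar_path="$(find_jar build/libs)"; then
  echo "spark-shell --jars $jar_path --driver-memory 16g --executor-memory 16g"
else
  echo "No jar found in build/libs; run ./gradlew jar first." >&2
fi
```

If build/libs/ contains more than one jar, this picks the alphabetically first one, so remove stale jars or pass the intended path explicitly.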