The Spark API is implemented as a DataSourceV2 and requires Spark 2.4.
First clone the TileDB-VCF repo, change to the Spark API directory and run the gradle build script:
git clone https://github.com/TileDB-Inc/TileDB-VCF.git
cd apis/spark
./gradlew assemble
This step may take up to 10 minutes, depending on your system, as it automatically builds the TileDB-VCF library and the Spark API.
You can optionally run the Spark unit tests as follows:
./gradlew test
To build a .jar, run:
./gradlew jar
This will place the .jar file in the build/libs/ directory. The .jar file also contains the bundled native libraries.
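Because a .jar is an ordinary zip archive, you can quickly confirm that the native libraries were bundled using Python's standard zipfile module. This is just a convenience sketch; the jar path in the comment is illustrative and should be replaced with the file actually produced by the build.

```python
import zipfile

def bundled_native_libs(jar_path):
    """List native-library entries (.so/.dylib/.dll) inside a jar.

    A jar file is a zip archive, so the standard zipfile module can read it
    directly without any Java tooling.
    """
    with zipfile.ZipFile(jar_path) as jar:
        return [name for name in jar.namelist()
                if name.endswith((".so", ".dylib", ".dll"))]

# Illustrative usage -- substitute the jar produced by ./gradlew jar:
# bundled_native_libs("build/libs/TileDB-VCF-Spark-0.1.0-SNAPSHOT.jar")
```

If the returned list is empty, the native TileDB-VCF library was not bundled and the Spark job will fail to load it at runtime.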
Spark cluster management in general is outside the scope of this guide; see Amazon's documentation for instructions on launching an EMR Spark cluster.
Creating the cluster will take 10-15 minutes. When it is ready, find the public DNS of the master node (it will be shown on the cluster Summary tab). SSH into the master node using username hadoop, and then perform the following setup steps:
sudo yum install -y epel-release gcc gcc-c++ git automake zlib-devel openssl-devel bzip2-devel libcurl-devel
wget https://cmake.org/files/v3.12/cmake-3.12.3-Linux-x86_64.sh
sudo sh cmake-3.12.3-Linux-x86_64.sh --skip-license --prefix=/usr/local/
Next, follow the steps above for building the Spark API using gradle. Make sure to run the ./gradlew jar step to produce the TileDB-VCF jar.
Then launch the Spark shell, specifying the .jar and any other desired Spark configuration, e.g.:
spark-shell --jars build/libs/TileDB-VCF-Spark-0.1.0-SNAPSHOT.jar --driver-memory 16g --executor-memory 16g