By default the high-level APIs (i.e., Python and Spark) will also build TileDB-VCF itself automatically, bundling the resulting shared libraries into the final packages. However, the APIs can also be built separately.
Please ensure the following dependencies are installed on your system before building TileDB-VCF:
- CMake >= 3.3
- A C++ compiler supporting C++20 (such as GCC 10 or newer)
- git
- HTSlib 1.8
If HTSlib is not installed on your system, TileDB-VCF will download and build a local copy automatically. However, for this to work the build dependencies of HTSlib must be installed beforehand; these typically include:

- autoconf and automake
- Development headers for zlib, libbz2, and liblzma
A pre-compiled HTSlib library will be downloaded automatically for Windows builds.
To install just the TileDB-VCF library and CLI, execute:
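For example (a sketch assuming the repository's standard `libtiledbvcf` layout; the install target name is an assumption and may differ on your checkout):

```bash
cd TileDB-VCF/libtiledbvcf

# Configure an out-of-source CMake build
mkdir build && cd build
cmake ..

# Build, then install the library and CLI into TileDB-VCF/dist (the default prefix)
make -j4
make install-libtiledbvcf
```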
By default this will build and install TileDB-VCF into `TileDB-VCF/dist`.
You can verify that the installation succeeded by checking the version via the CLI:
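For example (the exact version flag is an assumption; adjust the path to your install location):

```bash
# Print the installed TileDB-VCF version (assumes a --version flag)
./TileDB-VCF/dist/bin/tiledbvcf --version
```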
The high-level API build processes perform essentially the same steps as these, including installing the library and CLI binary into `TileDB-VCF/dist/bin`. So, if you build one of the APIs, the steps above are executed automatically for you.
Installing the Python module also requires `conda`. See here for instructions on how to install Miniconda3.
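A sketch of the typical steps (the `apis/python` path, environment file name, and environment name are assumptions based on the repository layout):

```bash
cd TileDB-VCF/apis/python

# Create and activate a conda environment with the build dependencies
# (the file name and environment name are assumptions)
conda env create --file conda-env.yml
conda activate tiledbvcf-py

# Build the native library and bindings, then install the Python package
python setup.py install
```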
The last step may take up to 10 minutes, depending on your system, as it automatically builds the TileDB-VCF library, the Python bindings, and the `tiledbvcf` Python package. The package and bundled native libraries are installed into the active `conda` environment.
You can optionally run the Python unit tests as follows:
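For example (assuming the tests are discoverable by pytest from the Python API directory):

```bash
cd TileDB-VCF/apis/python
python -m pytest
```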
To test that the package was installed correctly, run:
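For example (the `version` attribute is an assumption; a bare import is also a useful smoke test):

```bash
# Import the package and print its version (the attribute name is an assumption)
python -c "import tiledbvcf; print(tiledbvcf.version)"
```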
The TileDB-VCF Python API contains several points of integration with Dask for parallel computing on TileDB-VCF arrays. Setting up a Dask cluster is outside the scope of this guide; however, to quickly test on a local machine, run:
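For example (a sketch; the conda-forge package names and the `dask` test keyword are assumptions):

```bash
# Install Dask into the active conda environment
conda install -c conda-forge dask distributed

# Run only the Dask-related Python unit tests
cd TileDB-VCF/apis/python
python -m pytest -k dask
```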
The Spark API is implemented as a `DataSourceV2` and requires Spark 2.4.
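A sketch of the typical Gradle build (the `apis/spark` path is an assumption based on the repository layout):

```bash
cd TileDB-VCF/apis/spark
./gradlew assemble
```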
The last step may take up to 10 minutes, depending on your system, as it automatically builds the TileDB-VCF library and the Spark API.
You can optionally run the Spark unit tests as follows:
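For example, using the standard Gradle test task:

```bash
cd TileDB-VCF/apis/spark
./gradlew test
```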
To build an uber `.jar`, which includes all dependencies, run:
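This is the `./gradlew jar` step referenced again in the EMR section below:

```bash
cd TileDB-VCF/apis/spark
./gradlew jar
```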
This will place the `.jar` file in the `build/libs/` directory. The `.jar` file also contains the bundled native libraries.
Spark cluster management is outside the scope of this guide. However, you can learn more about launching an EMR Spark cluster here.
Creating a cluster may take 10–15 minutes. When it is ready, find the public DNS of the master node (shown on the cluster's Summary tab). SSH into the master node using the username `hadoop`, and then perform the following setup steps:
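A minimal sketch for an Amazon Linux based master node (the package list is an assumption; adjust for your AMI):

```bash
# Install build prerequisites (package names are assumptions)
sudo yum install -y git

# Fetch the TileDB-VCF source
git clone https://github.com/TileDB-Inc/TileDB-VCF.git
```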
Next, follow the steps above for building the Spark API with Gradle. Make sure to run the `./gradlew jar` step to produce the TileDB-VCF `.jar`.
Then launch the Spark shell, specifying the `.jar` and any other desired Spark configuration, e.g.:
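For example (the jar filename and memory settings are illustrative only; use the actual filename produced in `build/libs/`):

```bash
spark-shell \
  --jars TileDB-VCF/apis/spark/build/libs/TileDB-VCF-Spark.jar \
  --driver-memory 4g \
  --executor-memory 4g
```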