Build from Source

By default the high-level APIs (i.e., Python and Spark) will also build TileDB-VCF itself automatically, bundling the resulting shared libraries into the final packages. However, the APIs can also be built separately.

Dependencies

Please ensure the following dependencies are installed on your system before building TileDB-VCF:
  • CMake >= 3.3
  • C++ compiler supporting C++17 (such as gcc 7.3 or newer)
  • git
  • HTSlib 1.8
If HTSlib is not installed on your system, TileDB-VCF will download and build a local copy automatically. However, in order for this to work the following dependencies of HTSlib must be installed beforehand:
  • macOS: brew install autoconf xz
  • Ubuntu/Debian: sudo apt install autoconf automake zlib1g-dev libbz2-dev liblzma-dev
  • Windows: a pre-compiled HTSlib library will be downloaded automatically, so no extra dependencies are required.
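If you are unsure whether a system-wide HTSlib is visible at all, you can probe for the shared library from Python. This is a convenience check only (not part of TileDB-VCF), and a missing library is not an error, since the build will fetch its own copy:

```python
# Probe for a system-wide HTSlib shared library.
# If nothing is found, TileDB-VCF simply downloads and builds its
# own copy, so a missing library here is not an error.
from ctypes.util import find_library

hts = find_library("hts")  # e.g. "libhts.so.3" on Linux, or None
if hts:
    print(f"System HTSlib found: {hts}")
else:
    print("No system HTSlib found; TileDB-VCF will build its own")
```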

What would you like to build?

The sections below cover the CLI, Python, and Spark builds.

CLI

To install just the TileDB-VCF library and CLI, execute:
# clone the repo
git clone https://github.com/TileDB-Inc/TileDB-VCF.git
cd TileDB-VCF/libtiledbvcf
# build
mkdir build && cd build
cmake .. && make -j8
make install-libtiledbvcf
By default this will build and install TileDB-VCF into TileDB-VCF/dist.
You can verify that the installation succeeded by checking the version via the CLI:
cd ../..
dist/bin/tiledbvcf --version
The high-level API builds perform essentially these same steps, including installing the library and the CLI binary into TileDB-VCF/dist/bin. So, if you build one of the APIs, the steps above will be executed automatically for you.

Python

Installing the Python module also requires conda; see the conda documentation for instructions on how to install miniconda3.
# Clone the TileDB-VCF repo and change to the Python API directory
git clone https://github.com/TileDB-Inc/TileDB-VCF.git
cd TileDB-VCF/apis/python
# Set up the conda environment for building.
# This will download and install the required Python dependencies
# in a new conda environment called tiledbvcf-py
conda env create -f conda-env.yml
conda activate tiledbvcf-py
# Run the installation script
python setup.py install
The last step may take up to 10 minutes, depending on your system, as it automatically builds the TileDB-VCF library, the Python bindings, and the tiledbvcf Python package. The package and bundled native libraries get installed into the active conda environment.
You can optionally run the Python unit tests as follows:
python setup.py pytest
To test that the package was installed correctly, run:
python -c "import tiledbvcf; print(tiledbvcf.version)"
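Once the package imports cleanly, a minimal read looks like the following sketch. Here "my-dataset" is a placeholder URI and the attrs/regions values are illustrative; the guard around the import simply lets the snippet run even in environments where the package is absent:

```python
# Minimal tiledbvcf read sketch; "my-dataset" is a placeholder URI
# and the attribute/region values are illustrative only.
import importlib.util

if importlib.util.find_spec("tiledbvcf") is None:
    print("tiledbvcf is not installed in this environment")
else:
    import tiledbvcf

    # Open an existing dataset read-only and pull a few attributes
    # for one genomic region; read() returns a pandas DataFrame.
    ds = tiledbvcf.Dataset("my-dataset", mode="r")
    df = ds.read(
        attrs=["sample_name", "pos_start", "pos_end"],
        regions=["chr1:1-1000000"],
    )
    print(df.head())
```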

Dask Integration

The TileDB-VCF Python API contains several points of integration with Dask for parallel computing on TileDB-VCF arrays. Describing how to set up a Dask cluster is outside the scope of this guide. However, to quickly test on a local machine, run:
import tiledbvcf
import dask
ds = tiledbvcf.Dataset("s3://my-bucket/my-dataset/")
dask_df = ds.read_dask(
    attrs=["sample_name", "pos_start", "pos_end"],
    bed_file="s3://synthetic-gvcfs/bedfiles/sorted10000.bed",
    region_partitions=8,
)
df = dask_df.compute()
df.head()
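The region_partitions argument controls how many chunks the BED regions are split into, one per Dask task. The exact assignment is internal to TileDB-VCF, but conceptually it behaves like dividing the region list into roughly equal contiguous slices, as in this illustrative sketch:

```python
# Illustrative only: split a list of regions into N roughly equal,
# contiguous partitions, the way region_partitions divides work
# across Dask tasks (the library's actual assignment is internal).
def partition_regions(regions, num_partitions):
    base, extra = divmod(len(regions), num_partitions)
    parts, start = [], 0
    for i in range(num_partitions):
        size = base + (1 if i < extra else 0)  # spread the remainder
        parts.append(regions[start:start + size])
        start += size
    return parts

regions = [f"chr1:{i * 1000 + 1}-{(i + 1) * 1000}" for i in range(10)]
for i, part in enumerate(partition_regions(regions, 4)):
    print(f"partition {i}: {len(part)} regions")
```

With 10 regions and 4 partitions, the slices hold 3, 3, 2, and 2 regions; larger partition counts trade per-task work for scheduling overhead.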
Spark

The Spark API is implemented as a DataSourceV2 and requires Spark 2.4.
# Clone the repo and change to the Spark API directory
git clone https://github.com/TileDB-Inc/TileDB-VCF.git
cd TileDB-VCF/apis/spark
# Run the gradle build script
./gradlew assemble
The last step may take up to 10 minutes, depending on your system, as it automatically builds the TileDB-VCF library and the Spark API.
You can optionally run the Spark unit tests as follows:
./gradlew test
To build an uber .jar, which includes all dependencies, run:
./gradlew shadowJar
This will place the .jar file in the build/libs/ directory. The .jar file also contains the bundled native libraries.

Set Up a Spark Cluster

Spark cluster management is outside the scope of this guide. However, you can learn more about launching an EMR Spark cluster here.
Creating a cluster may take 10–15 minutes. When it is ready, find the public DNS of the master node (it will be shown on the cluster Summary tab). SSH into the master node using username hadoop, and then perform the following setup steps:
sudo yum install -y epel-release gcc gcc-c++ git automake zlib-devel openssl-devel bzip2-devel libcurl-devel
wget https://cmake.org/files/v3.12/cmake-3.12.3-Linux-x86_64.sh
sudo sh cmake-3.12.3-Linux-x86_64.sh --skip-license --prefix=/usr/local/
Next, follow the steps above for building the Spark API with Gradle. Make sure to run the ./gradlew jar step to produce the TileDB-VCF jar.
Then launch the Spark shell specifying the .jar and any other desired Spark configuration, e.g.:
spark-shell --jars build/libs/TileDB-VCF-Spark-0.1.0-SNAPSHOT.jar --driver-memory 16g --executor-memory 16g