1 of 3

Installation

You can install TileDB-VCF in two ways:

Quick Install

You can easily install TileDB-VCF via conda or use one of our Docker images.

# Install the TileDB-VCF python package
conda install -c conda-forge -c bioconda -c tiledb tiledbvcf-py

# Verify the installation
python -c "import tiledbvcf; print(tiledbvcf.version)"

# Install the CLI and shared library
conda install -c conda-forge -c bioconda -c tiledb libtiledbvcf

# Verify the installation
tiledbvcf version

Pre-built Docker images are available on Docker Hub.

Available images

tiledbvcf-cli for the CLI
tiledbvcf-py for the Python package

Supported tags

latest: latest stable release (recommended)
dev: development version
v0.x.x for a specific version

Example usage

# CLI
docker run --rm tiledb/tiledbvcf-cli list \
  --uri s3://tiledb-inc-demo-data/tiledbvcf-arrays/v4/vcf-samples-20
  
# Python
docker run -it --rm tiledb/tiledbvcf-py

Build from Source

By default the high-level APIs (i.e., Python and Spark) will also build TileDB-VCF itself automatically, bundling the resulting shared libraries into the final packages. However, the APIs can also be built separately.

Dependencies

Please ensure the following dependencies are installed on your system before building TileDB-VCF:

CMake >= 3.3
C++ compiler supporting C++20 (such as gcc 10 or newer)
git
HTSlib 1.8

If HTSlib is not installed on your system, TileDB-VCF will download and build a local copy automatically. However, in order for this to work the following dependencies of HTSlib must be installed beforehand:

brew install autoconf xz

sudo apt install autoconf automake zlib1g-dev libbz2-dev liblzma-dev

A pre-compiled HTSlib library will be downloaded automatically for Windows builds.

What would you like to build?

To install just the TileDB-VCF library and CLI, execute:

# clone the repo
git clone https://github.com/TileDB-Inc/TileDB-VCF.git
cd TileDB-VCF/libtiledbvcf

# build
mkdir build && cd build
cmake .. && make -j8
make install-libtiledbvcf

By default this will build and install TileDB-VCF into TileDB-VCF/dist.

You can verify that the installation succeeded by checking the version via the CLI:

cd ../..
dist/bin/tiledbvcf --version

The high-level API build processes perform essentially the same steps as these, including installing the library and CLI binary intoTileDB-VCF/dist/bin. So, if you build one of the APIs, the steps above will be executed automatically for you.

Installing the Python module also requires conda. See here for instructions on how to install minconda3.

# Clone the TileDB-VCF repo and change to the Python API directory
git clone https://github.com/TileDB-Inc/TileDB-VCF.git
cd TileDB-VCF/apis/python

# Set up the conda environment for building. 
# This will download and install the required Python dependencies 
# in a new conda environment called tiledbvcf-py
conda env create -f conda-env.yml
conda activate tiledbvcf-py

# Run the installation script
python setup.py install

The last step may take up to 10 minutes, depending on your system, as it automatically builds the TileDB-VCF library, the Python bindings, and the tiledbvcf Python package. The package and bundled native libraries get installed into the active conda environment.

You can optionally run the Python unit tests as follows:

python setup.py pytest

To test that the package was installed correctly, run:

python -c "import tiledbvcf; print(tiledbvcf.version)"

Dask Integration

The TileDB-VCF Python API contains several points of integration with Dask for parallel computing on the TileDB-VCF arrays. Describing how to set up a Dask cluster is out of the scope of this guide. However, to quickly test on a local machine, run:

import tiledbvcf
import dask

ds = tiledbvcf.Dataset("s3://my-bucket/my-dataset/")
dask_df = ds.read_dask(
    attrs=["sample_name", "pos_start", "pos_end"],
    bed_file="s3://synthetic-gvcfs/bedfiles/sorted10000.bed",
    region_partitions=8)
df = dask_df.compute()
df.head()

The Spark API is implemented as a DataSourceV2and requires Spark 2.4.

# Clone the repo and change to the Spark API directory
git clone https://github.com/TileDB-Inc/TileDB-VCF.git
cd TileDB-VCF/apis/spark

# Run the gradle build script
./gradlew assemble

The last step may take up to 10 minutes, depending on your system, as it automatically builds the TileDB-VCF library and the Spark API.

You can optionally run the Spark unit tests as follows:

./gradlew test

To build an uber .jar, which includes all dependencies, run:

./gradlew shadowJar

This will place the .jar file in the build/libs/ directory. The .jar. file, also contains the bundled native libraries.

Set Up a Spark Cluster

Spark cluster management is outside the scope of this guide. However, you can can learn more about launching an EMR Spark cluster here.

Creating a cluster may take 10–15 minutes. When it is ready, find the public DNS of the master node (it will be shown on the cluster Summary tab). SSH into the master node using username hadoop, and then perform the following setup steps:

sudo yum install -y epel-release gcc gcc-c++ git automake zlib-devel openssl-devel bzip2-devel libcurl-devel 
wget https://cmake.org/files/v3.12/cmake-3.12.3-Linux-x86_64.sh
sudo sh cmake-3.12.3-Linux-x86_64.sh --skip-license --prefix=/usr/local/

Next, follow the steps above for building the Spark API using gradle. Make sure to run the ./gradlew jar step to produce the TileDB-VCF jar.

Then launch the Spark shell specifying the .jar and any other desired Spark configuration, e.g.:

spark-shell --jars build/libs/TileDB-VCF-Spark-0.1.0-SNAPSHOT.jar --driver-memory 16g --executor-memory 16g