Build from Source

By default the high-level APIs (i.e., Python and Spark) will also build TileDB-VCF itself automatically, bundling the resulting shared libraries into the final packages. However, the APIs can also be built separately.

Dependencies

Please ensure the following dependencies are installed on your system before building TileDB-VCF:
  • CMake >= 3.3
  • C++ compiler supporting C++17 (such as gcc 7.3 or newer)
  • git
  • HTSlib 1.8
If HTSlib is not installed on your system, TileDB-VCF will download and build a local copy automatically. However, in order for this to work the following dependencies of HTSlib must be installed beforehand:
  • macOS: brew install autoconf xz
  • Ubuntu/Debian: sudo apt install autoconf automake zlib1g-dev libbz2-dev liblzma-dev
  • Windows: a pre-compiled HTSlib library will be downloaded automatically, so no extra dependencies are required.
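If you are unsure whether a system-wide HTSlib is visible at all, you can probe for the shared library from Python. This is a convenience check only (not part of TileDB-VCF), and a missing library is not an error, since the build will fetch its own copy:

```python
# Probe for a system-wide HTSlib shared library.
# If nothing is found, TileDB-VCF simply downloads and builds its
# own copy, so a missing library here is not an error.
from ctypes.util import find_library

hts = find_library("hts")  # e.g. "libhts.so.3" on Linux, or None
if hts:
    print(f"System HTSlib found: {hts}")
else:
    print("No system HTSlib found; TileDB-VCF will build its own")
```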

What would you like to build?

The sections below cover the CLI, Python, and Spark builds.

CLI

To install just the TileDB-VCF library and CLI, execute:
# clone the repo
git clone https://github.com/TileDB-Inc/TileDB-VCF.git
cd TileDB-VCF/libtiledbvcf
# build
mkdir build && cd build
cmake .. && make -j8
make install-libtiledbvcf
By default this will build and install TileDB-VCF into TileDB-VCF/dist.
You can verify that the installation succeeded by checking the version via the CLI:
cd ../..
dist/bin/tiledbvcf --version
The high-level API builds perform essentially these same steps, including installing the library and the CLI binary into TileDB-VCF/dist/bin. So, if you build one of the APIs, the steps above will be executed automatically for you.

Python

Installing the Python module also requires conda; see the conda documentation for instructions on how to install miniconda3.
# Clone the TileDB-VCF repo and change to the Python API directory
git clone https://github.com/TileDB-Inc/TileDB-VCF.git
cd TileDB-VCF/apis/python
# Set up the conda environment for building.
# This will download and install the required Python dependencies
# in a new conda environment called tiledbvcf-py
conda env create -f conda-env.yml
conda activate tiledbvcf-py
# Run the installation script
python setup.py install
The last step may take up to 10 minutes, depending on your system, as it automatically builds the TileDB-VCF library, the Python bindings, and the tiledbvcf Python package. The package and bundled native libraries get installed into the active conda environment.
You can optionally run the Python unit tests as follows:
python setup.py pytest
To test that the package was installed correctly, run:
python -c "import tiledbvcf; print(tiledbvcf.version)"
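Once the package imports cleanly, a minimal read looks like the following sketch. Here "my-dataset" is a placeholder URI and the attrs/regions values are illustrative; the guard around the import simply lets the snippet run even in environments where the package is absent:

```python
# Minimal tiledbvcf read sketch; "my-dataset" is a placeholder URI
# and the attribute/region values are illustrative only.
import importlib.util

if importlib.util.find_spec("tiledbvcf") is None:
    print("tiledbvcf is not installed in this environment")
else:
    import tiledbvcf

    # Open an existing dataset read-only and pull a few attributes
    # for one genomic region; read() returns a pandas DataFrame.
    ds = tiledbvcf.Dataset("my-dataset", mode="r")
    df = ds.read(
        attrs=["sample_name", "pos_start", "pos_end"],
        regions=["chr1:1-1000000"],
    )
    print(df.head())
```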

Dask Integration

The TileDB-VCF Python API contains several points of integration with Dask for parallel computing on TileDB-VCF arrays. Describing how to set up a Dask cluster is outside the scope of this guide. However, to quickly test on a local machine, run:
import tiledbvcf
import dask
ds = tiledbvcf.Dataset("s3://my-bucket/my-dataset/")
dask_df = ds.read_dask(
    attrs=["sample_name", "pos_start", "pos_end"],
    bed_file="s3://synthetic-gvcfs/bedfiles/sorted10000.bed",
    region_partitions=8,
)
df = dask_df.compute()
df.head()
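The region_partitions argument controls how many chunks the BED regions are split into, one per Dask task. The exact assignment is internal to TileDB-VCF, but conceptually it behaves like dividing the region list into roughly equal contiguous slices, as in this illustrative sketch:

```python
# Illustrative only: split a list of regions into N roughly equal,
# contiguous partitions, the way region_partitions divides work
# across Dask tasks (the library's actual assignment is internal).
def partition_regions(regions, num_partitions):
    base, extra = divmod(len(regions), num_partitions)
    parts, start = [], 0
    for i in range(num_partitions):
        size = base + (1 if i < extra else 0)  # spread the remainder
        parts.append(regions[start:start + size])
        start += size
    return parts

regions = [f"chr1:{i * 1000 + 1}-{(i + 1) * 1000}" for i in range(10)]
for i, part in enumerate(partition_regions(regions, 4)):
    print(f"partition {i}: {len(part)} regions")
```

With 10 regions and 4 partitions, the slices hold 3, 3, 2, and 2 regions; larger partition counts trade per-task work for scheduling overhead.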
Spark

The Spark API is implemented as a DataSourceV2 and requires Spark 2.4.
# Clone the repo and change to the Spark API directory
git clone https://github.com/TileDB-Inc/TileDB-VCF.git
cd TileDB-VCF/apis/spark
# Run the gradle build script
./gradlew assemble
The last step may take up to 10 minutes, depending on your system, as it automatically builds the TileDB-VCF library and the Spark API.
You can optionally run the Spark unit tests as follows:
./gradlew test
To build an uber .jar, which includes all dependencies, run:
./gradlew shadowJar
This will place the .jar file in the build/libs/ directory. The .jar file also contains the bundled native libraries.

Set Up a Spark Cluster

Spark cluster management is outside the scope of this guide. However, you can learn more about launching an EMR Spark cluster here.
Creating a cluster may take 10–15 minutes. When it is ready, find the public DNS of the master node (it will be shown on the cluster Summary tab). SSH into the master node using username hadoop, and then perform the following setup steps:
sudo yum install -y epel-release gcc gcc-c++ git automake zlib-devel openssl-devel bzip2-devel libcurl-devel
wget https://cmake.org/files/v3.12/cmake-3.12.3-Linux-x86_64.sh
sudo sh cmake-3.12.3-Linux-x86_64.sh --skip-license --prefix=/usr/local/
Next, follow the steps above for building the Spark API with Gradle. Make sure to run the ./gradlew jar step to produce the TileDB-VCF jar.
Then launch the Spark shell specifying the .jar and any other desired Spark configuration, e.g.:
spark-shell --jars build/libs/TileDB-VCF-Spark-0.1.0-SNAPSHOT.jar --driver-memory 16g --executor-memory 16g