Python

Basic Installation

In addition to the system dependencies listed here, installing the Python module also requires conda. See here for instructions on how to install miniconda3.

First clone the TileDB-VCF repo and change to the Python API directory:

git clone https://github.com/TileDB-Inc/TileDB-VCF.git
cd TileDB-VCF/apis/python

Next, set up the conda environment for building. This will download and install the required Python dependencies in a new conda environment called tiledbvcf-py:

conda env create -f conda-env.yml
conda activate tiledbvcf-py

Once you have activated the conda environment, run the installation script:

python setup.py install

This step may take up to 10 minutes, depending on your system, because it automatically builds the TileDB-VCF library, the Python bindings, and the tiledbvcf Python package. The package and its bundled native libraries are installed into the active conda environment.

You can optionally run the Python unit tests as follows:

python setup.py pytest

To test that the package was installed correctly, run:

$ python
>>> import tiledbvcf

If you do not get any warnings or errors when importing the module, the package was installed correctly.
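The same check can be scripted, which may be convenient in a CI job or a setup script. The following is a minimal sketch: `check_import` is a hypothetical helper, not part of tiledbvcf, and it assumes that a failed installation surfaces as an `ImportError`. The `__version__` lookup is also an assumption, so the code falls back to "unknown" if the attribute is absent.

```python
import importlib


def check_import(module_name):
    """Return True if the module imports cleanly, False otherwise.

    Hypothetical helper for illustration; not part of tiledbvcf.
    """
    try:
        module = importlib.import_module(module_name)
    except ImportError as exc:
        print(f"Could not import {module_name}: {exc}")
        return False
    # Not every package defines __version__, so fall back to a placeholder.
    version = getattr(module, "__version__", "unknown")
    print(f"{module_name} imported OK (version: {version})")
    return True


if __name__ == "__main__":
    check_import("tiledbvcf")
```

Run it with the tiledbvcf-py conda environment active, since that is where the package was installed.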

Dask Integration

The TileDB-VCF Python API contains several points of integration with Dask for parallel computing on TileDB-VCF arrays. Describing how to set up a Dask cluster is outside the scope of this guide. To quickly test on a local machine, run:

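import tiledbvcf
import dask

# Open an existing TileDB-VCF dataset (replace the URI with your own)
ds = tiledbvcf.TileDBVCFDataset('s3://my-bucket/my-dataset/')

# Build a Dask dataframe, splitting the BED file's regions into 8 partitions
dask_df = ds.read_dask(attrs=['sample_name', 'pos_start', 'pos_end'],
                       bed_file='s3://synthetic-gvcfs/bedfiles/sorted10000.bed',
                       region_partitions=8)

# Run the partitioned reads and collect the result as a pandas dataframe
df = dask_df.compute()
df.head()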
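To build intuition for what a setting like region_partitions=8 implies, the sketch below splits a list of genomic regions into near-equal contiguous chunks, each of which could be read as an independent task. This is an illustration only: `partition_regions` is a hypothetical helper, the regions are made up, and the code is not TileDB-VCF's actual partitioning logic.

```python
def partition_regions(regions, num_partitions):
    """Split regions into num_partitions contiguous chunks of near-equal size.

    Hypothetical helper for illustration; not TileDB-VCF's internal algorithm.
    """
    if num_partitions < 1:
        raise ValueError("num_partitions must be at least 1")
    # The first `extra` chunks each receive one additional region.
    base, extra = divmod(len(regions), num_partitions)
    chunks = []
    start = 0
    for i in range(num_partitions):
        size = base + (1 if i < extra else 0)
        chunks.append(regions[start:start + size])
        start += size
    return chunks


# Ten made-up 1 kb regions on chr1, as (contig, start, end) tuples.
regions = [("chr1", i * 1000 + 1, (i + 1) * 1000) for i in range(10)]
chunks = partition_regions(regions, 4)
sizes = [len(chunk) for chunk in chunks]
print(sizes)  # [3, 3, 2, 2]
```

More partitions allow more tasks to run in parallel, at the cost of more per-task overhead, so a reasonable value depends on the number of regions and the available workers.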