First, clone the TileDB-VCF repo and change to the Python API directory:
git clone https://github.com/TileDB-Inc/TileDB-VCF.git
cd TileDB-VCF/apis/python
Next, set up the conda environment for building. This downloads and installs the required Python dependencies in a new conda environment called tiledbvcf-py:
conda env create -f conda-env.yml
conda activate tiledbvcf-py
Once you have activated the conda environment, run the installation script:
python setup.py install
This step may take up to 10 minutes, depending on your system, as it automatically builds the TileDB-VCF library, the Python bindings, and the
tiledbvcf Python package. The package and bundled native libraries get installed into the active conda environment.
You can optionally run the Python unit tests as follows:
python setup.py pytest
To test that the package was installed correctly, run:
$ python
>>> import tiledbvcf
If you do not get any warnings or errors when importing the module, the package was installed correctly.
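The same check can be scripted, which may be convenient in a build or CI job. The following is a minimal sketch using only the Python standard library; the helper name `check_import` is hypothetical and is not part of TileDB-VCF. It imports a module by name and collects any warnings or errors raised along the way.

```python
import importlib
import warnings


def check_import(module_name):
    """Import a module by name; return (ok, messages).

    `ok` is True only if the import raised no exception and emitted
    no warnings. This helper is illustrative, not part of TileDB-VCF.
    """
    with warnings.catch_warnings(record=True) as caught:
        # Record every warning, including ones normally shown only once.
        warnings.simplefilter("always")
        try:
            importlib.import_module(module_name)
        except Exception as exc:  # e.g. ImportError, or OSError from native libs
            return False, [f"error: {exc!r}"]
    messages = [f"warning: {w.message}" for w in caught]
    return not messages, messages


if __name__ == "__main__":
    ok, messages = check_import("tiledbvcf")
    print("tiledbvcf imported cleanly" if ok else "problems found:")
    for message in messages:
        print("  " + message)
```

Note that a module that was already imported earlier in the same interpreter session will not re-emit its import-time warnings, so run the check in a fresh Python process.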
The TileDB-VCF Python API contains several points of integration with Dask for parallel computing on TileDB-VCF arrays. Describing how to set up a Dask cluster is outside the scope of this guide. To quickly test on a local machine, run:
import tiledbvcf
import dask

ds = tiledbvcf.TileDBVCFDataset('s3://my-bucket/my-dataset/')
dask_df = ds.read_dask(attrs=['sample_name', 'pos_start', 'pos_end'],
                       bed_file='s3://synthetic-gvcfs/bedfiles/sorted10000.bed',
                       region_partitions=8)
df = dask_df.compute()
df.head()
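To give a feel for what splitting a read into 8 region partitions means, here is a conceptual sketch in plain Python. It is only an illustration under stated assumptions: the function `partition_regions` and the sample regions are hypothetical, and this is not TileDB-VCF's actual partitioning logic, which may divide regions differently. It divides a list of regions into contiguous chunks of near-equal size, each of which could then be handed to a separate worker.

```python
def partition_regions(regions, num_partitions):
    """Split `regions` into at most `num_partitions` contiguous chunks
    whose sizes differ by at most one. Empty chunks are dropped."""
    if num_partitions < 1:
        raise ValueError("num_partitions must be at least 1")
    base, extra = divmod(len(regions), num_partitions)
    chunks, start = [], 0
    for i in range(num_partitions):
        # The first `extra` chunks take one additional region each.
        size = base + (1 if i < extra else 0)
        if size:
            chunks.append(regions[start:start + size])
        start += size
    return chunks


# Hypothetical (contig, start, end) regions, as they might appear in a BED file.
regions = [("chr1", i * 1000, (i + 1) * 1000) for i in range(20)]
chunks = partition_regions(regions, 8)
print([len(chunk) for chunk in chunks])  # [3, 3, 3, 3, 2, 2, 2, 2]
```

In this sketch, more partitions mean smaller units of work per task, which tends to help parallelism up to the number of available workers.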