TileDB-VCF is an open-source C++ library for efficient storage and retrieval of genomic variant call data based on TileDB Open Source.
TileDB was first used to build a variant store system in 2015, as part of a collaboration between Intel Labs, Intel Health and Life Sciences, MIT and the Broad Institute. The resulting software, GenomicsDB, became part of Broad's GATK 4.0. TileDB-VCF is an evolution of that work, built on an improved and always up-to-date version of TileDB Open Source, incorporating new algorithms, features and optimizations. TileDB-VCF is the product of a collaboration between TileDB, Inc. and Helix.
TileDB-VCF offers several important benefits:
Performance: TileDB-VCF is optimized for rapidly slicing variant records by genomic regions across arbitrary numbers of samples. Additional features for ingesting VCF files and performing genomic interval intersections are all implemented in C++ for maximum speed.
Compressibility: TileDB-VCF can efficiently store any number of samples in a compressed and lossless manner. Being a columnar format, TileDB-VCF can compress different VCF fields with different compressors, choosing the most appropriate one depending on the data types and nature of the data per field.
Optimized for cloud object stores: Built on TileDB Open Source, TileDB-VCF inherits all its features out of the box, such as the speed and optimizations on numerous storage backends, including cloud object stores such as AWS S3, Azure Blob Storage and Google Cloud Storage.
Updatability: TileDB-VCF allows you to rapidly add new samples to existing datasets, eliminating the so-called N+1 problem and scaling both storage and update time linearly with the number of new samples.
Interoperability: In addition to a command-line interface, TileDB-VCF provides C++ and Python APIs, as well as integrations with Spark and Dask. This allows the user to utilize a plethora of tools from the Data Science and Machine Learning ecosystems in a seamless and natural fashion, without having to convert data from one format to another or use unfamiliar APIs.
Open-source: TileDB Open Source and TileDB-VCF are both open source and developed under the permissive MIT license. We are passionate about supporting the scientific community with open-source software and welcome feedback and contributions to further improve our projects.
Battle-tested: TileDB-VCF has been used in production by very large customers to manage immense amounts of variant data, on the order of many hundreds of thousands of samples.
The documentation of TileDB-VCF includes information about the data model (and its mapping to 3D sparse arrays), installation instructions, How To guides, and the API reference.
You can install TileDB-VCF in two ways, described in the installation section below.
TileDB-VCF uses 3D sparse arrays to store genomic variant data. This section describes the technical implementation details about the underlying model, including array schemas, dimensions, tiling order, attributes, and metadata. It is recommended that you read the following section to fully understand the mapping between the TileDB-VCF data and TileDB arrays and groups:
A TileDB-VCF dataset is composed of a group of two separate TileDB arrays:
a 3D sparse array for the actual genomic variants and associated fields/attributes
a 1D sparse array for the metadata stored in each single-sample VCF header
The 3D sparse data array has the following schema parameters:

| Parameter | Value |
| --- | --- |
| Array type | Sparse |
| Rank | 3D |
| Cell order | Row-major |
| Tile order | Row-major |
The dimensions in the schema are:

| Dimension Name | TileDB Datatype | Corresponding VCF Field |
| --- | --- | --- |
| `contig` | `TILEDB_STRING_ASCII` | `CHR` |
| `start_pos` | `uint32_t` | VCF `POS`, plus TileDB anchors |
| `sample` | `TILEDB_STRING_ASCII` | Sample name |
The coordinates of the 3D array are the contig along the first dimension, the chromosomal location where the variant starts along the second dimension, and the sample name along the third dimension. For example, a variant called at position 12345 of chr1 in sample HG00096 is stored at coordinates `("chr1", 12345, "HG00096")`.
For each field in a single-sample VCF record there is a corresponding attribute in the schema.
| Attribute Name | TileDB Datatype | Description |
| --- | --- | --- |
| `end_pos` | `uint32_t` | VCF `END` position of VCF records |
| `qual` | `float` | VCF `QUAL` field |
| `alleles` | `var<char>` | CSV list of the `REF` and `ALT` VCF fields |
| `id` | `var<char>` | VCF `ID` field |
| `filter_ids` | `var<int32_t>` | Vector of integer IDs of entries in the `FILTER` VCF field |
| `real_start_pos` | `uint32_t` | VCF `POS` (no anchors) |
| `info` | `var<uint8_t>` | Byte blob containing any `INFO` fields that are not stored as explicit attributes |
| `fmt` | `var<uint8_t>` | Byte blob containing any `FMT` fields that are not stored as explicit attributes |
| `info_*` | `var<uint8_t>` | One or more attributes storing specific VCF `INFO` fields, e.g. `info_DP`, `info_MQ`, etc. |
| `fmt_*` | `var<uint8_t>` | One or more attributes storing specific VCF `FORMAT` fields, e.g. `fmt_GT`, `fmt_MIN_DP`, etc. |
| `info_TILEDB_IAF` | `var<float>` | Computed allele frequency |
The `info_*` and `fmt_*` attributes allow individual `INFO` or `FMT` VCF fields to be extracted into explicit array attributes. This can be beneficial if your queries frequently access only a subset of the `INFO` or `FMT` fields, as no unrelated data then needs to be fetched from storage.
The choice of which fields to extract as explicit array attributes is user-configurable during array creation.
Any extra `INFO` or `FORMAT` fields not extracted as explicit array attributes are stored in the byte blob attributes, `info` and `fmt`.
The dataset also stores the following general metadata:

| Metadata Field | Description |
| --- | --- |
| `anchor_gap` | Anchor gap value |
| `extra_attributes` | List of `INFO` or `FMT` field names that are stored as explicit array attributes |
| `version` | Array schema version |
These metadata values are updated during array creation, and are used during the export phase. The metadata is stored as "array metadata" in the sparse data array.
When ingesting samples, the sample header must be identical for all samples with respect to the contig mappings. That means all samples must have the exact same set of contigs listed in the VCF header. This requirement will be relaxed in future versions.
The `vcf_headers` array stores the original text of every ingested VCF header in order to:
- ensure the original VCF file can be fully recovered for any given sample
- reconstruct an `htslib` header instance when reading from the dataset, which is used for operations such as mapping a filter ID back to the filter string, etc.
The `vcf_headers` array has the following schema parameters:

| Parameter | Value |
| --- | --- |
| Array type | Sparse |
| Rank | 1D |
| Cell order | Row-major |
| Tile order | Row-major |
Its single dimension is:

| Dimension Name | TileDB Datatype | Description |
| --- | --- | --- |
| `sample` | `TILEDB_STRING_ASCII` | Sample name |

Its sole attribute is:

| Attribute Name | TileDB Datatype | Description |
| --- | --- | --- |
| `header` | `var<char>` | Original text of the VCF header |
To summarize, we've described three main entities:
- The variant data array (3D sparse)
- The general metadata, stored in the variant data array as array metadata
- The VCF header array (1D sparse)
All together we term this a "TileDB-VCF dataset." Expressed as a directory hierarchy, a TileDB-VCF dataset has the following structure:
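A sketch of the layout, based on the description below (internal bookkeeping files omitted; exact contents vary by TileDB version):

```text
<dataset_uri>/                 # TileDB group (dataset root)
├── data/                      # 3D sparse array: variant data
│   └── __meta/                # general TileDB-VCF metadata (array metadata)
└── vcf_headers/               # 1D sparse array: original VCF headers
```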
The root of the dataset, `<dataset_uri>`, is a TileDB group. The `data` member is the TileDB 3D sparse array storing the variant data; it stores the general TileDB-VCF metadata as its array metadata in the folder `data/__meta`. The `vcf_headers` member is the TileDB 1D sparse array containing the VCF header data.
During array creation, there are several array-related parameters that the user can control, as sketched in the example below. These are:
- Array data tile capacity (default 10,000)
- The "anchor gap" size (default 1,000)
- The list of `INFO` and `FMT` fields to store as explicit array attributes (default: none)
Once chosen, these parameters cannot be changed.
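A sketch using the CLI; the flag names reflect recent versions of tiledbvcf, so check `tiledbvcf create --help` on your installation:

```sh
# Hypothetical example: set all three creation parameters explicitly
tiledbvcf create \
  --uri my_dataset \
  --tile-capacity 10000 \
  --anchor-gap 1000 \
  --attributes info_DP,fmt_GT
```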
During sample ingestion, the user can additionally specify the sample batch size (default 10).
The above parameters may impact read and write performance, as well as the size of the persisted array. Therefore, some care should be taken to determine good values for these parameters before ingesting a large amount of data into an array.
Indexed files are required for ingestion. If your VCF/BCF files have not been indexed, you can use bcftools or tabix to do so:
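For example (both produce an index that htslib can use; file names are placeholders):

```sh
bcftools index my_sample.vcf.gz    # creates a .csi index
tabix -p vcf my_sample.vcf.gz      # creates a .tbi index
```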
You can ingest samples into an already created dataset as follows:
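A minimal sketch (the dataset URI and file names are placeholders):

```sh
tiledbvcf store --uri my_dataset sample1.vcf.gz sample2.vcf.gz
```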
Just add a wildcard expression for the VCF file locations at the end of the store command:
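For example (the shell expands the wildcard into the list of matching files):

```sh
tiledbvcf store --uri my_dataset data/*.vcf.gz
```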
Alternatively, provide a text file with the absolute locations of the VCF files, one per line:
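A sketch, assuming the samples-file flag of recent CLI versions:

```sh
# samples.txt contains one absolute VCF/BCF URI per line
tiledbvcf store --uri my_dataset --samples-file samples.txt
```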
Incremental updates work in the same manner as the ingestion above; nothing special is needed. In addition, ingestion is thread- and process-safe and can therefore be performed in parallel.
The first step before ingesting any VCF samples is to create a dataset. This effectively creates a TileDB group and the appropriate empty arrays in it.
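For example, via the CLI (the dataset URI is a placeholder):

```sh
tiledbvcf create --uri my_dataset
```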
If you wish to turn some of the `INFO` and `FMT` fields into separate materialized attributes, you can do so as follows (names should be `fmt_X` or `info_X` for a field name `X`; case-sensitive).
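A sketch; the flag name reflects recent CLI versions and the chosen fields are examples:

```sh
# Materialize the FORMAT GT and INFO DP fields as explicit attributes
tiledbvcf create --uri my_dataset --attributes fmt_GT,info_DP
```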
Before slicing the data, you may wish to get some information about your dataset, such as the sample names, the attributes you can query, etc.
You can get the sample names as follows:
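For example, using the CLI's list command, which prints one sample name per line (the dataset URI is a placeholder):

```sh
tiledbvcf list --uri my_dataset
```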
You can get the attributes as follows:
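A sketch using the Python API; the `attributes()` method and its `attr_type` argument reflect recent tiledbvcf-py versions, so check your installed version:

```python
import tiledbvcf

ds = tiledbvcf.Dataset("my_dataset", mode="r")
print(ds.attributes())                  # all queryable attributes
print(ds.attributes(attr_type="info"))  # materialized INFO fields only
```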
You can rapidly read from a TileDB-VCF dataset by providing three main parameters (all optional), as shown in the Python sketch after this list:
- A subset of the samples
- A subset of the attributes
- One or more genomic ranges, either as strings in the format `chr:pos_range` or via a BED file
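A minimal sketch using the Python API (dataset URI, region strings, and sample names are placeholders):

```python
import tiledbvcf

ds = tiledbvcf.Dataset("my_dataset", mode="r")

# Returns a pandas DataFrame with one row per matching record/sample pair
df = ds.read(
    attrs=["sample_name", "contig", "pos_start", "fmt_GT"],
    regions=["chr1:1000-20000", "chr2:500-1000"],
    samples=["HG00096", "HG00097"],
)
print(df.head())
```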
By default the high-level APIs (i.e., Python and Spark) will also build TileDB-VCF itself automatically, bundling the resulting shared libraries into the final packages. However, the APIs can also be built separately.
Please ensure the following dependencies are installed on your system before building TileDB-VCF:
- CMake >= 3.3
- A C++ compiler supporting C++20 (such as GCC 10 or newer)
- git
- HTSlib 1.8
If HTSlib is not installed on your system, TileDB-VCF will download and build a local copy automatically. However, in order for this to work, the dependencies of HTSlib must be installed beforehand; these typically include autoconf, automake, and the zlib, bzip2, and lzma development libraries.
A pre-compiled HTSlib library will be downloaded automatically for Windows builds.
To install just the TileDB-VCF library and CLI, execute:
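A sketch of the typical CMake flow, following the repository's README (target names may change between releases):

```sh
git clone https://github.com/TileDB-Inc/TileDB-VCF.git
cd TileDB-VCF/libtiledbvcf
mkdir build && cd build
cmake ..
make -j4
make install-libtiledbvcf
```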
By default this will build and install TileDB-VCF into `TileDB-VCF/dist`.
You can verify that the installation succeeded by checking the version via the CLI:
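Assuming the default install prefix from above (the version subcommand prints the library version):

```sh
./dist/bin/tiledbvcf version
```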
The high-level API build processes perform essentially the same steps as these, including installing the library and CLI binary into `TileDB-VCF/dist/bin`. So, if you build one of the APIs, the steps above will be executed automatically for you.
Installing the Python module also requires `conda`. See the conda documentation for instructions on how to install it.
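A sketch of the typical flow, assuming you have already cloned the repository and have a working conda installation (the environment name and Python version are placeholders):

```sh
cd TileDB-VCF/apis/python
conda create -n tiledbvcf-py python=3.10
conda activate tiledbvcf-py
pip install .
```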
The last step may take up to 10 minutes, depending on your system, as it automatically builds the TileDB-VCF library, the Python bindings, and the `tiledbvcf` Python package. The package and bundled native libraries get installed into the active `conda` environment.
You can optionally run the Python unit tests as follows:
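For example, assuming pytest is installed in the same environment:

```sh
cd TileDB-VCF/apis/python
pytest
```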
To test that the package was installed correctly, run:
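For example (the version attribute is an assumption; any successful import indicates a working install):

```sh
python -c "import tiledbvcf; print(tiledbvcf.version)"
```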
The last step may take up to 10 minutes, depending on your system, as it automatically builds the TileDB-VCF library and the Spark API.
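For reference, the build itself is roughly the following (the Gradle task name follows the project's README and may change):

```sh
cd TileDB-VCF/apis/spark
./gradlew assemble
```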
You can optionally run the Spark unit tests as follows:
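For example:

```sh
cd TileDB-VCF/apis/spark
./gradlew test
```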
To build an uber `.jar`, which includes all dependencies, run:
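For example, using the `jar` task referenced later in this guide:

```sh
cd TileDB-VCF/apis/spark
./gradlew jar
```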
This will place the `.jar` file in the `build/libs/` directory. The `.jar` file also contains the bundled native libraries.
Creating a cluster may take 10–15 minutes. When it is ready, find the public DNS of the master node (it will be shown on the cluster Summary tab). SSH into the master node using the username `hadoop`, and then perform the following setup steps:
Next, follow the steps above for building the Spark API with Gradle. Make sure to run the `./gradlew jar` step to produce the TileDB-VCF jar.
Then launch the Spark shell, specifying the `.jar` and any other desired Spark configuration, e.g.:
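For example (the jar path and memory settings are placeholders):

```sh
spark-shell \
  --jars /path/to/TileDB-VCF-Spark.jar \
  --driver-memory 8g \
  --executor-memory 8g
```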
The TileDB-VCF Python API contains several points of integration with Dask for parallel computing on the TileDB-VCF arrays. Describing how to set up a Dask cluster is out of the scope of this guide. However, to quickly test on a local machine, run:
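A minimal local sketch; `read_dask()` and its arguments reflect the Python API's Dask integration in recent versions, so check your installed version:

```python
import tiledbvcf
from dask.distributed import Client

client = Client(processes=False)  # small in-process Dask cluster for testing

ds = tiledbvcf.Dataset("my_dataset", mode="r")
# Returns a dask DataFrame; compute() materializes it as pandas
ddf = ds.read_dask(
    attrs=["sample_name", "pos_start", "fmt_GT"],
    regions=["chr1:1000-20000"],
)
print(ddf.compute())
```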
The Spark API is implemented as a Spark data source and requires Spark 2.4.
Spark cluster management is outside the scope of this guide. However, you can learn more about launching an EMR Spark cluster in the AWS EMR documentation.