The first step before ingesting any VCF samples is to create a dataset. This effectively creates a TileDB group and the appropriate empty arrays in it.
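For example, creating a dataset with the CLI looks something like this (my_dataset is a placeholder URI):

```bash
# Create a new, empty TileDB-VCF dataset (a TileDB group containing the underlying arrays)
tiledbvcf create --uri my_dataset
```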
If you wish to turn some of the INFO and FMT fields into separate materialized attributes, you can do so as follows (names should be fmt_X or info_X for a field name X, case sensitive).
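A sketch, assuming the create command accepts an --attributes flag with a comma-separated list of fields to materialize (check tiledbvcf create --help for the exact flag name):

```bash
# Materialize the DP info field and the GT format field as standalone attributes
tiledbvcf create --uri my_dataset --attributes info_DP,fmt_GT
```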
TileDB Open Source provides native support for reading from and writing to cloud object stores like AWS S3, Google Cloud Storage, and Microsoft Azure Blob Storage. This guide covers some considerations for using TileDB-VCF with these services. The examples focus exclusively on S3, which is the most widely used, but any of the aforementioned services can be substituted, as can on-premises services like MinIO that provide S3-compatible APIs.
The process of creating a TileDB-VCF dataset on S3 is nearly identical to creating a local dataset. The only difference is that an s3:// URI is passed to the --uri argument rather than a local file path.
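For example (the bucket and dataset names are placeholders):

```bash
# Create the dataset directly on S3; only the URI changes
tiledbvcf create --uri s3://my-bucket/my_dataset
```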
This also works when querying a TileDB-VCF dataset located on S3.
VCF files located on S3 can be ingested directly into a TileDB-VCF dataset using one of two approaches.
The first approach is the easiest: you simply pass the tiledbvcf store command a list of S3 URIs and TileDB-VCF takes care of the rest:
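(The bucket and file names below are placeholders.)

```bash
# Ingest two samples directly from S3; only the small index files are fetched locally
tiledbvcf store --uri s3://my-bucket/my_dataset \
  s3://my-bucket/vcfs/sample1.vcf.gz \
  s3://my-bucket/vcfs/sample2.vcf.gz
```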
In this approach, remote VCF index files (which are relatively tiny) are downloaded locally, allowing TileDB-VCF to retrieve chunks of variant data from the remote VCF files without having to download them in full. By default, index files are downloaded to your current working directory; however, you can choose to store them in a different location (e.g., a temporary directory) using the --scratch-dir argument.
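For example (the scratch path is a placeholder):

```bash
# Download the VCF index files to a temporary directory instead of the working directory
tiledbvcf store --uri s3://my-bucket/my_dataset \
  --scratch-dir /tmp/tiledbvcf-scratch \
  s3://my-bucket/vcfs/sample1.vcf.gz
```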
The second approach is to download batches of VCF files in their entirety before ingestion, which may slightly improve ingestion performance. This approach requires allocating scratch disk space to TileDB-VCF using the --scratch-mb and --scratch-dir arguments.
The number of VCF files downloaded at a time is determined by the --sample-batch-size parameter, which defaults to 10. Downloading and ingestion happen asynchronously, so, for example, batch 3 is downloaded while batch 2 is being ingested. As a result, you must configure enough scratch space to store at least 20 samples, assuming a batch size of 10.
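A sketch of the second approach (the sizes, paths, and file names are illustrative):

```bash
# Allocate 10 GB of scratch space and download samples in batches of 10 before ingesting
tiledbvcf store --uri s3://my-bucket/my_dataset \
  --scratch-mb 10240 \
  --scratch-dir /tmp/tiledbvcf-scratch \
  --sample-batch-size 10 \
  s3://my-bucket/vcfs/sample1.vcf.gz s3://my-bucket/vcfs/sample2.vcf.gz
```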
For TileDB to access a remote storage bucket, you must be properly authenticated on the machine running TileDB. For S3, this means having access to the appropriate AWS access key ID and secret access key. This typically happens in one of three ways:
If the AWS Command Line Interface (CLI) is installed on your machine, running aws configure will store your credentials in a local profile that TileDB can access. You can verify the CLI has been previously configured by running:
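(No TileDB-specific flags are needed here; this is the standard AWS CLI listing command.)

```bash
# Prints the S3 buckets your configured credentials can access
aws s3 ls
```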
If properly configured, this will output a list of the S3 buckets you (and thus TileDB) can access.
You can pass your AWS access key ID and secret access key to TileDB-VCF directly via the --tiledb-config argument, which expects a comma-separated string:
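(The placeholders below stand in for your real credentials; avoid hard-coding credentials in shared scripts.)

```bash
# Pass the TileDB S3 credential parameters as a comma-separated key=value string
tiledbvcf store --uri s3://my-bucket/my_dataset \
  --tiledb-config "vfs.s3.aws_access_key_id=<your-access-key-id>,vfs.s3.aws_secret_access_key=<your-secret-access-key>" \
  s3://my-bucket/vcfs/sample1.vcf.gz
```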
Your AWS credentials can also be passed to TileDB by defining the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables.
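For example:

```bash
export AWS_ACCESS_KEY_ID="<your-access-key-id>"
export AWS_SECRET_ACCESS_KEY="<your-secret-access-key>"
```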
Before slicing the data, you may wish to get some information about your dataset, such as the sample names, the attributes you can query, etc.
You can get the sample names as follows:
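(A sketch using the Python API; the samples() helper on Dataset is an assumption, and my_dataset is a placeholder URI.)

```python
import tiledbvcf

ds = tiledbvcf.Dataset("my_dataset", mode="r")
print(ds.samples())  # assumed helper returning the list of ingested sample names
```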
You can get the attributes as follows:
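(Again a sketch; the attributes() helper is an assumption.)

```python
import tiledbvcf

ds = tiledbvcf.Dataset("my_dataset", mode="r")
print(ds.attributes())  # assumed helper listing the attributes available for queries
```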
You can rapidly read from a TileDB-VCF dataset by providing three main parameters (all optional), as shown in the example after this list:
- A subset of the samples
- A subset of the attributes
- One or more genomic ranges
  - Either as strings in the format chr:pos_range
  - Or via a BED file
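A sketch with the Python API (the attribute names, sample names, and regions are illustrative):

```python
import tiledbvcf

ds = tiledbvcf.Dataset("my_dataset", mode="r")
df = ds.read(
    attrs=["sample_name", "pos_start", "fmt_GT"],  # subset of attributes
    samples=["sample1", "sample2"],                # subset of samples
    regions=["chr1:1000-20000", "chr2:500-1000"],  # genomic ranges as chr:pos_range strings
)
```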
TileDB-VCF's Spark API offers a DataSourceV2 data source to read TileDB-VCF datasets into a Spark dataframe. To begin, launch a Spark shell with:
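(The jar path below is a placeholder for wherever the TileDB-VCF Spark jar lives on your system; these sketches use PySpark, but the Scala spark-shell works the same way.)

```bash
pyspark --jars /path/to/tiledb-vcf-spark.jar
```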
Depending on the size of the dataset and query results, you may need to increase the configured memory from the defaults:
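(The memory values below are illustrative; size them to your cluster and query.)

```bash
pyspark --jars /path/to/tiledb-vcf-spark.jar \
  --driver-memory 8g \
  --executor-memory 16g
```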
While the Spark API offers much of the same functionality provided by the CLI, the main considerations when using the Spark API are typically dataframe partitioning and memory overhead.
There are two ways to partition a TileDB-VCF dataframe, which can be used separately or together:
- Partitioning over samples (.option("sample_partitions", N)).
- Partitioning over genomic regions (.option("range_partitions", M)).
Conceptually, these correspond to partitioning over rows and columns, respectively, in the underlying TileDB array. For example, if you are reading a subset of 200 samples in a dataset, and specify 10 sample partitions, Spark will create 10 jobs, each of which will be responsible for handling the export from 20 of the 200 selected samples.
Similarly, if the provided BED file contains 1,000,000 regions, and you specify 10 region partitions, Spark will create 10 jobs, each of which will be responsible for handling the export of 100,000 of the 1,000,000 regions (across all samples).
These two partitioning parameters can be composed to form rectangular regions of work to distribute across the available Spark executors.
The CLI offers the same partitioning feature via the --sample-partition and --region-partition flags.
Because TileDB-VCF is implemented as a native library, the Spark API makes use of both JVM on-heap and off-heap memory. This can make it challenging to correctly tune the memory consumption of each Spark job.
The .option("memory", mb)
option is used as a best-effort memory budget that any particular job is allowed to consume, across on-heap and off-heap allocations. Increasing the available memory with this option can result in large performance improvements, provided the executors are configured with sufficient memory to avoid OOM failures.
To export a few genomic regions from a dataset (which must be accessible by the Spark jobs; here we assume it is located on S3):
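(A PySpark sketch: the io.tiledb.vcf data source name and the uri/ranges option names are assumptions; range_partitions is the option described above, and the dataset URI is a placeholder.)

```python
# Read two genomic regions with two region partitions, so each Spark job handles one region
df = (
    spark.read.format("io.tiledb.vcf")                  # assumed data source name
    .option("uri", "s3://my-bucket/my_dataset")         # assumed option for the dataset URI
    .option("ranges", "chr1:1000-20000,chr2:500-1000")  # assumed option for the two regions
    .option("range_partitions", 2)
    .load()
)
df.show()
```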
Because there are only two regions specified, and two region partitions, each of the two Spark jobs started will read one of the given regions.
We can also place a list of sample names and a list of genomic regions to read in explicit text files, and use those during reading:
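(Another sketch: the samplefile and bedfile option names are assumptions, and the file URIs are placeholders; memory and sample_partitions are the options described above.)

```python
df = (
    spark.read.format("io.tiledb.vcf")                        # assumed data source name
    .option("uri", "s3://my-bucket/my_dataset")
    .option("samplefile", "s3://my-bucket/sample-names.txt")  # assumed option: text file of sample names
    .option("bedfile", "s3://my-bucket/regions.bed")          # assumed option: BED file of regions
    .option("memory", 8192)                                   # 8 GB memory budget per job
    .option("range_partitions", 10)
    .option("sample_partitions", 4)
    .load()
)
df.count()
```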
Here, we've also increased the memory budget to 8GB, and added partitioning over samples.
You can use the TileDB-VCF CLI to losslessly export an ingested dataset back to VCF format for downstream analyses.
While these exports are lossless in terms of the actual data stored, they may not be identical to the original files. For example, fields within the INFO and FORMAT columns may appear in a slightly different order in the exported files.
To recreate all of the original (single-sample) VCF files, simply run the export command and set --output-format to v, for VCF.
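For example (the output-directory flag name is an assumption; check tiledbvcf export --help for the exact name):

```bash
# Export every ingested sample back to single-sample VCF files
tiledbvcf export --uri my_dataset \
  --output-format v \
  --output-dir exported-vcfs   # assumed flag for the destination directory
```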
If bcftools is available on your system, you can use it to easily examine any of the exported files:
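(The exported file name below is a placeholder.)

```bash
# Print the header and first few records of one exported VCF
bcftools view exported-vcfs/sample1.vcf | head -n 40
```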
The same mechanics for filtering records by sample and genomic region also apply to exporting VCF files.
Unlike TileDB-VCF's CLI, which exports directly to disk, results for queries performed using Python are read into memory. Therefore, when querying even moderately sized genomic datasets, the amount of available memory must be taken into consideration.
This guide demonstrates several of the TileDB-VCF features for overcoming memory limitations when querying large datasets.
One strategy for accommodating large queries is to simply increase the amount of memory available to tiledbvcf. By default, tiledbvcf allocates 2GB of memory for queries. However, this value can be adjusted using the memory_budget_mb parameter. For the purposes of this tutorial, the budget will be decreased to demonstrate how tiledbvcf can perform genome-scale queries even in a memory-constrained environment.
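A sketch of lowering the budget via the Python API's read configuration (the URI and budget value are illustrative):

```python
import tiledbvcf

# Restrict the read budget to 512 MB to mimic a memory-constrained environment
cfg = tiledbvcf.ReadConfig(memory_budget_mb=512)
ds = tiledbvcf.Dataset("my_dataset", mode="r", cfg=cfg)
```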
For queries that encompass many genomic regions, you can simply provide an external BED file. In this example, you will query for any variants located in the promoter regions of known genes on chromosomes 1-4.
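A sketch (the BED file name and attribute list are placeholders):

```python
# Query all variants intersecting the regions listed in the BED file
df = ds.read(
    attrs=["sample_name", "contig", "pos_start", "fmt_GT"],
    bed_file="gene-promoters-chr1-4.bed",
)
```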
After performing a query, you can use read_completed() to verify whether or not all results were successfully returned.
In this case, it returned False, indicating the requested data was too large to fit into the allocated memory, so tiledbvcf retrieved as many records as possible in this first batch. The remaining records can be retrieved using continue_read(). Here, we've set up our code to accommodate the possibility that the full set of results is split across multiple batches, as in the sketch below.
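(The df and ds objects carry over from the query above.)

```python
import pandas as pd

batches = [df]  # first batch returned by ds.read(...)
while not ds.read_completed():
    # Fetch the next batch of records that did not fit within the memory budget
    batches.append(ds.continue_read())
df = pd.concat(batches, ignore_index=True)
```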
Here is the final dataframe, which includes 3,808,687 records:
A Python generator version of the read method is also provided. This pattern provides a powerful interface for batch processing variant data.
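A sketch, assuming the generator is exposed as read_iter() with the same arguments as read():

```python
# Process the results one batch at a time instead of concatenating them in memory
for batch in ds.read_iter(attrs=["sample_name", "pos_start", "fmt_GT"],
                          bed_file="gene-promoters-chr1-4.bed"):
    print(len(batch))  # each batch is a dataframe that fits within the memory budget
```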
This section includes useful guides on using TileDB-VCF:
Indexed files are required for ingestion. If your VCF/BCF files have not been indexed, you can use bcftools to do so:
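(The file name below is a placeholder.)

```bash
# Create a CSI index alongside the compressed VCF
bcftools index sample1.vcf.gz
```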
You can ingest samples into an already created dataset as follows:
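(The dataset URI and file names are placeholders.)

```bash
tiledbvcf store --uri my_dataset sample1.vcf.gz sample2.vcf.gz
```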
Just add a regular expression for the VCF file locations at the end of the store command:
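(This sketch relies on the shell expanding the pattern; the directory is a placeholder.)

```bash
tiledbvcf store --uri my_dataset ./vcfs/*.vcf.gz
```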
Alternatively, provide a text file with the absolute locations of the VCF files, one per line:
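(The --samples-file flag name is an assumption; check tiledbvcf store --help for the exact name.)

```bash
# samples.txt contains one absolute VCF path per line
tiledbvcf store --uri my_dataset --samples-file samples.txt
```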
Incremental updates work in the same manner as the ingestion above; nothing special is needed. In addition, ingestion is thread- and process-safe and can therefore be performed in parallel.
The tiledbvcf Python package includes integration with Dask to enable distributing large queries across a cluster of nodes.
You can use the tiledbvcf package's Dask integration to partition read operations across regions and samples. The partitioning semantics are identical to those used by the CLI and Spark.
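A sketch (the region_partitions and sample_partitions argument names mirror the CLI and Spark options but are assumptions, and the dataset URI and BED file are placeholders):

```python
import tiledbvcf
from dask.distributed import Client

client = Client()  # local Dask cluster for simplicity
ds = tiledbvcf.Dataset("my_dataset", mode="r")

# Returns a Dask dataframe whose partitions are read lazily by the workers
ddf = ds.read_dask(
    attrs=["sample_name", "pos_start", "fmt_GT"],
    bed_file="very-large-bedfile.bed",
    region_partitions=10,
    sample_partitions=2,
)
```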
The result is a Dask dataframe (rather than a Pandas dataframe). We're using a local machine for simplicity but the API works on any Dask cluster.
If you plan to filter the results in a Dask dataframe, it may be more efficient to use map_dask() rather than read_dask(). The map_dask() function takes an additional parameter, fnc, allowing you to provide a filtering function that is applied immediately after performing the read but before inserting the partition's results into the Dask dataframe.
In the following example, any variants overlapping regions in very-large-bedfile.bed are filtered out if their start position overlaps the first 25kb of the chromosome.
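A sketch (the map_dask() arguments other than fnc are assumptions, mirroring the read_dask() call above):

```python
def drop_chromosome_start(df):
    # Keep only records whose start position lies beyond the first 25 kb of the chromosome
    return df[df.pos_start > 25_000]

ddf = ds.map_dask(
    drop_chromosome_start,                        # fnc: applied to each batch before it enters the Dask dataframe
    attrs=["sample_name", "pos_start", "fmt_GT"],
    bed_file="very-large-bedfile.bed",
    region_partitions=10,
)
```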
This approach can be more efficient than using read_dask() with a separate filtering step because it avoids the possibility that partitions require multiple read operations due to memory constraints.
The pseudocode describing the read_partition() algorithm (i.e., the code responsible for reading the partition on a Dask worker) is:
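(The following is illustrative pseudocode, not the library's actual internals.)

```python
import pandas as pd

def read_partition(ds, partition):
    # Read the first batch for this partition
    df = ds.read(**partition)
    # If the partition did not fit in the memory budget, keep reading and concatenating
    while not ds.read_completed():
        df = pd.concat([df, ds.continue_read()], ignore_index=True)
    return df
```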
When using map_dask() instead, the pseudocode becomes:
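(Again illustrative pseudocode only.)

```python
import pandas as pd

def read_partition_mapped(ds, partition, filter_fnc):
    # Apply the filter to every batch as soon as it is read, before concatenating
    df = filter_fnc(ds.read(**partition))
    while not ds.read_completed():
        df = pd.concat([df, filter_fnc(ds.continue_read())], ignore_index=True)
    return df
```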
You can see that if the provided filter_fnc() reduces the size of the data substantially, using map_dask() can reduce the likelihood that the Dask workers will run out of memory and avoid the need to perform multiple reads.