Population Genomics
Our vision is to facilitate fast, large-scale genomics research at a fraction of the cost by providing infrastructure that is easy to setup and designed for extreme scale.
Storing genomic variant call data in a collection of VCF files can severely impact the performance of analyses at population scale. Motivated by these issues, we developed TileDB-VCF
The problem
In a nutshell: Storing vast quantities of genomic variant samples in legacy file formats leads to slow access, especially on the cloud. Merging the files into multi-sample files is not scalable or maintainable, resulting in the so called N+1 problem.
The de facto file format for storing genomic variant call data is VCF. This comes in two flavors, single-sample and multi-sample VCF (other names include combined VCF and project VCF, etc). Below we explain the problems with each of those approaches, as well as the data engineering effort involved when storing genomic data in a legacy, non-interoperable file format.
Single-sample VCF
Genomic analyses performed on collections of single-sample VCF files typically involve retrieving variant information from particular genomic ranges, for a specific subsets of samples, along with any of the provided VCF fields/attributes. Such queries are often repeated millions of times, so it is imperative that data retrieval is performed in a fast, scalable, and cost-effective manner.
However, accessing random locations in thousands—or hundreds of thousands—of different files is prohibitively expensive, both computationally and monetarily. This is especially true on cloud object stores, where each byte range in a file is a separate request that goes over the network. Not only does this introduce non-negligible latency, it can also incur significant costs as cloud object stores charge for every such request. For a typical analysis involving millions of requests on large collections of VCF files, this quickly becomes unsustainable.
Multi-sample VCF
The problems with single-sample VCF collections was the motivation behind multi-sample VCF, or project/population VCF (pVCF), files in which the entire collection of single-sample VCFs is merged into a single file. When indexed, specific records can be located within a pVCF file very quickly, as data retrieval is reduced to a simple, super-fast linear scan, minimizing latency, and significantly boosting I/O performance. However, this approach comes with significant costs.
First, the size of multi-sample pVCF files can scale superlinearly with the number of included samples. The problem is that individual VCF files contain very sparse data, which the pVCF file densifies by adding dummy information to denote variants missing from the original single-sample VCF file. This means that the combined pVCF solution is not scalable because it can lead to an explosion of storage overhead and high merging cost for large population studies.
Another problem is that the multi-sample pVCF file cannot be updated. Due to the fact that the sample-related information is listed in the last columns of the file, a new sample insertion will need to create a new column and inject values at the end of every line. This effectively means that a brand new pVCF will need to be generated with every update. If pVCF is large (typically in the order of many GBs or even TBs), the insertion process will be extremely slow. This is often referred to as the N+1 problem.
Data accessibility
Regardless of whether you deal with single- or multi-sample VCF, the typical way of accessing these files is via custom CLI tools, such as bcftools. Those tools are extremely useful for analysis that can be performed on a local machine (e.g., a laptop). However, they become unwieldy when scalable analysis necessitates the use of numerous machines working in parallel. Spinning up new nodes, deploying the software and orchestrating the parallel analysis consumes the majority of the time of researchers and hinders progress.
Furthermore, domain-specific tools either re-invent a lot of the work that has been done in general-purpose data science tools, or they miss out on them. This leads the researchers to writing custom code for converting the VCF data into a format that a programming language (e.g., Python and R) or a data science tool (e.g., pandas or Spark) understands. This data access and conversion cost often becomes the bottleneck, instead of the actual analysis.
The solution with TileDB
Modeling with 3D sparse arrays
Population variant data can be efficiently represented using a 3D sparse array. For each sample, imagine a 2D plane where the vertical axis is the contig and the horizontal axis is the genomic position. Every variant can be represented as a range within this plane; it can be unary (i.e., a SNP) or it can be a longer range (e.g., INDEL or CNV). Each sample is then indexed by a third dimension, which is unbounded to accommodate populations of any size. The figure below shows an example for one sample, with several variants distributed across contigs chr1
, chr2
and chr3
.
In TileDB-VCF, we represent the start position of each range as a non-empty cell in a sparse array (black squares in the above figure). In each of those array cells, we store the end position of each cell (to create a range) along with all other fields in the corresponding single-sample VCF files for each variant (e.g., REF
, ALT
, etc.). Therefore, for every sample, we map variants to 2D non-empty sparse array cells.
To facilitate rapid retrieval of interval intersections (explained in the next section), we also inject anchors (green square in the above figure) to breakup long ranges. Specifically, we create a new non-empty cell every anchor_gap
bases from the start of the range (where anchor_gap
is a user-defined parameter), which is identical to the range cell, except that (1) it has a new start coordinate and (2) it stores the real start position in an attribute.
Note that regardless of the number of samples, we do not inject any additional information other than that of the anchors, which is user configurable and turns out to be negligible for real datasets. In other words, this solution leads to linear storage in the number of samples, thus being scalable.
Fast retrieval
The typical access pattern used for variant data involves one or more rectangles covering a set of genomic ranges across one or more samples. In the figure below, let the black rectangle be the user's query. Observe that the results are highlighted in blue (v1, v2, v4, v7
). However, the rectangle misses v1
, i.e., the case where an Indel/CNV range intersects the query rectangle, but the start position is outside the rectangle.
This is the motivation behind anchors. TileDB-VCF expands the user's query range on the left by anchor_gap
. It then reports as results the cells that are included in the expanded query if their end position (stored in an attribute) comes after the query range start endpoint. In the example above, TileDB-VCF retrieves anchor a1
and Indel/CNV v3
. It reports v1
as a result (as it can be extracted directly from anchor a1
), but filters out v3
.
Quite often, the analyses requires data retrieval based on numerous query ranges (up to the order of millions), which must be submitted simultaneously. TileDB-VCF leverages the highly optimized multi-range subarray
But what about updates? That's the topic of the next sub section.
Updates
Interoperability
TileDB-VCF is built in C++ for performance, and comes with an efficient Python API, a command-line tool (CLI) and integrations with Spark and Dask. We are working hard to expand on the APIs through which our users can access the variant data stored in the open-spec TileDB format, and we are happy to accommodate user requests (which can be filed here).
Being able to access the data in various programming languages and computational frameworks unlocks a vast opportunity for utilizing the numerous existing open-source data science and machine learning tools.
What's next?
You can find more detailed documentation about TileDB-VCF (including the installation instructions, API reference and How To guides) in the following dedicated section:
Last updated