The de facto file format for storing genomic variant call data is VCF. It comes in two flavors: single-sample VCF and multi-sample VCF (also known as combined VCF or project VCF). Below we explain the problems with each approach, as well as the data engineering effort involved in storing genomic data in a legacy, non-interoperable file format.
Genomic analyses performed on collections of single-sample VCF files typically involve retrieving variant information from particular genomic ranges, for specific subsets of samples, along with any of the provided VCF fields/attributes. Such queries are often repeated millions of times, so it is imperative that data retrieval is fast, scalable, and cost-effective.
However, accessing random locations in thousands, or even hundreds of thousands, of different files is prohibitively expensive, both computationally and monetarily. This is especially true on cloud object stores, where reading each byte range in a file is a separate request that goes over the network. Not only does this introduce non-negligible latency, it can also incur significant costs, because cloud object stores charge for every such request. For a typical analysis involving millions of requests on large collections of VCF files, this quickly becomes unsustainable.
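To make the scaling concrete, here is a minimal back-of-the-envelope sketch in Python. The function names, the assumption of two range requests per lookup (one for the index, one for the data), and the per-request price are all hypothetical placeholders, not figures from any particular cloud provider.

```python
def estimate_range_requests(num_samples: int, num_regions: int,
                            requests_per_lookup: int = 2) -> int:
    """Estimate object-store requests for querying every sample file at every region.

    Assumption (hypothetical): each (sample file, region) lookup costs
    `requests_per_lookup` byte-range requests, e.g. one index read and one data read.
    """
    return num_samples * num_regions * requests_per_lookup


def estimate_request_cost(num_requests: int, price_per_1000_requests: float) -> float:
    """Convert a request count into a monetary cost.

    `price_per_1000_requests` is a placeholder; substitute your provider's actual rate.
    """
    return num_requests / 1000 * price_per_1000_requests


if __name__ == "__main__":
    # 10,000 single-sample VCF files queried at 1,000 genomic regions.
    requests = estimate_range_requests(num_samples=10_000, num_regions=1_000)
    print(f"requests: {requests:,}")
    print(f"cost at a hypothetical rate: {estimate_request_cost(requests, 0.0004):.2f}")
```

The point of the sketch is that the request count grows with the product of samples and regions, so both latency and per-request charges grow with it.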
The problems with single-sample VCF collections were the motivation behind multi-sample VCF, or project VCF (pVCF), files, in which the entire collection of single-sample VCFs is merged into a single file. Once a pVCF file is indexed, specific records can be located very quickly, because data retrieval reduces to a simple, fast linear scan, which minimizes latency and significantly boosts I/O performance. However, this approach comes with significant costs.
First, the size of multi-sample pVCF files can scale superlinearly with the number of included samples. The problem is that individual VCF files contain very sparse data, which the pVCF file densifies by adding dummy information to denote variants missing from the original single-sample VCF file. This means that the combined pVCF solution is not scalable because it can lead to an explosion of storage overhead and high merging cost for large population studies.
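The densification effect can be illustrated with a toy model. The sketch below is a simplification under stated assumptions: each sample is a hypothetical dictionary mapping a site to a genotype string, and the merge fills every site a sample lacks with the VCF missing-genotype value `./.`. Real merging tools handle many more fields and cases.

```python
from typing import Dict, List, Tuple

Site = Tuple[str, int]  # (chromosome, position)


def merge_to_dense(samples: Dict[str, Dict[Site, str]]) -> Tuple[List[Site], Dict[str, List[str]]]:
    """Merge sparse per-sample calls into a dense site-by-sample table.

    Every sample gets a value at every site seen in any sample; sites the
    sample never reported are filled with the placeholder "./.".
    """
    sites = sorted(set().union(*[calls.keys() for calls in samples.values()]))
    dense = {
        name: [calls.get(site, "./.") for site in sites]
        for name, calls in samples.items()
    }
    return sites, dense


def count_cells(samples: Dict[str, Dict[Site, str]]) -> Tuple[int, int]:
    """Return (cells stored sparsely, cells stored after densification)."""
    sparse_cells = sum(len(calls) for calls in samples.values())
    sites, _ = merge_to_dense(samples)
    return sparse_cells, len(sites) * len(samples)


if __name__ == "__main__":
    samples = {
        "s1": {("chr1", 100): "0/1", ("chr1", 200): "1/1"},
        "s2": {("chr1", 100): "0/1", ("chr1", 300): "0/1"},
        "s3": {("chr1", 400): "1/1", ("chr1", 500): "0/1"},
    }
    print(count_cells(samples))
```

In this toy input, three samples hold 6 calls in total, but the merged table has 5 sites times 3 samples, or 15 cells, most of them filler. Each additional sample tends to add both a column and new rows, which is why the merged size can grow faster than the number of samples.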
Another problem is that the multi-sample pVCF file cannot be updated in place. Because the sample-related information is listed in the last columns of the file, inserting a new sample requires creating a new column and appending a value to the end of every line. This effectively means that a brand new pVCF must be generated with every update. If the pVCF is large (typically on the order of many GBs or even TBs), the insertion process will be extremely slow. This is often referred to as the N+1 problem.
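The following minimal sketch shows why adding one sample touches the whole file. It assumes a simplified, hypothetical in-memory pVCF represented as a list of tab-separated lines whose only FORMAT field is GT, and it ignores the case where the new sample introduces sites that are not yet in the file.

```python
from typing import Dict, List, Tuple

Site = Tuple[str, int]  # (chromosome, position)


def append_sample(pvcf_lines: List[str], sample_name: str,
                  calls: Dict[Site, str]) -> Tuple[List[str], int]:
    """Return a new pVCF with one extra sample column, plus the number of lines rewritten.

    Meta lines ("##...") are copied unchanged. The column header and every
    data line must be rewritten to carry the new trailing column.
    """
    out: List[str] = []
    rewritten = 0
    for line in pvcf_lines:
        if line.startswith("##"):
            out.append(line)
        elif line.startswith("#CHROM"):
            out.append(line + "\t" + sample_name)
            rewritten += 1
        else:
            fields = line.split("\t")
            site = (fields[0], int(fields[1]))
            # Sites the new sample did not report get the missing genotype.
            out.append(line + "\t" + calls.get(site, "./."))
            rewritten += 1
    return out, rewritten


if __name__ == "__main__":
    pvcf = [
        "##fileformat=VCFv4.2",
        "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1",
        "chr1\t100\t.\tA\tG\t.\t.\t.\tGT\t0/1",
        "chr1\t200\t.\tC\tT\t.\t.\t.\tGT\t1/1",
    ]
    updated, rewritten = append_sample(pvcf, "S2", {("chr1", 200): "0/1"})
    print(rewritten, "lines rewritten")
```

The work is proportional to the number of records in the file rather than to the size of the new sample, which is the essence of the N+1 problem.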
Regardless of whether you deal with single- or multi-sample VCF, the typical way of accessing these files is via custom CLI tools, such as bcftools. Those tools are extremely useful for analysis that can be performed on a local machine (e.g., a laptop). However, they become unwieldy when scalable analysis requires numerous machines working in parallel. Spinning up new nodes, deploying the software, and orchestrating the parallel analysis consume the majority of researchers' time and hinder progress.
Furthermore, domain-specific tools either reinvent much of the work already done in general-purpose data science tools, or they miss out on it entirely. This leads researchers to write custom code for converting VCF data into a format that a programming language (e.g., Python or R) or a data science tool (e.g., pandas or Spark) understands. This data access and conversion cost, rather than the actual analysis, often becomes the bottleneck.
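As an illustration of the kind of glue code this involves, here is a minimal sketch that flattens VCF text lines into one record per sample and site. The function name and the choice of output columns are hypothetical, and the parser deliberately ignores most of the VCF specification (INFO parsing, multi-allelic handling, compression).

```python
from typing import Dict, Iterable, List


def vcf_to_records(lines: Iterable[str]) -> List[Dict[str, object]]:
    """Flatten VCF lines into a list of per-sample dictionaries."""
    records: List[Dict[str, object]] = []
    samples: List[str] = []
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue  # skip blank lines and meta-information lines
        if line.startswith("#CHROM"):
            samples = line.split("\t")[9:]  # sample names follow the FORMAT column
            continue
        fields = line.split("\t")
        chrom, pos, _id, ref, alt = fields[:5]
        format_keys = fields[8].split(":")
        for sample, value in zip(samples, fields[9:]):
            call = dict(zip(format_keys, value.split(":")))
            records.append({
                "sample": sample,
                "chrom": chrom,
                "pos": int(pos),
                "ref": ref,
                "alt": alt,
                "GT": call.get("GT"),
            })
    return records


if __name__ == "__main__":
    vcf = [
        "##fileformat=VCFv4.2",
        "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2",
        "chr1\t100\t.\tA\tG\t.\t.\t.\tGT:DP\t0/1:12\t1/1:8",
    ]
    for record in vcf_to_records(vcf):
        print(record)
```

A list of dictionaries like this can be passed to `pandas.DataFrame(records)`, but every team ends up maintaining some variant of this conversion, and running it over large collections is itself a significant cost.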
Finally, access control on flat files and activity logging with clear semantics are extremely challenging. This poses great obstacles to collaboration within and across organizations. Effectively, a brand new infrastructure needs to be built on top of the VCF files to provide foundational data management functionality, similar to what traditional databases offer for tabular data.
TileDB-VCF and TileDB Cloud were developed to address these challenges. Read the Solution page to find out how.