Introduction

TileDB-VCF is TileDB's solution for storing and analyzing very large amounts of genomic variant data (gVCF) at scale. Our vision is to enable large-scale genomics applications at a low cost.

Genomic variant data is traditionally stored as a collection of gVCF files, which limits the scalability of common operations such as data merging. TileDB-VCF models the same information as a rapidly updatable sparse matrix. This solves the notorious N+1 problem, in which adding one new sample to an existing cohort of N samples requires re-merging the whole collection, and lets you scale population genetics analysis to millions of samples in a cost-effective, cloud-based manner. TileDB-VCF is written in C++ and comes with C++ and Python APIs (with more to follow), as well as integrations with Spark and Dask. This lets you use a wide range of tools from the data science ecosystem without converting data from one format to another or learning unfamiliar APIs, while retaining high performance.
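To illustrate the sparse-matrix idea, here is a toy sketch in plain Python. It is not TileDB-VCF's actual schema or API: the ToyVariantStore class, its (sample, contig, position) coordinates, and the example records are all assumptions made for illustration. It only shows why, in such a model, adding the (N+1)-th sample can be an append rather than a re-merge of the existing N samples.

```python
from collections import defaultdict


class ToyVariantStore:
    """Toy sparse matrix: only non-empty (sample, contig, pos) cells are stored."""

    def __init__(self):
        # Hypothetical layout: one dict of cells per sample.
        self._cells = defaultdict(dict)

    def ingest_sample(self, sample, records):
        """Add one sample's records; cells of other samples are not touched."""
        for contig, pos, genotype in records:
            self._cells[sample][(contig, pos)] = genotype

    def samples(self):
        """Return the sorted names of all ingested samples."""
        return sorted(self._cells)

    def cell_count(self):
        """Return the total number of stored (non-empty) cells."""
        return sum(len(cells) for cells in self._cells.values())


store = ToyVariantStore()
store.ingest_sample("S1", [("chr1", 100, "0/1"), ("chr1", 250, "1/1")])
store.ingest_sample("S2", [("chr1", 100, "0/0")])

# The "N+1" step: a new sample arrives later and is simply appended.
store.ingest_sample("S3", [("chr2", 42, "0/1")])
```

Because empty cells are never materialized, the storage cost in this toy model grows with the number of records actually ingested, not with the full samples-by-positions grid.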

TileDB was first used to build a variant store system in 2015, as part of a collaboration between Intel Labs, Intel Health and Life Sciences, MIT and the Broad Institute. The resulting software, GenomicsDB, became part of Broad's GATK 4.0. TileDB-VCF is an evolution of that work, built on an improved and always up-to-date version of TileDB, and incorporating new algorithms, features and optimizations. TileDB-VCF is the product of a collaboration between TileDB, Inc. and Helix.

The TileDB-VCF library exposes several interfaces for ingesting and reading variant data, and links against the TileDB storage manager library to persist the data. TileDB-VCF uses htslib during ingestion to parse the VCF/BCF files, and then uses a custom algorithm both to ingest the data and to extract it.

Once the data is ingested, TileDB-VCF allows you to rapidly query it by genomic region across arbitrary numbers of samples, returning all VCF records that intersect any of the query regions. TileDB-VCF takes advantage of all features of core TileDB, such as cloud-optimized queries and parallelism. With TileDB-VCF you get linear scalability for VCF data storage, and queries that are many times faster than bcftools -R or other existing methods.
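The query semantics described above (return every record that intersects any of the query regions, optionally for a subset of samples) can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not TileDB-VCF's API or implementation: the intersects and query functions, the tuple layout, and the example records are hypothetical, and the linear scan is used only for clarity.

```python
def intersects(record_start, record_end, region_start, region_end):
    """True if two closed, 1-based intervals share at least one base."""
    return record_start <= region_end and region_start <= record_end


def query(records, regions, samples=None):
    """Return records overlapping any region, optionally limited to given samples.

    records: iterable of (sample, contig, start, end) tuples
    regions: iterable of (contig, start, end) tuples
    """
    wanted = set(samples) if samples is not None else None
    hits = []
    for sample, contig, start, end in records:
        if wanted is not None and sample not in wanted:
            continue
        if any(contig == r_contig and intersects(start, end, r_start, r_end)
               for r_contig, r_start, r_end in regions):
            hits.append((sample, contig, start, end))
    return hits


records = [
    ("S1", "chr1", 100, 100),    # single-base variant
    ("S1", "chr1", 500, 1500),   # gVCF-style block spanning a range
    ("S2", "chr1", 2000, 2000),
    ("S2", "chr2", 100, 100),
]
regions = [("chr1", 1000, 1200), ("chr2", 50, 150)]
result = query(records, regions)
```

Note that the block record starting at position 500 is returned even though it begins outside the first region: a record qualifies when it overlaps a region, not only when it starts inside one.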