TileDB-VCF is an open-source C++ library for efficient storage and retrieval of genomic variant call data. It is built on TileDB Embedded, which is also open source.
TileDB was first used to build a variant store system in 2015, as part of a collaboration between Intel Labs, Intel Health and Life Sciences, MIT and the Broad Institute. The resulting software, GenomicsDB, became part of Broad's GATK 4.0. TileDB-VCF is an evolution of that work: it is built on an improved, continually updated version of TileDB Embedded and incorporates new algorithms, features and optimizations. TileDB-VCF is the product of a collaboration between TileDB, Inc. and Helix.
Why use TileDB-VCF?
TileDB-VCF offers several important benefits:
Performance: TileDB-VCF is optimized for rapidly slicing variant records by genomic regions across arbitrary numbers of samples. Additional features for ingesting VCF files and performing genomic interval intersections are all implemented in C++ for maximum speed.
Compressibility: TileDB-VCF can efficiently store any number of samples in a compressed and lossless manner. Being a columnar format, TileDB-VCF can compress different VCF fields with different compressors, choosing the most appropriate one depending on the data types and nature of the data per field.
Optimized for cloud object stores: Because it is built on TileDB Embedded, TileDB-VCF inherits its features out of the box, including fast, optimized access to numerous storage backends such as the cloud object stores AWS S3, Azure Blob Storage and Google Cloud Storage.
Updatability: TileDB-VCF allows you to rapidly add new samples to existing datasets. This eliminates the so-called N+1 problem, in which adding one sample to a dataset of N samples requires reprocessing the whole dataset. Both storage and update time scale linearly with the number of new samples.
Interoperability: In addition to a command-line interface, TileDB-VCF provides C++ and Python APIs, as well as integrations with Spark and Dask. This lets you use the many tools of the data science and machine learning ecosystems directly, without converting data from one format to another or learning unfamiliar APIs.
Open-source: TileDB Embedded and TileDB-VCF are both released under the permissive MIT license. We are passionate about supporting the scientific community with open-source software and welcome feedback and contributions to further improve our projects.
Battle-tested: TileDB-VCF has been used in production by very large customers to manage immense amounts of variant data, on the order of many hundreds of thousands of samples.
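To make the compressibility point above more concrete, here is a minimal Python sketch of per-field compression in a columnar layout. It does not use TileDB-VCF or reflect its internals: the records are made up, the field names are hypothetical, and the choice of codec per column (delta encoding plus zlib for positions, bz2 for alleles, lzma for qualities) is an arbitrary assumption for illustration, using only the Python standard library.

```python
import bz2
import lzma
import struct
import zlib

# Toy variant records: (contig, position, ref, alt, quality). All values are made up.
records = [
    ("chr1", 10_000 + 37 * i, "ACGT"[i % 4], "TGCA"[i % 4], 30.0 + (i % 5))
    for i in range(1_000)
]
n = len(records)

# Columnar layout: one list per field instead of one tuple per record.
contigs, positions, refs, alts, quals = map(list, zip(*records))


def delta_encode(values):
    """Replace each value with its difference from the previous one."""
    prev = 0
    out = []
    for v in values:
        out.append(v - prev)
        prev = v
    return out


def delta_decode(deltas):
    """Invert delta_encode with a running sum."""
    total = 0
    out = []
    for d in deltas:
        total += d
        out.append(total)
    return out


# Each column gets its own encoding and codec (an illustrative choice only).
compressed = {
    "contig": zlib.compress("\n".join(contigs).encode("ascii")),
    "pos": zlib.compress(struct.pack(f"<{n}q", *delta_encode(positions))),
    "ref": bz2.compress("".join(refs).encode("ascii")),
    "alt": bz2.compress("".join(alts).encode("ascii")),
    "qual": lzma.compress(struct.pack(f"<{n}d", *quals)),
}

# Decompress a few columns to confirm the round trip is lossless.
restored_positions = delta_decode(
    struct.unpack(f"<{n}q", zlib.decompress(compressed["pos"]))
)
restored_refs = list(bz2.decompress(compressed["ref"]).decode("ascii"))
restored_quals = list(struct.unpack(f"<{n}d", lzma.decompress(compressed["qual"])))

for name, blob in compressed.items():
    print(f"{name}: {len(blob)} bytes compressed")
```

Because each field is stored on its own, sorted integer positions can be delta-encoded before compression while text and floating-point fields use other codecs, and every column still decompresses to exactly the original values.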
The documentation of TileDB-VCF includes information about the data model (and its mapping to 3D sparse arrays), installation instructions, How To guides and the API reference: