Population Genomics

Our vision is to facilitate fast, large-scale genomics research at a fraction of the cost by providing infrastructure that is easy to set up and designed for extreme scale.

Storing genomic variant call data in a collection of VCF files can severely impact the performance of analyses at population scale. It also creates significant data management hassles around scalable compute, governed access and logging.

Motivated by these issues, we developed TileDB-VCF, an open-source C++ library for efficient storage and retrieval of genomic variant call data, built on TileDB Embedded (which is also open source). Coupled with TileDB Cloud, TileDB-VCF is a powerful solution for storing, managing, analyzing and securely sharing enormous volumes of genomic variant data.

TileDB-VCF Features

TileDB-VCF offers the following unique capabilities:

  • Performance: TileDB-VCF can efficiently store any number of samples in a compressed and lossless manner. It can rapidly add new samples (eliminating the so-called N+1 problem), scaling both storage and update time linearly with the number of new samples. TileDB-VCF is optimized for rapidly querying variant records by genomic regions across arbitrary numbers of samples. Built upon TileDB Embedded, it inherits all its features out of the box, such as the speed and optimizations on numerous storage backends, including cloud object stores. Additional features for ingesting VCF files and performing genomic interval intersections are all implemented in C++ for maximum speed.

  • Tooling flexibility: In addition to a command-line interface, TileDB-VCF provides C++ and Python APIs, as well as integrations with Spark and Dask. This allows the user to utilize a plethora of tools from the Data Science ecosystem in a seamless and natural fashion, without having to convert data from one format to another or use unfamiliar APIs.

  • Open-source: TileDB Embedded and TileDB-VCF are both open source and released under the permissive MIT license. We are passionate about supporting the scientific community with open-source software and welcome feedback and contributions to further improve our projects.

  • Battle-tested: TileDB-VCF has been used in production by Helix to manage datasets comprising hundreds of thousands of exomes.
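
To make the N+1 point above concrete, here is a minimal, purely illustrative sketch in Python. It is not the TileDB-VCF API or its storage engine: `ToyVariantStore`, `add_sample` and `query` are hypothetical names for a toy in-memory model. It only shows the idea that adding a sample touches nothing belonging to existing samples, and that a region query reads only the requested samples and positions.

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict


class ToyVariantStore:
    """Toy in-memory model of a per-sample variant store (hypothetical, not TileDB-VCF)."""

    def __init__(self):
        # (sample, contig) -> (sorted positions, records sorted by position)
        self._data = {}

    def add_sample(self, sample, records):
        """Add one sample. Data of existing samples is never read or rewritten,
        so the work is proportional to the size of the new sample only."""
        by_contig = defaultdict(list)
        for contig, pos, ref, alt in records:
            by_contig[contig].append((pos, ref, alt))
        for contig, recs in by_contig.items():
            recs.sort()
            self._data[(sample, contig)] = ([r[0] for r in recs], recs)

    def query(self, contig, start, end, samples):
        """Return records with start <= pos <= end for the requested samples."""
        hits = []
        for sample in samples:
            positions, recs = self._data.get((sample, contig), ([], []))
            lo = bisect_left(positions, start)   # first record at or after start
            hi = bisect_right(positions, end)    # one past the last record at or before end
            hits.extend((sample, contig, *rec) for rec in recs[lo:hi])
        return hits


store = ToyVariantStore()
store.add_sample("S1", [("chr1", 100, "A", "G"), ("chr1", 500, "C", "T")])
store.add_sample("S2", [("chr1", 120, "G", "GA")])

# Region query chr1:90-200 across both samples.
hits = store.query("chr1", 90, 200, ["S1", "S2"])
```

In this toy model, ingesting a third sample leaves the stored data for `S1` and `S2` untouched. A merged multi-sample VCF, by contrast, has to be regenerated whenever a sample is added.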

TileDB-VCF on TileDB Cloud

TileDB-VCF combined with TileDB Cloud can take genomics research to new heights. More specifically, TileDB Cloud adds the following capabilities:

  • Governance: Securely share your genomics datasets with anyone within or beyond your organization, with detailed logging. Define your permissions and audit all activity.

  • Collaboration: Run Jupyter notebooks embedded in the TileDB Cloud console. Share your data, user-defined functions and notebooks with any other TileDB Cloud user, even outside your organization. Explore public datasets and code prepared by others, and contribute your own.

  • Hassle-free scalability: Scale your analyses with ease, leveraging TileDB Cloud's fully serverless, multi-tenant platform. No more spinning up and managing clusters. Ingest and slice in parallel, or design and submit arbitrary task flows composed of thousands of tasks. TileDB Cloud will automatically scale your compute elastically. It will also respect the geographic regions in which you store the data and automatically ship the compute inside those regions to comply with data policies.

  • Low cost: TileDB Cloud charges on a pay-as-you-go basis only for useful compute, and never for idle time. No need to maintain clusters 24/7. Just submit your tasks and TileDB Cloud will take care of the rest, with no need to provision resources in advance.

Getting Started

  • Learn more about the problem

  • Learn about our solution and data model

  • Run the TileDB-VCF Basics tutorial notebook

  • Find more details in Usage and API Reference

  • Explore public datasets and code on TileDB Cloud

  • Contribute your own data and code!

History

TileDB was first used to build a variant store system in 2015, as part of a collaboration between Intel Labs, Intel Health and Life Sciences, MIT and the Broad Institute. The resulting software, GenomicsDB, became part of Broad's GATK 4.0. TileDB-VCF is an evolution of that work, built on an improved and always up-to-date version of TileDB Embedded, incorporating new algorithms, features and optimizations. TileDB-VCF is the product of a collaboration between TileDB, Inc. and Helix.