We currently focus on the following use cases in Genomics:
Our vision is to facilitate fast, large-scale genomics research at a fraction of the cost by providing infrastructure that is easy to setup and designed for extreme scale.
Storing genomic variant call data in a collection of VCF files can severely impact the performance of analyses at population scale. Motivated by these issues, we developed TileDB-VCF
In a nutshell: Storing vast quantities of genomic variant samples in legacy file formats leads to slow access, especially on the cloud. Merging the files into multi-sample files is not scalable or maintainable, resulting in the so called N+1 problem.
The de facto file format for storing genomic variant call data is VCF. This comes in two flavors, single-sample and multi-sample VCF (other names include combined VCF and project VCF, etc). Below we explain the problems with each of those approaches, as well as the data engineering effort involved when storing genomic data in a legacy, non-interoperable file format.
Genomic analyses performed on collections of single-sample VCF files typically involve retrieving variant information from particular genomic ranges, for a specific subsets of samples, along with any of the provided VCF fields/attributes. Such queries are often repeated millions of times, so it is imperative that data retrieval is performed in a fast, scalable, and cost-effective manner.
However, accessing random locations in thousands—or hundreds of thousands—of different files is prohibitively expensive, both computationally and monetarily. This is especially true on cloud object stores, where each byte range in a file is a separate request that goes over the network. Not only does this introduce non-negligible latency, it can also incur significant costs as cloud object stores charge for every such request. For a typical analysis involving millions of requests on large collections of VCF files, this quickly becomes unsustainable.
The problems with single-sample VCF collections was the motivation behind multi-sample VCF, or project/population VCF (pVCF), files in which the entire collection of single-sample VCFs is merged into a single file. When indexed, specific records can be located within a pVCF file very quickly, as data retrieval is reduced to a simple, super-fast linear scan, minimizing latency, and significantly boosting I/O performance. However, this approach comes with significant costs.
First, the size of multi-sample pVCF files can scale superlinearly with the number of included samples. The problem is that individual VCF files contain very sparse data, which the pVCF file densifies by adding dummy information to denote variants missing from the original single-sample VCF file. This means that the combined pVCF solution is not scalable because it can lead to an explosion of storage overhead and high merging cost for large population studies.
Another problem is that the multi-sample pVCF file cannot be updated. Due to the fact that the sample-related information is listed in the last columns of the file, a new sample insertion will need to create a new column and inject values at the end of every line. This effectively means that a brand new pVCF will need to be generated with every update. If pVCF is large (typically in the order of many GBs or even TBs), the insertion process will be extremely slow. This is often referred to as the N+1 problem.
Regardless of whether you deal with single- or multi-sample VCF, the typical way of accessing these files is via custom CLI tools, such as bcftools. Those tools are extremely useful for analysis that can be performed on a local machine (e.g., a laptop). However, they become unwieldy when scalable analysis necessitates the use of numerous machines working in parallel. Spinning up new nodes, deploying the software and orchestrating the parallel analysis consumes the majority of the time of researchers and hinders progress.
Furthermore, domain-specific tools either re-invent a lot of the work that has been done in general-purpose data science tools, or they miss out on them. This leads the researchers to writing custom code for converting the VCF data into a format that a programming language (e.g., Python and R) or a data science tool (e.g., pandas or Spark) understands. This data access and conversion cost often becomes the bottleneck, instead of the actual analysis.
Population variant data can be efficiently represented using a 3D sparse array. For each sample, imagine a 2D plane where the vertical axis is the contig and the horizontal axis is the genomic position. Every variant can be represented as a range within this plane; it can be unary (i.e., a SNP) or it can be a longer range (e.g., INDEL or CNV). Each sample is then indexed by a third dimension, which is unbounded to accommodate populations of any size. The figure below shows an example for one sample, with several variants distributed across contigs chr1
, chr2
and chr3
.
In TileDB-VCF, we represent the start position of each range as a non-empty cell in a sparse array (black squares in the above figure). In each of those array cells, we store the end position of each cell (to create a range) along with all other fields in the corresponding single-sample VCF files for each variant (e.g., REF
, ALT
, etc.). Therefore, for every sample, we map variants to 2D non-empty sparse array cells.
To facilitate rapid retrieval of interval intersections (explained in the next section), we also inject anchors (green square in the above figure) to breakup long ranges. Specifically, we create a new non-empty cell every anchor_gap
bases from the start of the range (where anchor_gap
is a user-defined parameter), which is identical to the range cell, except that (1) it has a new start coordinate and (2) it stores the real start position in an attribute.
Note that regardless of the number of samples, we do not inject any additional information other than that of the anchors, which is user configurable and turns out to be negligible for real datasets. In other words, this solution leads to linear storage in the number of samples, thus being scalable.
The typical access pattern used for variant data involves one or more rectangles covering a set of genomic ranges across one or more samples. In the figure below, let the black rectangle be the user's query. Observe that the results are highlighted in blue (v1, v2, v4, v7
). However, the rectangle misses v1
, i.e., the case where an Indel/CNV range intersects the query rectangle, but the start position is outside the rectangle.
This is the motivation behind anchors. TileDB-VCF expands the user's query range on the left by anchor_gap
. It then reports as results the cells that are included in the expanded query if their end position (stored in an attribute) comes after the query range start endpoint. In the example above, TileDB-VCF retrieves anchor a1
and Indel/CNV v3
. It reports v1
as a result (as it can be extracted directly from anchor a1
), but filters out v3
.
Quite often, the analyses requires data retrieval based on numerous query ranges (up to the order of millions), which must be submitted simultaneously. TileDB-VCF leverages the highly optimized multi-range subarray
But what about updates? That's the topic of the next sub section.
TileDB-VCF is built in C++ for performance, and comes with an efficient Python API, a command-line tool (CLI) and integrations with Spark and Dask. We are working hard to expand on the APIs through which our users can access the variant data stored in the open-spec TileDB format, and we are happy to accommodate user requests (which can be filed here).
Being able to access the data in various programming languages and computational frameworks unlocks a vast opportunity for utilizing the numerous existing open-source data science and machine learning tools.
You can find more detailed documentation about TileDB-VCF (including the installation instructions, API reference and How To guides) in the following dedicated section:
Multi-dimensional arrays have been around for a long time. However, there have been two misconceptions about arrays:
Arrays are used solely in scientific applications. This is mainly due to their massive use in Python, Matlab, R, machine learning and other scientific applications. There is absolutely nothing wrong with arrays capturing scientific use cases. On the contrary, such applications are important and challenging, and there is no relational database that can efficiently accommodate them.
Arrays are only dense. Most array systems (i.e., storage engines or databases) built before TileDB focused solely on dense arrays. Despite their suitability for a wide spectrum of use cases, dense arrays are inadequate for sparse problems, such as genomics, LiDAR and tables. Sparse arrays have been ignored and, therefore, no array system was able to claim universality.
The sky is the limit in terms of applicability for a system that supports both dense and sparse arrays. An image is a 2D dense array, where each cell is a pixel that can store the RGBA color values. Similarly a video is a 3D dense array, two dimensions for the frame images and a third one for the time. LiDAR is a 3D sparse array with float coordinates. Genomic variants can be modeled by a 3D array where the dimensions are the sample name (string), the chromosome (string) and the position (integer). Time series tick data can be modeled by a 2D array, with time and tick symbol as labeled dimensions (this can of course be extended arbitrarily to a ND dense or sparse array). Similarly, weather data can be modeled with a 2D dense array with float labels (the lat/lon real coordinates). Graphs can be modeled as (sparse 2D) adjacency matrices. Finally, a flat file can be stored as a simple 1D dense array where each cell stores a byte.
But what about tabular data? Arrays have a lot of flexibility here. In the most contrived scenario, we can store a table as a set of 1D arrays, one per column (similar to Parquet for those familiar with it). This is useful if we want to slice a range of rows at a time. Alternatively, we can store a table as a ND sparse array, using a subset of columns as the dimensions. That would allow rapid slicing on the dimension columns. Finally, we can use labeled dense arrays as explained above for the time series tick data.
You may wonder how we can make all these decisions about dimensions vs. attributes and dense vs. sparse for each application. To answer that, we need to understand how dense and sparse arrays lay data out on the storage medium, and what factors affect performance when slicing, which is the focus of the key concepts and data format section:
In addition, check out the various TileDB use cases in more detail: