TileDB offers a powerful solution, called TileDB-VCF, for storing and analyzing very large amounts of genomic variant data (gVCF) at scale. Our vision is to enable large-scale genomics applications at a low cost.
Genomic information is traditionally stored in a collection of gVCF files, which limits the scalability of common operations like data merging. TileDB-VCF models the same information as a rapidly updatable sparse matrix, solving the notorious N+1 problem and allowing population genetics analyses to scale arbitrarily to millions of samples in a cost-effective, cloud-based manner. TileDB-VCF is written in C++ and comes with C++ and Python APIs (with more to follow soon), as well as integrations with Spark and Dask. It lets you use a plethora of tools from the Data Science ecosystem in a seamless and natural fashion, all without converting data from one format to another or learning unfamiliar APIs, and while always enjoying top performance.
TileDB was first used to build a variant store system in 2015, as part of a collaboration between Intel Labs, Intel Health and Life Sciences, MIT and the Broad Institute. The resulting software, GenomicsDB, became part of Broad's GATK 4.0. TileDB-VCF is an evolution of that work, built on an improved and always up-to-date version of TileDB, and incorporating new algorithms, features and optimizations. TileDB-VCF is the product of a collaboration between TileDB, Inc. and Helix.
In the remainder of the page we describe how TileDB-VCF helps you store, update and analyze huge quantities of variant data at scale.
DNA is the blueprint of life, as it encodes the necessary information to create an organism. DNA’s vocabulary is a set of chemical units, bases, that are represented with the characters A, T, C, and G. The human genome is an ordered sequence of about 3 billion of these bases, organized across two copies of 23 distinct chromosomes – one from the mother’s side and the other from the father’s. This ordered sequence is packaged and carried around in almost every cell of an organism.
Next generation sequencing (NGS) methods chop this sequence, your genome, into millions of smaller pieces and generate digital representations of those pieces, called reads. The reads are packaged in a digital file, which undergoes the various downstream analyses of a typical pipeline, like the (overly simplified) one shown in the figure below. First, your DNA sample goes into a genomic sequencer that produces a compressed text file in a format called FASTQ. This file is typically around 150 GB in size and pretty much unusable at this stage, as it contains reads of your genome with multiplicities and overlaps, without determining which read corresponds to which exact location in your genome. In the next stage of the pipeline, the FASTQ file is processed by an aligner, which uses a so-called reference genome (in FASTA format, similar to FASTQ) as a guide to map the scattered reads to the global genome order, producing a binary file in BAM format. This file is also in the order of 150 GB.
Finally, the BAM file is processed by software that can recreate the exact sequence of your genome, accounting for various errors and biases. This software, the variant caller, produces a file called VCF (or BCF, a compressed binary alternative). This file is much smaller, in the order of 100 MB-500 MB for whole exomes or a few GB for whole genomes (these numbers vary based on coverage and other factors, but they are always much smaller than the original BAM file). The VCF file is smaller because the format records information only about potential differences between your genome and the reference genome, called variants. The genomes of any two individuals differ by about 0.1%, so these differences are expected to be very few compared to the ~3 billion bases of the human genome. The VCF files of individuals, the single-sample VCFs, can be combined into multi-sample (or population) VCF files using specialized merging software.
Let's take a look at a 3-sample VCF file, i.e. a VCF file that encodes variant information for three samples (see the VCF specification for details).
##fileformat=VCFv4.3
##fileDate=20090805
##source=myImputationProgramV3.1
##reference=file:///seq/references/1000GenomesPilot-NCBI36.fasta
##contig=<ID=20,length=62435964,assembly=B36,md5=f126cdf8a6e0c7f379d618ff66beb2da,species="Homo sapiens",taxonomy=x>
##phasing=partial
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">
##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">
##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">
##FILTER=<ID=q10,Description="Quality below 10">
##FILTER=<ID=s50,Description="Less than 50% of samples have data">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002 NA00003
20 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,.
20 17330 . T A 3 q10 NS=3;DP=11;AF=0.017 GT:GQ:DP:HQ 0|0:49:3:58,50 0|1:3:5:65,3 0/0:41:3
20 1110696 rs6040355 A G,T 67 PASS NS=2;DP=10;AF=0.333,0.667;AA=T;DB GT:GQ:DP:HQ 1|2:21:6:23,27 2|1:2:0:18,2 2/2:35:4
20 1230237 . T . 47 PASS NS=3;DP=13;AA=T GT:GQ:DP:HQ 0|0:54:7:56,60 0|0:48:4:51,51 0/0:61:2
20 1234567 microsat1 GTC G,GTCT 50 PASS NS=3;DP=9;AA=G GT:GQ:DP 0/1:35:4 0/2:17:2 1/1:40:3
At the beginning there is a header (the first 18 lines in the example above) that lists various information, e.g., the file format, the reference genome, the order of the encoded information, etc. After that, each line corresponds to a variant, which can be a base change (called a SNP), an insertion or deletion (called an INDEL), or a gVCF range stating the probability that some base in the range is not the same as in the reference genome. Each line also contains the corresponding base values in the reference genome, along with many other fields. The sample-specific information is located in the last 3 columns.
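To make the layout above concrete, here is a minimal toy sketch (not TileDB-VCF code) of splitting one VCF data line into its fixed columns, INFO key/value pairs and per-sample FORMAT fields; real pipelines would use an htslib-based library instead:

```python
# Toy VCF record parser, for illustration only.

def parse_vcf_record(line, sample_names):
    """Split a tab-delimited VCF data line into fixed and per-sample fields."""
    fields = line.split("\t")
    record = {
        "CHROM": fields[0],
        "POS": int(fields[1]),
        "ID": fields[2],
        "REF": fields[3],
        "ALT": fields[4].split(","),  # ALT may list multiple alleles
        "QUAL": fields[5],
        "FILTER": fields[6],
        # INFO is a semicolon-separated list of KEY=VALUE pairs or bare flags
        "INFO": dict(kv.split("=", 1) if "=" in kv else (kv, True)
                     for kv in fields[7].split(";")),
    }
    # Column 9 names the FORMAT keys; columns 10+ hold one value string per sample
    fmt_keys = fields[8].split(":")
    record["SAMPLES"] = {
        name: dict(zip(fmt_keys, values.split(":")))
        for name, values in zip(sample_names, fields[9:])
    }
    return record

# The first data line of the example above
line = ("20\t14370\trs6054257\tG\tA\t29\tPASS\t"
        "NS=3;DP=14;AF=0.5;DB;H2\tGT:GQ:DP:HQ\t"
        "0|0:48:1:51,51\t1|0:48:8:51,51\t1/1:43:5:.,.")
rec = parse_vcf_record(line, ["NA00001", "NA00002", "NA00003"])
print(rec["POS"], rec["SAMPLES"]["NA00003"]["GT"])  # 14370 1/1
```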
Bioinformaticians run all sorts of analyses on collections of VCF data from small or large populations. One can use the data to discover disease associations or even quantify the genetic evidence used to design new treatments. A typical analysis requires the retrieval of variant information from specific genomic ranges (e.g., chr1:10-100, chr2:1000-1500), any subset of samples, and any subset of the VCF fields. This analysis can be repeated many, even millions of, times. Therefore, it is imperative that the retrieval of such information be fast, scalable and cost-effective, in order to facilitate and empower one's research.
Suppose you wish to analyze a large collection of single-sample VCF files, say 100,000 of them. Slicing genomic regions across all those files is prohibitively costly in terms of performance, since your software will have to "jump" to a random location in each of those files to retrieve the desired data. This leads to large latencies and high IO cost, which cripples the overall performance (retrieving the data ends up being much slower than actually processing it).
This simple task was the motivation behind multi-sample VCF files, i.e., merging the entire collection of single-sample VCFs into a single file, where each line groups the information from all samples for a given genomic position or range. This line can be located very quickly via some indexing incorporated in the VCF file and the retrieval of the relevant data is reduced to a simple, super-fast linear scan of the line in the file, minimizing latency and significantly boosting IO performance.
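The fast lookup described above can be sketched in a few lines. The following toy example (real VCF indexes, such as tabix, are far more elaborate) binary-searches the sorted position column and then scans linearly:

```python
import bisect

# Toy sketch of why slicing a multi-sample VCF is fast: records are sorted by
# position, so an index can jump to the first relevant line and read linearly.

positions = [5, 6, 9, 10, 15, 17]            # sorted POS column of the merged file
lines = ["line@%d" % p for p in positions]   # stand-ins for the full VCF lines

def slice_region(start, end):
    """Return all lines whose position falls in [start, end]."""
    lo = bisect.bisect_left(positions, start)   # one O(log n) "jump"
    hi = bisect.bisect_right(positions, end)    # then a linear scan
    return lines[lo:hi]

print(slice_region(6, 15))  # ['line@6', 'line@9', 'line@10', 'line@15']
```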
This improvement comes at a great cost. Consider the following 3 single-sample VCF files, omitting most of the fields.

s1.vcf:

#CHROM POS ... s1
1 5 ... <entry_1>
1 10 ... <entry_2>

s2.vcf:

#CHROM POS ... s2
1 6 ... <entry_1>
1 15 ... <entry_2>

s3.vcf:

#CHROM POS ... s3
1 9 ... <entry_1>
1 17 ... <entry_2>
Ignoring the rest of the fields, each of those files has 2 entries in its sample column (s1, s2 and s3, respectively). Now, merging s1.vcf and s2.vcf, you get something like this:
#CHROM POS ... s1 s2
1 5 ... <entry_1> <entry_2>
1 6 ... <entry_3> <entry_4>
1 10 ... <entry_5> <entry_6>
1 15 ... <entry_7> <entry_8>
From a total of 4 entries on the sample columns observed in the two single-sample VCF files, we now observe 8 entries in the merged file
s1_s2.vcf. If we merge all three files, we get:
#CHROM POS ... s1 s2 s3
1 5 ... <entry_1> <entry_2> <entry_3>
1 6 ... <entry_4> <entry_5> <entry_6>
1 9 ... <entry_7> <entry_8> <entry_9>
1 10 ... <entry_10> <entry_11> <entry_12>
1 15 ... <entry_13> <entry_14> <entry_15>
1 17 ... <entry_16> <entry_17> <entry_18>
From a total of 6 entries on the sample columns observed in the three single-sample VCF files, we now observe 18 entries in the merged file
s1_s2_s3.vcf! In the general case, the size of a multi-sample VCF file scales superlinearly in the number of samples it encompasses. The problem is essentially that the single-sample VCF files have some very sparse data, which the multi-sample VCF file densifies by adding dummy information on the sample columns to comply with the VCF format. This means that the multi-sample VCF solution is not scalable and can lead to storage overhead explosion and very high merging cost for very large population studies.
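The blow-up above can be sketched with a back-of-the-envelope computation: a merged multi-sample VCF stores one entry per (distinct position) x (sample) pair, even where a sample had no record, so entries grow superlinearly as samples are merged.

```python
# Entry counts for the toy s1/s2/s3 example above.

single_sample_positions = {
    "s1": {5, 10},
    "s2": {6, 15},
    "s3": {9, 17},
}

def merged_entries(samples):
    """Entries in the sample columns of the merged VCF of the given samples."""
    union = set().union(*(single_sample_positions[s] for s in samples))
    return len(union) * len(samples)

original = sum(len(p) for p in single_sample_positions.values())
print(original)                            # 6 entries across the 3 files
print(merged_entries(["s1", "s2"]))        # 8 entries in s1_s2.vcf
print(merged_entries(["s1", "s2", "s3"]))  # 18 entries in s1_s2_s3.vcf
```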
Another problem is that a multi-sample VCF file cannot be updated. Because the sample-related information is listed in the last columns of the file, inserting a new sample requires creating a new column and injecting values at the end of every line. This effectively means that a brand new multi-sample VCF must be generated. If the multi-sample file is large (typically in the order of many GBs or even TBs), the insertion process will be extremely slow.
The two aforementioned issues with the multi-sample VCF files, storage explosion and lack of data update support, are referred to as the "N+1 problem". So how do we solve this problem?
Representing population variants is in fact a 2D problem. Imagine a 2D plane, where the vertical axis is the samples (it can be unbounded), and the horizontal axis is the global domain of genomic positions, which has a range of about 3 billion for the human genome. The figure below shows an example with 3 samples, s1, s2 and s3. Every variant can be represented as a range; it can be unary (e.g., a SNP), in which case we draw it as a bullet, or it can be a longer range (e.g., an INDEL or gVCF range), depicted as a horizontal line segment. Each row in this plane now contains all the information in a single-sample VCF file. The access pattern during the analysis is typically a rectangle (or multiple ones) covering a set of samples and a set of genomic positions, like the blue one in the figure. The results comprise every variant that intersects the rectangle, namely
v8 in the figure (also depicted in blue).
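The rectangle query above can be sketched as a simple intersection test. The variant names and coordinates below are made up for illustration; they are not the ones in the figure:

```python
# Each variant is a (sample, start, end) range on the 2D plane described above.

variants = {
    "v1": ("s1", 100, 100),    # a SNP: a unary range
    "v2": ("s2", 150, 1200),   # a longer gVCF range
    "v3": ("s3", 300, 450),    # an INDEL-like range
    "v4": ("s1", 2000, 2100),
}

def rect_query(samples, q_start, q_end):
    """Return every variant whose range intersects the query rectangle."""
    return sorted(
        name for name, (sample, start, end) in variants.items()
        if sample in samples and start <= q_end and end >= q_start
    )

print(rect_query({"s1", "s2", "s3"}, 120, 400))  # ['v2', 'v3']
```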
Now suppose we represent the end position of each range as a non-empty cell in a 2D sparse array. We illustrate the end positions in the figure below as squares. In each of those array cells, we can store as array attributes the start position of the range, as well as all the other fields in the corresponding single-sample VCF file for each variant (e.g., ALT, etc.). Therefore, for every sample, we can map variants to non-empty cells of a 2D sparse array. Issuing the query rectangle as a 2D range query in the array, we get the non-empty cells that fall inside the rectangle as results, namely
v3 in the figure. However, this approach misses variants such as v8, whose ranges intersect the query but whose end positions happen not to fall inside it. Therefore, we are not done yet; we need to perform one more tweak.
Suppose we choose a configuration parameter that determines the maximum length of a gVCF range. We pick 1000 bases as a default after some empirical fine-tuning, but this is fully configurable by the user. We then "cut" each gVCF range that exceeds 1000 bases into 1000-base-long pieces, introducing "anchors" as shown in the figure, which duplicate the information of the variant. We store those anchors as non-empty array cells as well. Finally, we issue the query after expanding it by exactly 1000 bases on the genomic position axis. It is easy to see that the expanded query is guaranteed to include either the end position of an intersecting range (e.g., v5), or an anchor of an intersecting range.
In other words, by representing variants as non-empty cells in a 2D sparse array, and with the additional use of anchors and expanded queries, we managed to model population variants as 2D sparse arrays. In addition, note that regardless of the number of samples, we do not inject any additional information other than the anchors, which turn out to be negligible for real datasets. As a result, this solution leads to storage that is linear in the number of samples, and is thus scalable. And if you store the variants in a powerful array engine that offers rapid updates, then you have just solved the N+1 problem! Enter TileDB...
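The anchor scheme above can be sketched in a few lines. The samples and coordinates below are made up for illustration, and attribute handling is greatly simplified compared to TileDB-VCF:

```python
# Each range is stored at its END position, long ranges get an "anchor" cell
# every MAX_LEN bases, and the query start is expanded leftward by MAX_LEN.

MAX_LEN = 1000  # maximum gVCF range length (configurable; 1000 is the default)

# Build the non-empty cells: one cell per anchor plus one at the true end.
# Each cell keeps the range's true start and end as attributes.
array_cells = set()
for sample, start, end in [("s1", 100, 3500), ("s2", 200, 1400)]:
    for pos in list(range(start + MAX_LEN, end, MAX_LEN)) + [end]:
        array_cells.add((sample, pos, start, end))

def query(q_start, q_end):
    """2D range query expanded leftward by MAX_LEN, plus an intersection filter."""
    return {
        (sample, start, end)
        for sample, pos, start, end in array_cells
        if q_start - MAX_LEN <= pos <= q_end      # expanded array slice
        and start <= q_end and end >= q_start     # drop false positives
    }

# The range [100, 3500] intersects [1500, 1600], but its end (3500) lies far
# outside the query; the anchor at position 1100 makes it discoverable.
print(query(1500, 1600))  # {('s1', 100, 3500)}
```

Note the post-filter on the true start/end attributes: the expanded query may catch cells of ranges that end just before the query starts, and those false positives are cheaply discarded.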
Representing variants as sparse arrays enabled us to create TileDB-VCF, a VCF customization built upon TileDB. The benefits you get are multiple:
TileDB's columnar layout and compression saves you up to 40% in space vs. your original collection of single-sample BCF (compressed) files.
TileDB allows you to efficiently slice numerous (in the order of millions) arbitrary genomic ranges across samples at once, even when your data is on AWS S3, due to its internal cloud-optimized, highly parallelized multi-range reads. This leads to a huge overall performance boost and tremendous $$ savings on the cloud.
TileDB-VCF comes with a Python API and efficient integration with Spark and Dask (via Apache Arrow), allowing you to run scalable analyses with familiar Data Science tooling.
Most importantly, TileDB's rapid updates, particularly suitable for object stores where everything is immutable (like AWS S3), allow you to add any number of samples to your dataset quickly, without worrying about the N+1 problem anymore.
In addition to the above, any feature (e.g., encryption, access control, integration with SQL engines, etc.), improvement or integration added to core TileDB is directly inherited by TileDB-VCF, keeping you always up to date with excellent performance and ease of use.
FASTQ is a very simple format that can be represented as a 1D dense array, where every cell is a "read", with attributes for the read itself (a small, typically 100-base, sequence over A, T, C and G), the corresponding quality for each base in the read, and potential metadata for the read. We expect that TileDB's columnar format, which splits the read bases (highly compressible) from their qualities (not well compressible), along with a careful selection of compressors, can lead to much better compression than gzipped text. Moreover, TileDB will allow efficient parallel access from AWS S3 due to its fast array slicing capabilities.
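The columnar split can be sketched as follows: store the read bases and the per-base qualities as two separate byte columns, so each can get its own compressor. Here zlib stands in for TileDB's per-attribute codecs, and the reads and qualities are synthetic:

```python
import zlib

# Toy sketch of per-column compression for FASTQ-like data.

reads = ["ACGTACGTGGCCTTAA" * 6 + "ACGT" for _ in range(100)]  # 100-base reads
quals = ["IIIIFFFF@@@@####" * 6 + "IIII" for _ in range(100)]  # matching quals

bases_col = "".join(reads).encode()  # 4-letter alphabet: highly compressible
quals_col = "".join(quals).encode()  # different statistics, own codec

bases_blob = zlib.compress(bases_col, 9)
quals_blob = zlib.compress(quals_col, 9)
print(len(bases_col), "->", len(bases_blob))  # the bases column shrinks a lot
assert zlib.decompress(bases_blob) == bases_col  # lossless round-trip
```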
BAM is a quite complicated format that can be represented as a 1D sparse array, mapping the FASTQ reads to specific positions in the global genomic position domain. Here, TileDB can offer benefits similar to the FASTQ case described above.