Storing Variants as Arrays

Variants and gVCF Files

A (g)VCF file contains information only about the potential differences between a sample genome and a reference genome, called variants. The genomes of any two individuals differ by only about 0.1%, so these differences are expected to be very few compared to the ~3 billion bases of the human genome. Single-sample VCF files, each describing one individual, can be combined into multi-sample (also called "population" or "combined") VCF files using specialized software such as bcftools.

Let's take a look at a 3-sample VCF file, i.e., a VCF file that encodes variant information for three samples (see the VCF specification for details).

##fileformat=VCFv4.3
##fileDate=20090805
##source=myImputationProgramV3.1
##reference=file:///seq/references/1000GenomesPilot-NCBI36.fasta
##contig=<ID=20,length=62435964,assembly=B36,md5=f126cdf8a6e0c7f379d618ff66beb2da,species="Homo sapiens",taxonomy=x>
##phasing=partial
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">
##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">
##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">
##FILTER=<ID=q10,Description="Quality below 10">
##FILTER=<ID=s50,Description="Less than 50% of samples have data">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002 NA00003
20 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,.
20 17330 . T A 3 q10 NS=3;DP=11;AF=0.017 GT:GQ:DP:HQ 0|0:49:3:58,50 0|1:3:5:65,3 0/0:41:3
20 1110696 rs6040355 A G,T 67 PASS NS=2;DP=10;AF=0.333,0.667;AA=T;DB GT:GQ:DP:HQ 1|2:21:6:23,27 2|1:2:0:18,2 2/2:35:4
20 1230237 . T . 47 PASS NS=3;DP=13;AA=T GT:GQ:DP:HQ 0|0:54:7:56,60 0|0:48:4:51,51 0/0:61:2
20 1234567 microsat1 GTC G,GTCT 50 PASS NS=3;DP=9;AA=G GT:GQ:DP 0/1:35:4 0/2:17:2 1/1:40:3

At the beginning there is a header that lists various metadata (e.g., file format, reference genome, the meaning of the encoded fields, etc.): the 18 meta-information lines starting with ## in the example above, followed by the column header line. After that, each line corresponds to a variant, which can be a single-base change (called a SNP), an insertion or deletion (called an INDEL), or a gVCF range stating the probability that some base in the range differs from the reference genome. Each line also contains the corresponding base values in the reference genome along with many other fields. The sample-specific information is located in the last 3 columns (NA00001, NA00002, NA00003).
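
To make the layout concrete, here is a minimal Python sketch (using no VCF library) that parses the first data line above into per-sample records, with the FORMAT column naming the per-sample fields. Real VCF columns are tab-separated; the example above uses spaces, and split() handles both:

# Parse one VCF data line into per-sample records (toy sketch, no VCF library).
line = ("20 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 "
        "GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,.")

fields = line.split()
chrom, pos, vid, ref, alt, qual, filt, info = fields[:8]
format_keys = fields[8].split(":")           # ['GT', 'GQ', 'DP', 'HQ']
samples = ["NA00001", "NA00002", "NA00003"]  # from the #CHROM header line

records = {}
for name, column in zip(samples, fields[9:]):
    # zip() tolerates trailing fields being dropped, which VCF permits.
    records[name] = dict(zip(format_keys, column.split(":")))

print(records["NA00003"])  # {'GT': '1/1', 'GQ': '43', 'DP': '5', 'HQ': '.,.'}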

Bioinformaticians run all sorts of analyses on collections of VCF data from small or large populations. The data can be used to discover disease associations or even to quantify the genetic evidence behind new treatments. A typical analysis retrieves variant information for specific genomic ranges (e.g., chr1:10-100, chr2:1000-1500), any subset of samples, and any subset of the VCF fields. Such a query may be repeated many times, even millions of times. Therefore, it is imperative that this information can be retrieved in a fast, scalable and cost-effective manner in order to facilitate and empower one's research.

The N+1 Problem

Suppose you wish to analyze a large collection of single-sample VCF files, say 100,000 of them. Slicing genomic regions across all those files is prohibitively costly in terms of performance, since your software has to "jump" to a random location in each of those files to retrieve the desired data. This leads to large latencies and high IO cost, which cripples overall performance (retrieving the data ends up being much slower than actually processing it).

This simple task was the motivation behind multi-sample VCF files, i.e., merging the entire collection of single-sample VCFs into a single file, where each line groups the information from all samples for a given genomic position or range. Such a line can be located very quickly via an index built on the file (e.g., with tabix), and the retrieval of the relevant data is reduced to a simple, fast linear scan of the line, minimizing latency and significantly boosting IO performance.
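
As a toy illustration of why the index helps, here is a plain-Python stand-in for what tools like tabix actually build (the byte offsets are made up). Locating the lines of a genomic region reduces to a binary search over the sorted line positions plus a short linear scan:

import bisect

# Hypothetical index: the sorted positions of the data lines in the example
# VCF above, paired with made-up byte offsets of those lines in the file.
positions = [14370, 17330, 1110696, 1230237, 1234567]
offsets = [1024, 1161, 1298, 1450, 1587]

def locate(start, end):
    """Return the byte offsets of the lines whose position lies in [start, end]."""
    i = bisect.bisect_left(positions, start)  # binary search, O(log n)
    hits = []
    while i < len(positions) and positions[i] <= end:
        hits.append(offsets[i])               # short linear scan
        i += 1
    return hits

print(locate(17000, 1200000))  # [1161, 1298] -> lines at positions 17330, 1110696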

This improvement comes at a great cost. Consider the following three single-sample VCF files, omitting most of the fields.

s1.vcf:
#CHROM POS ... s1
1 5 ... <entry_1>
1 10 ... <entry_2>

s2.vcf:
#CHROM POS ... s2
1 6 ... <entry_1>
1 15 ... <entry_2>

s3.vcf:
#CHROM POS ... s3
1 9 ... <entry_1>
1 17 ... <entry_2>

Ignoring the rest of the fields, each of those files has 2 entries in its sample column (s1, s2, s3). Now, merging s1.vcf and s2.vcf, you get something like this:

#CHROM POS ... s1 s2
1 5 ... <entry_1> <entry_2>
1 6 ... <entry_3> <entry_4>
1 10 ... <entry_5> <entry_6>
1 15 ... <entry_7> <entry_8>

From a total of 4 entries in the sample columns of the two single-sample VCF files, we now observe 8 entries in the merged file s1_s2.vcf. If we merge all three files, we get:

#CHROM POS ... s1 s2 s3
1 5 ... <entry_1> <entry_2> <entry_3>
1 6 ... <entry_4> <entry_5> <entry_6>
1 9 ... <entry_7> <entry_8> <entry_9>
1 10 ... <entry_10> <entry_11> <entry_12>
1 15 ... <entry_13> <entry_14> <entry_15>
1 17 ... <entry_16> <entry_17> <entry_18>

From a total of 6 entries in the sample columns of the three single-sample VCF files, we now observe 18 entries in the merged file s1_s2_s3.vcf! In the general case, the size of a multi-sample VCF file scales superlinearly with the number of samples it encompasses. The problem is essentially that the single-sample VCF files contain very sparse data, which the multi-sample VCF file densifies by adding dummy information to the sample columns in order to comply with the VCF format. This means that the multi-sample VCF solution is not scalable: it can lead to a storage overhead explosion and a very high merging cost for very large population studies.
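
A quick back-of-the-envelope computation makes the growth explicit, under the simplifying assumption (as in the toy example) that the variant positions of different samples are mostly distinct:

# Sample-column entries in a merged VCF of N samples with V variants each,
# assuming the variant positions of different samples are distinct.
def merged_entries(num_samples, variants_per_sample):
    merged_lines = num_samples * variants_per_sample  # union of all positions
    return merged_lines * num_samples                 # one entry per sample per line

print(merged_entries(2, 2))  # 8, vs 4 entries in the two single-sample files
print(merged_entries(3, 2))  # 18, vs 6 entries in the three single-sample files
print(merged_entries(100_000, 1_000_000))  # 10^16 entries: quadratic blow-up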

Another problem is that a multi-sample VCF file cannot be updated incrementally. Because the sample-related information is listed in the last columns of the file, inserting a new sample requires creating a new column and injecting values at the end of every line. This effectively means that a brand new multi-sample VCF needs to be generated. If the multi-sample file is large (typically on the order of many GBs or even TBs), the insertion process will be extremely slow.

The two aforementioned issues with the multi-sample VCF files, storage explosion and lack of data update support, are referred to as the "N+1 problem". So how do we solve this problem?

Variants as Sparse Arrays

Representing population variants is in fact a 2D problem. Imagine a 2D plane, where the vertical axis is the samples (it can be unbounded) and the horizontal axis is the global domain of genomic positions, which has a range of about 3 billion positions for the human genome. The figure below shows an example with 3 samples, s1, s2, and s3. Every variant can be represented as a range: it can be unary (i.e., a SNP), in which case we draw it as a dot, or it can be a longer range (e.g., an INDEL or gVCF range), depicted as a horizontal line segment. Each row in this plane contains all the information of a single-sample VCF file. The access pattern during analysis is typically one or more rectangles, each covering a set of samples and a range of genomic positions. The results comprise every variant that intersects the rectangle, namely v1, v2, v4 and v7 in the figure (depicted in blue).
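
To make the model concrete, here is a plain-Python sketch of the 2D plane; the sample names and coordinates are made up, chosen only so that the result matches the figure. Each variant is a (sample, start, end) range, and a query rectangle must return every range intersecting it:

# Variants as ranges on the 2D plane: one row per sample, one range per variant.
variants = [
    ("s1", 300, 1450, "v1"),    # long gVCF range
    ("s1", 1500, 1500, "v2"),   # SNP (unary range)
    ("s2", 1600, 1650, "v4"),   # short range, e.g. an INDEL
    ("s3", 1550, 1550, "v7"),   # SNP
    ("s3", 3000, 3100, "v9"),   # range outside the query below
]

def rectangle_query(variants, samples, pos_start, pos_end):
    """Full scan returning every range of the given samples intersecting
    [pos_start, pos_end] -- used here only to define the correct answer."""
    return sorted(name for sample, start, end, name in variants
                  if sample in samples and start <= pos_end and end >= pos_start)

print(rectangle_query(variants, {"s1", "s2", "s3"}, 1400, 1700))
# ['v1', 'v2', 'v4', 'v7']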

Now suppose we represent the start position of each range as a non-empty cell in a 2D sparse array. We illustrate the start positions in the figure below as squares. In each of those array cells, we can store as array attributes the end position of the range, as well as all the other fields of the corresponding variant in the single-sample VCF file (e.g., REF, ALT, etc.). Therefore, for every sample, we can map variants to the non-empty cells of a 2D sparse array. Issuing the query rectangle as a 2D range query on the array, we get as results the non-empty cells that fall inside the rectangle, namely v2, v4, and v7 in the figure. However, this approach misses intersecting ranges whose start positions fall outside the query, such as v1 below (in red). Therefore, we are not done yet; we need to perform one more tweak.

Suppose we choose a configuration parameter that determines the maximum length of a stored gVCF range. We pick 1000 bases as a default after some empirical fine-tuning, but this is fully configurable by the user. We then "cut" each gVCF range that exceeds 1000 bases into 1000-base-long pieces, introducing "anchors" as shown in the figure, which duplicate the information of the variant. We store those anchors as non-empty array cells as well. Finally, we issue the query after expanding its start position by exactly 1000 bases to the left. Since consecutive stored cells of any range are at most 1000 bases apart, the expanded query is guaranteed to include the start position of either an intersecting range or an anchor of an intersecting range (e.g., v1).
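
The following sketch puts the pieces together in plain Python, as a stand-in for a real sparse-array engine (the dataset and coordinates are the same made-up ones as above). Only start positions and anchors are stored as cells, each carrying the range's end position as an attribute, and the query start is expanded to the left by the anchor gap:

ANCHOR_GAP = 1000  # configurable maximum distance between consecutive cells

# Same made-up dataset as above: one (sample, start, end, name) per variant.
variants = [
    ("s1", 300, 1450, "v1"),
    ("s1", 1500, 1500, "v2"),
    ("s2", 1600, 1650, "v4"),
    ("s3", 1550, 1550, "v7"),
    ("s3", 3000, 3100, "v9"),
]

# Sparse cells: the true start of every range, plus an anchor every ANCHOR_GAP
# bases for ranges exceeding the gap. Each cell duplicates the end position
# (and, in a real system, the remaining VCF fields as attributes).
cells = []
for sample, start, end, name in variants:
    pos = start
    while pos <= end:
        cells.append((sample, pos, end, name))
        pos += ANCHOR_GAP

def expanded_query(cells, samples, pos_start, pos_end):
    """2D range query on the stored cell positions, with the start expanded
    left by ANCHOR_GAP, followed by a filter on the end position."""
    results = set()
    for sample, pos, end, name in cells:
        if (sample in samples
                and pos_start - ANCHOR_GAP <= pos <= pos_end
                and end >= pos_start):  # keep only truly intersecting ranges
            results.add(name)           # multiple anchor hits dedupe to one
    return sorted(results)

print(expanded_query(cells, {"s1", "s2", "s3"}, 1400, 1700))
# ['v1', 'v2', 'v4', 'v7'] -- v1 is recovered via its anchor at position 1300

The final filter on the end position is what discards false positives, i.e., ranges whose stored cells fall inside the expanded window but which actually end before the query starts.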

To summarize: by representing variants as non-empty cells in a 2D sparse array, with the additional use of anchors and expanded queries, we have modeled population variants as 2D sparse arrays. Note that, regardless of the number of samples, we inject no additional information other than the anchors, whose overhead turns out to be negligible for real datasets. In other words, this solution leads to storage that is linear in the number of samples, and is therefore scalable. And if you store the variants in a powerful array engine that offers rapid updates, then you have just solved the N+1 problem! Enter TileDB...