Ingestion Algorithm

The first step in using TileDB-VCF for efficient access to variant data is to convert an existing set of VCF (or BCF) files into the TileDB-VCF format, a process called "ingestion." The ingestion process takes as input a list of VCF files and produces a new or updated TileDB-VCF dataset in the sparse array format (described in Data Model).

One of the key differentiators of TileDB-VCF is that updating an existing dataset with new VCF files is completely seamless and sidesteps the so-called "N+1" problem present in other systems. The TileDB-VCF ingestion process scales linearly with the number of samples being ingested, not the number of samples already present in the dataset.

Additionally, due to the parallel writes feature supported by the underlying TileDB storage engine, ingestion can be scaled out across multiple independent processes or nodes with zero inter-process communication required.

There are three steps to the ingestion process:

  1. Dataset creation (called the "create" step)

  2. New sample registration ("register")

  3. New sample ingestion ("store")

The create step creates an empty TileDB-VCF dataset with a potentially customized array schema. Registration assigns row IDs to the new samples, and updates the dataset metadata. Finally, the store step performs the actual ingestion of the VCF records from the new samples.

While it would be possible to combine these three steps, TileDB-VCF implements them as separate stages in order to allow for distributed ingestion (i.e. using multiple independent machines to ingest samples simultaneously to a single dataset).

Each of the three steps is explained in more detail below.

Create

The first step to VCF ingestion is the creation of an empty TileDB-VCF dataset. This process, corresponding to the tiledbvcf create CLI invocation, simply creates an empty dataset on disk (or another location such as S3) with the array schema and directory structure.

If you have read the section on the TileDB-VCF data model, you may recall that different VCF fields can be extracted as separate attributes (i.e., dataframe columns) in the dataset. Because the data is stored in a columnar fashion, extracting VCF fields as separate attributes can improve read performance, since less data may need to be read from persistent storage to perform a particular query.

Which VCF fields to extract as separate attributes must be decided at dataset creation time. Once the dataset has been created, the schema (including the list of attributes) is immutable.

There are several other parameters that must be set at dataset creation time if non-default values are desired:

  • Tile capacity

  • Row tile extent

  • Anchor gap

These three parameters can impact read performance. Choosing good values for them is a nontrivial task and depends heavily on the specific read queries you will be performing on the ingested dataset. The default values have been chosen as a good balance between ingestion performance, read performance, and the resulting size of the ingested dataset.
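
As a concrete illustration, the sketch below creates a dataset with one extra extracted attribute and non-default values for two of these parameters, using the tiledbvcf Python bindings. The dataset URI and the exact keyword arguments shown for create_dataset are assumptions that may differ between versions; consult the API reference (or the corresponding tiledbvcf create CLI options) for the precise interface.

import tiledbvcf

# Open a dataset handle in write mode; the URI may be a local path or an
# object-store URI such as s3://... (names here are illustrative).
ds = tiledbvcf.Dataset("my_vcf_dataset", mode="w")

ds.create_dataset(
    extra_attrs=["fmt_DP"],  # extract FORMAT/DP as its own materialized attribute
    tile_capacity=10_000,    # tile capacity of the underlying sparse array
    anchor_gap=1_000,        # interval at which long gVCF records receive anchor cells
)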

Register

The second step of ingestion is registering the new samples, which must happen before they can be ingested. New samples can be registered into an empty dataset or into a dataset that already contains ingested samples.

The registration process is straightforward: given a list of VCF files to register, TileDB-VCF downloads each VCF header (to determine the sample name and contig structure, among other things) and updates the dataset metadata for the new samples.

If you recall from the Data Model section, each sample in a TileDB-VCF dataset has a unique integer ID corresponding to its row index in the 2D sparse array. The most important component of registration is the assignment of row index values to the samples being ingested. The mapping of sample name to row index is stored in the dataset metadata.
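
Conceptually, the row-index assignment works like the following sketch (an illustration only, not the actual TileDB-VCF implementation), which assumes existing row indexes are contiguous starting at 0:

def register_samples(metadata, new_sample_names):
    # metadata holds a "sample_row" map of sample name -> row index.
    sample_row = metadata.setdefault("sample_row", {})
    next_row = len(sample_row)
    for name in new_sample_names:
        if name in sample_row:
            continue  # already registered; keep its existing row index
        sample_row[name] = next_row
        next_row += 1
    return sample_row

# Registering two new samples into a dataset that already contains two:
meta = {"sample_row": {"sampleA": 0, "sampleB": 1}}
print(register_samples(meta, ["sampleC", "sampleD"]))
# {'sampleA': 0, 'sampleB': 1, 'sampleC': 2, 'sampleD': 3}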

One important note is that all samples being ingested must have identical VCF headers with respect to their contig definitions: every sample must list the exact same set of contigs, in the same order. This is because the mapping from contig to global genomic position must be consistent across all samples in order for the export algorithm to work correctly. The concept of global genomic position is explained in the Data Model section.
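
This constraint can be verified up front with a short script. The sketch below uses pysam (an external dependency, not part of TileDB-VCF) and hypothetical file names to compare contig names, lengths, and ordering across headers:

import pysam

def contig_signature(vcf_path):
    # (name, length) pairs in the order the contigs appear in the header.
    with pysam.VariantFile(vcf_path) as vcf:
        return [(name, vcf.header.contigs[name].length) for name in vcf.header.contigs]

def check_contigs_match(vcf_paths):
    reference = contig_signature(vcf_paths[0])
    for path in vcf_paths[1:]:
        if contig_signature(path) != reference:
            raise ValueError(f"contig structure of {path} differs from {vcf_paths[0]}")

check_contigs_match(["sample1.vcf.gz", "sample2.vcf.gz"])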

Because registration assigns unique row indexes for new samples, it is an inherently serial process. This means multiple registration processes cannot be performed in parallel to the same dataset.

Store

Once a set of new VCF samples has been registered in a dataset, they can be ingested, or stored, into the dataset.

Algorithm

A simplified version of the ingestion algorithm proceeds as follows (ignoring memory constraints, parallelization, and "anchor" cells):

  1. Open the underlying TileDB array

  2. Push the first VCF record of each file into the "record heap." The record heap is a min-heap, sorting the records across samples by their gVCF END position.

  3. While the heap is not empty:

    1. Pop the top record from the heap

    2. Copy VCF field values into columnar buffers

    3. Insert the next record from that sample (if any) into the record heap

  4. Submit the TileDB write query using the prepared attribute buffers

  5. Close the TileDB array

Recall from the data model section that the column dimension is the END position of the records, and that there is configurable space tiling across the rows (samples) dimension. This means the global ordering of VCF records is sorted on END position across the samples within a single space tile. The record heap provides this sorting across the samples being ingested, which in turn allows the TileDB write queries to be submitted in global order. Additionally, due to the space tiling, the algorithm proceeds in batches of samples, where one batch corresponds to one space tile; for example, with the default space tiling of 10 rows, samples are ingested in batches of 10.
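
The merge described in the simplified algorithm above can be sketched in a few lines of Python. This is only an illustration of the heap-based merge over toy in-memory records (a record here is a (start, end, fields) tuple and the column buffers are plain lists), not the actual implementation:

import heapq

def ingest(samples):
    # samples: dict mapping row index -> list of (start, end, fields) records,
    # each list sorted by position and disjoint within the sample.
    iters = {row: iter(records) for row, records in samples.items()}
    columns = {"sample": [], "start": [], "end": [], "fields": []}
    heap = []

    # Step 2: seed the heap with the first record of each sample. The key
    # (end, row) is unique because records within a sample are disjoint.
    for row, it in iters.items():
        rec = next(it, None)
        if rec is not None:
            heapq.heappush(heap, (rec[1], row, rec))

    # Step 3: pop records in global (END-sorted) order, refilling the heap
    # from the sample the popped record came from.
    while heap:
        end, row, (start, _, fields) = heapq.heappop(heap)
        columns["sample"].append(row)
        columns["start"].append(start)
        columns["end"].append(end)
        columns["fields"].append(fields)
        nxt = next(iters[row], None)
        if nxt is not None:
            heapq.heappush(heap, (nxt[1], row, nxt))

    # Step 4 would submit `columns` to TileDB as a global-order write query.
    return columns

records = {
    0: [(100, 150, {"GT": "0/0"}), (150, 300, {"GT": "0/1"})],
    1: [(90, 200, {"GT": "1/1"})],
}
print(ingest(records)["end"])  # [150, 200, 300]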

One important additional detail is the ingestion of anchor cells. Anchor cells are dummy cells materialized in the underlying array to ensure that all records intersecting a genomic region are reported during export. During ingestion, "long" gVCF records are broken up by inserting anchor cells at regular intervals. The size of this interval is called the anchor gap, a parameter that can be set during dataset creation.

Whenever a record is inserted into the record heap, as shown in the simplified algorithm above, zero or more anchor cells are also inserted. The number of anchor cells for a record is floor((end_pos - start_pos - 1) / anchor_gap). One anchor cell is then inserted into the record heap for each i = 1 .. num_anchors, with END position start_pos + i * anchor_gap. In effect, this ensures that long gVCF records are always "broken up" by anchor cells lying within the gVCF range. Records with gVCF length less than the anchor gap do not need any anchors.
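
The anchor arithmetic is easy to sketch (the positions and anchor gap below are arbitrary example values):

def anchor_end_positions(start_pos, end_pos, anchor_gap):
    # floor((end_pos - start_pos - 1) / anchor_gap) anchors, spaced anchor_gap apart.
    num_anchors = (end_pos - start_pos - 1) // anchor_gap
    return [start_pos + i * anchor_gap for i in range(1, num_anchors + 1)]

# A long gVCF record spanning [100, 3600) with an anchor gap of 1000 gets 3 anchors:
print(anchor_end_positions(100, 3600, 1000))  # [1100, 2100, 3100]
# A record shorter than the anchor gap gets none:
print(anchor_end_positions(100, 600, 1000))   # []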

When popping records from the record heap, anchors are treated like normal records, with the attribute values of the record "containing" the anchor used as the anchor's attribute values. The only difference is that the REAL_END attribute of an anchor cell stores the END position of the containing record, which is used during export to report the correct record rather than the anchor.

There are several additional optimizations (not elaborated on here) that complicate the implementation relative to the simplified algorithm above, such as:

  • Buffering chunks of VCF records from the VCF samples so that a file I/O operation is not required for every record

  • Using fixed-size attribute buffers and submitting a TileDB query when the buffers fill up

  • Splitting the parsing work across multiple threads

  • Asynchronously downloading the next batch of samples from S3 (if necessary) while the current batch is being ingested

Because the sample registration process happens beforehand, a large list of samples can be ingested independently (e.g. using multiple machines) by simply partitioning the list of samples and running one store process per partition. The partitioned store processes can run in parallel, from multiple machines, against the same dataset. This is only possible because of the separate, explicit registration step.
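
For example, a driver script might split the registered sample list into contiguous chunks and hand each chunk to its own store process. The partitioning itself is trivial; the sketch below is only an illustration (the sample names are hypothetical):

def partition(samples, num_partitions):
    # Split the list into num_partitions contiguous, near-equal chunks.
    size, rem = divmod(len(samples), num_partitions)
    parts, offset = [], 0
    for i in range(num_partitions):
        length = size + (1 if i < rem else 0)
        parts.append(samples[offset:offset + length])
        offset += length
    return parts

all_samples = [f"sample{i}.vcf.gz" for i in range(100)]
for i, part in enumerate(partition(all_samples, 4)):
    # Each chunk would be passed to its own store invocation, e.g. on its own machine.
    print(f"partition {i}: {len(part)} samples")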

Requirements

Ingestion can be a memory- and CPU-intensive process, due to the large amounts of data that must be held in memory and the compression/decompression of that data.

The store phase has been written to take advantage of as many CPUs as the system has, and in general you will see higher performance simply by moving the store process to a machine with more CPUs, without tweaking any parameters.

Memory consumption of the store phase is driven largely by two factors: buffering of VCF records from the samples being ingested (to reduce file I/O during ingestion), and the allocation size of the attribute buffers used for TileDB query submissions. There are several important notes on tuning memory consumption:

  • The memory allowance for attribute buffers is 1GB per ingestion thread. That means with N threads, the minimum memory requirement is N GB.

  • Additionally, the buffered VCF records require approximately 300 bytes per record. The size of the record buffer can be changed to adjust the memory usage for record buffering. The thread task size parameter also impacts the max number of records that will need to be buffered at any given time.

A safe rule of thumb is therefore to budget 2-3 GB of memory per ingestion thread for the store process. In general, to reduce memory requirements, reduce the per-thread memory limit and the thread task size. This should keep performance relatively constant (due to better load balancing with smaller thread task sizes). However, too small a thread task size introduces significant overhead from repeatedly opening and seeking in the VCF files with htslib, and may reduce performance.
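
A back-of-the-envelope estimate based on the figures above might look like the following sketch (the constants are the defaults quoted above; the function is illustrative, not a TileDB-VCF API, and the example inputs are arbitrary):

def estimated_store_memory_gb(num_threads, buffered_records):
    attribute_buffers_gb = 1.0 * num_threads            # ~1 GB of attribute buffers per thread
    record_buffer_gb = buffered_records * 300 / 1e9     # ~300 bytes per buffered VCF record
    hard_floor_gb = attribute_buffers_gb + record_buffer_gb
    rule_of_thumb_gb = (2 * num_threads, 3 * num_threads)  # recommended overall budget
    return hard_floor_gb, rule_of_thumb_gb

floor_gb, (low_gb, high_gb) = estimated_store_memory_gb(num_threads=8, buffered_records=2_000_000)
print(f"minimum ~{floor_gb:.1f} GB; recommended budget {low_gb}-{high_gb} GB")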

For disk space, the store process requires scratch space to hold the VCF samples being ingested, if the samples are being ingested from S3. The available scratch disk space should be enough to store 2 * space_tile samples at once; with the default space tile of 10, this means enough room for 20 samples.

Fragments

A single ingestion process performs a single global-order TileDB write query, which means one fragment in the data array is created per store invocation. This is true regardless of whether the store command is ingesting 1, 10, 1000, or an arbitrary number of samples.

While the impact of TileDB array fragments on read/export performance is an advanced topic in general, for the constrained case of TileDB-VCF it can often be advantageous to ingest samples such that one fragment is created per space tile (i.e. 10 samples per store invocation).
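
In practice this just means splitting the registered sample list into groups of space_tile samples and running one store invocation per group, along these lines (an illustrative sketch with hypothetical sample names):

SPACE_TILE = 10  # the default row tile extent

def space_tile_batches(samples, space_tile=SPACE_TILE):
    # Each batch corresponds to one store invocation and therefore one fragment.
    return [samples[i:i + space_tile] for i in range(0, len(samples), space_tile)]

batches = space_tile_batches([f"sample{i}.vcf.gz" for i in range(95)])
print(len(batches))  # 10 store invocations -> 10 fragments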

Because TileDB persists metadata per fragment which is loaded into main memory when reading data intersecting that fragment, fragments containing large numbers of samples may require loading a large amount of metadata that is irrelevant to the VCF export at hand. For example, if the fragment contains 1,000 samples but you are trying to export only 1 sample, the metadata for the other 999 samples must still be loaded into main memory. Constraining the size of the fragments to a smaller number of samples reduces the "wasted" metadata that will be loaded into memory when exporting a sample.

The TileDB-VCF read algorithm proceeds per space tile, opening and closing the TileDB array in between space tiles. This allows the in-memory fragment metadata to be unloaded in between space tiles, which can significantly lower the memory requirements of the export process. In other words, only the metadata of the "current" fragment is held in memory at any given time. With small fragments, this allows exporting across many samples (even the whole array) in an out-of-core fashion.

Limitations

There are some limitations on the VCF data that can be ingested with TileDB-VCF (a quick pre-ingestion check for the overlap and sort-order requirements is sketched after the list). These are:

  • There can be no gVCF overlaps in a sample. This means all gVCF records in any given sample must be completely disjoint in terms of their start and end positions. Across samples, records can overlap freely -- the limitation is within a single sample.

  • VCF samples must be sorted and indexed prior to ingestion (using e.g. bcftools).

  • All samples in any given TileDB-VCF dataset must have the same VCF header structure (same contigs with the same names and lengths, listed in the same order).
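
The within-sample ordering and overlap constraints above can be checked before ingestion with a short script. The sketch below uses pysam (an external dependency, not part of TileDB-VCF) and a hypothetical file name; it does not check for an index or for header consistency across samples:

import pysam

def check_sorted_and_disjoint(vcf_path):
    prev = {}  # contig -> (start, stop) of the previous record seen on that contig
    with pysam.VariantFile(vcf_path) as vcf:
        for rec in vcf:
            # rec.start/rec.stop are 0-based, half-open; rec.stop reflects the gVCF END.
            start, stop = rec.start, rec.stop
            if rec.chrom in prev:
                prev_start, prev_stop = prev[rec.chrom]
                if start < prev_start:
                    raise ValueError(f"{vcf_path}: records on {rec.chrom} are not sorted")
                if start < prev_stop:
                    raise ValueError(f"{vcf_path}: overlapping gVCF records on {rec.chrom}")
            prev[rec.chrom] = (start, stop)

check_sorted_and_disjoint("sample1.vcf.gz")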

If these assumptions are not upheld, you will see an error message during the store phase about the TileDB global order being violated, e.g.:

terminate called after throwing an instance of 'tiledb::TileDBError'
what(): [TileDB::Writer] Error: Write failed; Coordinates (0,1061605964) succeed (0,1061605960) in the global order

A nicer error message is unfortunately not possible without significantly impacting performance during ingestion.