Read Algorithm

One of the primary motivations for storing variant-call data with TileDB-VCF is its high-performance reads. The reading process in TileDB-VCF entails the following: given a list of sample names and a list of genomic regions, return all VCF records from the specified samples that intersect any of the specified genomic regions. In addition, you can choose to read only a subset of the VCF fields.

For reading large regions, or BED files containing many small regions, TileDB-VCF greatly outperforms existing solutions such as bcftools. With a dataset resident on shared storage such as S3, reading can also be completely parallelized or distributed across multiple machines.

The resulting records can either be written to a file (in tab-separated text or VCF/BCF format) or copied to a set of in-memory buffers for further processing. The command-line interface provides the mechanism to export to files, whereas the TileDB-VCF APIs (such as Python, Dask, and Spark) provide the interface to receive the records in memory.
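As a concrete example, reading into memory with the TileDB-VCF Python API looks roughly like the following sketch (the dataset URI, sample names, and regions are placeholders):

    import tiledbvcf

    # Open an existing TileDB-VCF dataset for reading
    ds = tiledbvcf.Dataset("my-vcf-dataset", mode="r")

    # Return all records from the given samples that intersect any of
    # the given regions, reading only three of the exportable fields
    df = ds.read(
        attrs=["sample_name", "pos_start", "pos_end"],
        samples=["sampleA", "sampleB"],
        regions=["chr1:100-200", "chr2:500-600"],
    )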

In addition to performing the region intersection task, you can always extract the original VCF records for any sample simply by specifying an empty regions list, which is interpreted to mean "all records."

Regardless of whether you read into memory or export to a file, the reading algorithm is the same. The main idea behind reading from a 2D sparse array is presented in our Genomics use case docs. We recommend reading those docs before proceeding to the algorithmic details. Of particular importance is the use of "anchors" and the query subarray expansion to correctly locate the gVCF range intersections.

Algorithm

With an understanding of anchor cells and subarray expansion, the reading algorithm can be described at a high level by the following steps. Several performance- and memory-management-related details are omitted for clarity.

  1. Prepare the TileDB query subarray:

    1. Convert the samples to be read to a bounding range in row coordinates.

    2. Sort the regions to be read.

    3. Convert each region to be read into a column range in global genomic position. Widen each column range by adding the anchor gap value g to its end coordinate.

    4. Construct a multi-range TileDB subarray with the row coordinate (sample ID) range and the multiple widened column ranges.

  2. Allocate attribute buffers for the desired VCF fields and submit the TileDB query.

  3. For each cell in the query results:

    1. Find the list of original regions that intersect the cell (may be none).

    2. Report the cell once for every intersecting original region.

There are quite a few details hidden in the above sketch, so we will examine several steps in more detail separately.

Query Subarrays

At the lowest level, TileDB-VCF must convert the list of samples and regions being read into one or more TileDB read queries on the underlying sparse array. The first task is to prepare the subarrays for the TileDB queries that will be made.

Due to the data locality in the underlying TileDB array, multiple read queries are made, separated by the space tiles of the samples being read. For example, if samples 0, 1, 2, and 14 are requested, instead of using the single bounding interval from row 0 to 14, TileDB-VCF batches based on the space tiling. In the default case (space tile extent of 10), these samples would be batched into two TileDB queries, one with row range [0, 2] (inclusive) and a second with row range [14, 14] (inclusive). Each of these separate queries is run through the read algorithm independently.
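To make the batching concrete, here is a minimal Python sketch of grouping requested sample rows by space tile (the function is illustrative and not part of the TileDB-VCF API):

    from collections import defaultdict

    def batch_sample_rows(rows, tile_extent=10):
        """Group requested sample row IDs by space tile and return one
        bounding (inclusive) row range per tile."""
        tiles = defaultdict(list)
        for row in rows:
            tiles[row // tile_extent].append(row)
        return [(min(r), max(r)) for _, r in sorted(tiles.items())]

    # Samples 0, 1, 2, and 14 yield two row ranges: [(0, 2), (14, 14)]
    print(batch_sample_rows([0, 1, 2, 14]))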

Next, due to the expansion step for each region, there is frequently a scenario where the newly-widened column regions may overlap. For example, suppose the two regions chr1:100-101 and chr1:150-151 are specified for reading, and the anchor gap g is the default value of 1000. The expanded regions become [100, 1101] and [150, 1151] (inclusive column ranges), which have significant overlap. The "merging" step of the algorithm groups these two ranges into a single range [100, 1151] that will be used as the column range for the TileDB query.
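The expansion and merging step can be sketched as plain interval arithmetic (the real implementation also maps contig-relative positions to global genomic positions):

    def expand_and_merge(regions, anchor_gap=1000):
        """Widen each inclusive (start, end) range by the anchor gap,
        then merge overlapping or adjacent ranges."""
        widened = sorted((start, end + anchor_gap) for start, end in regions)
        merged = [widened[0]]
        for start, end in widened[1:]:
            if start <= merged[-1][1] + 1:
                merged[-1] = (merged[-1][0], max(merged[-1][1], end))
            else:
                merged.append((start, end))
        return merged

    # [(100, 101), (150, 151)] with the default gap merges to [(100, 1151)]
    print(expand_and_merge([(100, 101), (150, 151)]))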

Finally, TileDB-VCF forms the read queries that will be issued to TileDB using the row and column ranges just constructed, using TileDB's "multi-range subarray" read queries. This allows TileDB-VCF to submit a single TileDB query with all column ranges, and process the incomplete results until the query is completed. Constructing the query in this way simplifies the state that must be tracked in TileDB-VCF.
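For intuition, the equivalent multi-range read in TileDB's Python API looks roughly like the following (TileDB-VCF issues its queries through the C++ API, so this is only an analogy; the array URI and the second column range are made up):

    import tiledb

    with tiledb.open("my-sparse-array") as A:
        # A single query over one row range and multiple column ranges;
        # all ranges passed to multi_index are inclusive on both ends
        results = A.multi_index[[(0, 2)], [(100, 1151), (5000, 6000)]]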

Buffer Allocation

Careful memory management during reading is a significant challenge, as the dataset and read results are frequently far too large to fit into main memory at one time. One area where this manifests is the allocation of the attribute query buffers that TileDB-VCF uses to receive data read from the TileDB array.

In an effort to make memory consumption more predictable, the reading process has a configuration parameter controlling the "memory budget." The memory budget parameter b (in units of MB) is distributed as follows:

  • b/2 MB are given to TileDB itself for its own internal usage via the sm.memory_budget{_var} config parameters.

  • The remaining b/2 MB are used for attribute buffer allocation.

  • The reading algorithm uses double buffering to allow for latency hiding (submitting the next TileDB query while processing the results of the previous one). This means each set of attribute buffers gets b/4 MB.

  • The minimum number of attributes for the read algorithm to work is 3: pos, real_end, and TILEDB_COORDS. Some of these attributes are fixed-length and some are variable-length (requiring a separate offset buffer to be allocated).

  • If the number of separate buffers to be allocated is n (encompassing attribute data and offset buffers), then each buffer gets an allocation of (b / 4) / n MB.

You can use this to work backwards if you want to control the size of the attribute buffers being allocated. If you want to target an allocation of 100MB per attribute buffer, then (b/4) / n = 100 and therefore you should set b = 400 * n MB.
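For example, a small helper that works the budget backwards (the buffer counts here are illustrative; the exact count depends on which attributes are fixed-length, variable-length, or nullable):

    def required_budget_mb(num_buffers, target_mb_per_buffer=100):
        """Return the budget b such that (b / 4) / n equals the target."""
        return 4 * target_mb_per_buffer * num_buffers

    # e.g. 3 fixed-length attributes plus 2 variable-length attributes,
    # where each variable-length attribute needs a data + offsets buffer
    n = 3 + 2 * 2
    print(required_budget_mb(n))  # 2800 MB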

Cell Processing

After submitting a read query to TileDB, the cells must be processed individually by TileDB-VCF to identify which genomic regions each cell intersects with (if any). The cell is reported once for each intersection, meaning the same cell may need to be reported multiple times if multiple genomic regions intersect it. It can easily happen that a cell does not intersect any of the genomic regions being read, in which case it is discarded.

Additionally, some cells in the results will be anchor cells. The "real" cell corresponding to the END position of the containing VCF record should be reported, not the anchor itself. In the event that multiple anchor cells for the same VCF record are in the query results, we must avoid duplicate reporting of the cell.

The cell processing algorithm in pseudocode is as follows:

    for cell c in query results:
        // Skip duplicate anchor cells in query results
        if last_reported[c.sample_id] == c.real_end:
            continue

        // Get all the query regions intersecting the cell
        i, j = get_intersecting_regions(c.start, c.real_end)
        // Report all intersections
        for k = i to j:
            report_cell(c, k)
        // Update map to prevent duplicate reporting
        last_reported[c.sample_id] = c.real_end

The last_reported map stores, for each sample ID, the real_end attribute value of the most recently reported cell. Because anchor cells use the real_end attribute to store the END position of the VCF record containing them, this check avoids reporting the record again once its first anchor cell (or the END cell itself) has been reported.

The get_intersecting_regions() function returns two indexes into the sorted list of original genomic regions being read (i.e., not the expanded or merged regions). Because the regions are sorted, the interval i to j (which may be empty if no original regions intersect the cell) can be iterated over directly, reporting one intersection of cell c with each intersecting original genomic region k.

Because the list of original genomic regions may be large (millions of regions), get_intersecting_regions() performs an efficient search to determine the interval, aided by the sorted property of the list.
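A minimal runnable sketch of such a search over sorted, non-overlapping, inclusive (start, end) regions (an illustration of the idea, not the actual TileDB-VCF implementation):

    import bisect

    def get_intersecting_regions(regions, cell_start, cell_end):
        """Return the half-open index range [i, j) of regions that
        intersect the inclusive interval [cell_start, cell_end]."""
        starts = [start for start, _ in regions]
        ends = [end for _, end in regions]
        # First region ending at or after cell_start...
        i = bisect.bisect_left(ends, cell_start)
        # ...through the last region starting at or before cell_end
        j = bisect.bisect_right(starts, cell_end)
        return i, j

    regions = [(100, 101), (150, 151), (300, 400)]
    # A record spanning [120, 320] intersects the last two regions
    print(get_intersecting_regions(regions, 120, 320))  # (1, 3)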

Reading Formats

TileDB-VCF supports two main modes for reading the cells (VCF records): export to disk and read in memory.

Export to Disk

The to-disk export produces one output file per sample being exported, in VCF/BCF or TSV (tab separated) format. This export mode is self-explanatory; the output file produced for a sample contains the VCF records from the sample that intersect the genomic regions that were specified to be exported. Note that the output is not guaranteed to be sorted in any way.
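For example, a CLI export might look like the following (the dataset URI and sample names are placeholders, and flag spellings may vary across TileDB-VCF versions):

    tiledbvcf export \
        --uri my-vcf-dataset \
        --sample-names sampleA,sampleB \
        --regions chr1:100-200,chr2:500-600 \
        --output-format b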

Read In-memory

This mode copies the VCF attributes into a set of user-specified in-memory buffers, and is used by the high-level APIs such as Python and Spark to read intersecting VCF records into a dataframe that can be manipulated programmatically.

When copying VCF data into the user-specified buffers, the Apache Arrow format is used to allow for efficient zero- or near-zero-copy conversion into higher-level representations such as Pandas or Spark dataframes. To that end, a specific schema is used for the various VCF fields and attributes that can be read.

The in-memory schema is similar to, but not the same as, the underlying TileDB array schema. In particular, the in-memory schema provides an Arrow-compatible interface to handle nullable attribute values, and specific INFO or FORMAT fields can be dynamically extracted into different in-memory buffers even if they are not stored as separate attributes in the TileDB array. These attributes will correspond to the columns of the resulting dataframe.

| Attribute Name | Description | Datatype | Nullable |
| --- | --- | --- | --- |
| sample_name | The sample name | List<char> | no |
| contig | The contig name of the record | List<char> | no |
| pos_start | The 1-based start position of the record | int32 | no |
| pos_end | The 1-based end position of the record | int32 | no |
| query_bed_start | The 0-based start position of the query BED range that intersected the record | int32 | no |
| query_bed_end | The 1-based end position of the query BED range that intersected the record | int32 | no |
| alleles | List of allele names | List<List<char>> | no |
| id | ID string | List<char> | no |
| filters | List of filter names | List<List<char>> | yes |
| qual | The quality value | float32 | no |
| info_* | A specific INFO field value | List<T> | yes |
| fmt_* | A specific FMT field value | List<T> | yes |
| info | INFO byte blob of non-attribute fields | List<byte> | yes |
| fmt | FORMAT byte blob of non-attribute fields | List<byte> | yes |

Following the Arrow specification, the different attributes need to have specific buffers allocated depending on their datatype (see the sketch after this list):

  • The fixed-length attributes such as qual or pos_start need only a single data buffer allocated, to hold the values per record.

  • The list attributes such as id or sample_name need a data buffer allocated (holding the characters) as well as an int32_t offsets buffer (holding the offsets).

  • The list-of-lists attributes such as filters need a data buffer, an offsets buffer, and a second offsets buffer called the "list offsets" buffer.

  • The nullable attributes such as info additionally need a validity bitmap buffer.
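As a minimal illustration of the data/offsets pairing, here is a pyarrow sketch that builds a List<char> column from raw buffers (the values are made up; TileDB-VCF itself allocates these buffers in C++):

    import numpy as np
    import pyarrow as pa

    # A List<char> (string) column of three sample names: one data
    # buffer of characters plus one int32 offsets buffer
    data = b"sampleAsampleBsampleC"
    offsets = np.array([0, 7, 14, 21], dtype=np.int32)

    names = pa.StringArray.from_buffers(
        length=3,
        value_offsets=pa.py_buffer(offsets.tobytes()),
        data=pa.py_buffer(data),
    )
    print(names.to_pylist())  # ['sampleA', 'sampleB', 'sampleC']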

In general the high-level APIs such as Python and Spark will take care of determining the number and type of the buffers that must be allocated. The TileDB-VCF C API provides a method to retrieve the datatype of a given attribute name.

The info_* / fmt_* attributes are used to extract values of specific INFO or FMT fields. Using these fields does not require that they were stored as extracted attributes during ingestion (although access time may be improved if they were). For example, to extract the MIN_DP format field, a buffer (and offsets, list offsets, and validity bitmap) should be set with the name fmt_MIN_DP. The values stored in the data buffer will be of the datatype specified in the VCF header. As another example, fmt_GT will extract the GT format field, which is a list of integer values.
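With the Python API, requesting such a field is simply a matter of naming it in the attribute list (the dataset URI and region are placeholders):

    import tiledbvcf

    ds = tiledbvcf.Dataset("my-vcf-dataset", mode="r")
    # fmt_MIN_DP is extracted on the fly even if MIN_DP was not
    # materialized as a separate attribute during ingestion
    df = ds.read(
        attrs=["sample_name", "pos_start", "fmt_MIN_DP"],
        regions=["chr1:100-200"],
    )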

The byte format of the data copied into the attribute buffers adheres to the Apache Arrow specification. This allows for a completely zero-copy construction of an Arrow Table using the attribute buffers directly, without any modification. For more details on how to interpret the data stored in the buffers, including the list offsets and validity bitmaps, please see the Arrow documentation.

Because the data being read may be too large to fit in memory, TileDB-VCF fills in as many records as will fit in the user-allocated attribute buffers. An "incomplete" result always contains a whole number of records, i.e. the minimum buffer allocation must be large enough to hold a single record. In this way you can process the records batch by batch, resubmitting the read operation until all intersecting records have been processed. TileDB-VCF provides an API to check, after each batch, whether more records are coming or the result is complete.
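In the Python API the batch loop looks roughly like this (read_completed() and continue_read() follow the TileDB-VCF Python API; process() is a placeholder for user code):

    import tiledbvcf

    ds = tiledbvcf.Dataset("my-vcf-dataset", mode="r")
    df = ds.read(attrs=["sample_name", "pos_start", "pos_end"],
                 regions=["chr1:1-1000000"])
    process(df)  # placeholder: handle the first batch

    # Resubmit until the read reports completion
    while not ds.read_completed():
        df = ds.continue_read()
        process(df)  # placeholder: handle the next batch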