Data Model

The key ideas behind storing a VCF dataset as a 2D sparse array are described in detail in the Storing Variants as Arrays section. At a high level, every variant in each VCF file becomes a cell in a 2D sparse array with a unique coordinate pair (sample, end_pos), where sample is a sample ID and end_pos corresponds to the END field in the VCF format. In other words, the first array axis corresponds to samples (VCF files), and the second axis to the global genomic position domain. The rest of the VCF fields become array attributes. Note that a long gVCF range may generate multiple cells: TileDB-VCF breaks long ranges and introduces "anchors", which duplicate the attribute values of the gVCF entry. Anchors greatly improve read performance in TileDB-VCF.
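As a rough illustration of the record-to-cell mapping, the sketch below turns a single VCF record into a (sample, end_pos) coordinate plus a dictionary of attributes. The names Cell and record_to_cell are hypothetical and not part of TileDB-VCF. The sketch keeps contig-local positions (conversion to the global position domain is omitted) and assumes that a record without an explicit END ends at POS + len(REF) - 1.

```python
from typing import NamedTuple, Optional


class Cell(NamedTuple):
    sample: int    # first coordinate: sample ID (row)
    end_pos: int   # second coordinate: END position (column)
    attrs: dict    # remaining VCF fields, stored as attributes


def record_to_cell(sample_id: int, pos: int, alleles: str,
                   end: Optional[int] = None,
                   qual: Optional[float] = None) -> Cell:
    """Map one VCF record to a sparse-array cell (illustrative only)."""
    ref = alleles.split(",")[0]
    # Assumption: without an explicit END, the record ends at POS + len(REF) - 1.
    end_pos = end if end is not None else pos + len(ref) - 1
    return Cell(sample_id, end_pos, {"pos": pos, "alleles": alleles, "qual": qual})


# A gVCF reference block with an explicit END, and a deletion without one.
gvcf_block = record_to_cell(0, pos=100, alleles="A,<NON_REF>", end=250)
deletion = record_to_cell(1, pos=100, alleles="ACG,A")
print(gvcf_block.sample, gvcf_block.end_pos)
print(deletion.sample, deletion.end_pos)
```

Note that END is used only as a coordinate here, while POS travels with the cell as an attribute.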

Array Schema

The schema of the sparse TileDB array that stores the variant data is straightforward if you are familiar with the concepts behind TileDB array schemas.

The high-level array schema parameters are:

| Parameter | Value |
| --- | --- |
| Array type | Sparse |
| Rank | 2D |
| Domain datatype | uint32_t |
| Cell order | Column-major |
| Tile order | Column-major |

Dimensions

The dimensions in the schema are:

| Dimension Name | TileDB Datatype | Description |
| --- | --- | --- |
| sample | uint32_t | A sample ID, generated by TileDB-VCF for every ingested sample. TileDB-VCF maintains a mapping between sample IDs and the corresponding sample names and VCF file headers. |
| end_pos | uint32_t | VCF END field |

As mentioned before, the coordinates of the 2D array are the sample ID along the first dimension (rows) and the global genomic position along the second dimension (columns). Because the human genome contains only about 3 billion positions, a uint32_t (maximum value 4,294,967,295) can represent every position, so uint32_t is used as the datatype of both dimensions.

By default, the tile extent of the first dimension is 10 (i.e., 10 samples per space tile). This row tile extent is user-configurable on a per-array basis. There is no tiling on the second dimension, for reasons explained in the section on ingestion.
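The effect of the row tile extent can be sketched with simple integer arithmetic. The helper names below are hypothetical, and the sketch assumes that sample IDs start at 0.

```python
ROW_TILE_EXTENT = 10  # default from the text; configurable per array


def row_space_tile(sample_id: int, extent: int = ROW_TILE_EXTENT) -> int:
    """Index of the space tile (along the sample dimension) holding a sample."""
    return sample_id // extent


def samples_in_tile(tile_idx: int, extent: int = ROW_TILE_EXTENT) -> range:
    """Sample IDs covered by one space tile."""
    return range(tile_idx * extent, (tile_idx + 1) * extent)


print(row_space_tile(9), row_space_tile(10))   # neighbouring samples, different tiles
print(list(samples_in_tile(1)))
```

Under these assumptions, samples 0 through 9 share one space tile and sample 10 starts the next one.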

Attributes

For most fields in a single VCF record (i.e. a row in a .vcf file), there is a corresponding attribute in the schema storing the value of that field. The attributes in the schema are:

| Attribute Name | TileDB Datatype | Description |
| --- | --- | --- |
| pos | uint32_t | VCF POS field (in global genomic position) |
| qual | float | VCF QUAL field |
| alleles | var<char> | CSV list of REF and ALT VCF fields |
| id | var<char> | VCF ID field |
| filter_ids | var<int32_t> | Vector of integer IDs of entries in the FILTER VCF field |
| real_end | uint32_t | For "anchor" cells, the END position of the corresponding "real" VCF record. |
| info_* | var<uint8_t> | One or more attributes storing specific VCF INFO fields, e.g. info_DP, info_MQ, etc. |
| fmt_* | var<uint8_t> | One or more attributes storing specific VCF FORMAT fields, e.g. fmt_GT, fmt_MIN_DP, etc. |
| info | var<uint8_t> | Byte blob containing any INFO fields that are not stored as explicit attributes. |
| fmt | var<uint8_t> | Byte blob containing any FMT fields that are not stored as explicit attributes. |

Most of the attributes are fairly self-explanatory and map to individual fields of the VCF data.

The info_* and fmt_* attributes allow individual INFO or FMT VCF fields to be extracted into explicit array attributes. Following the typical reasoning behind column-oriented data stores, this can be beneficial if your read queries frequently access only some of the INFO or FMT fields, as no unrelated data then needs to be fetched from storage.

The choice of which fields to extract as explicit array attributes is user-configurable during ingestion.

Any extra info or format fields not extracted as explicit array attributes are stored in the byte blob attributes info and fmt. The byte format of this blob is explained in the byte format section.
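The split between explicit attributes and the blob can be pictured as a simple partition of a record's fields. The sketch below is only an illustration: partition_info_fields is a hypothetical helper, the field values are made up, and the actual byte serialization of the blob is not shown.

```python
def partition_info_fields(info, extracted):
    """Split INFO fields into explicit attributes and blob leftovers.

    `info` maps INFO field names to values for one record; `extracted` is the
    user-chosen list of fields to store as explicit attributes.
    """
    extracted = set(extracted)
    explicit = {f"info_{name}": value
                for name, value in info.items() if name in extracted}
    leftover = {name: value
                for name, value in info.items() if name not in extracted}
    return explicit, leftover


# Hypothetical record with three INFO fields; DP and MQ are extracted.
explicit, leftover = partition_info_fields(
    {"DP": 35, "MQ": 60.0, "BaseQRankSum": -0.5}, extracted=["DP", "MQ"])
print(sorted(explicit))   # attribute names that get their own columns
print(sorted(leftover))   # fields that would be packed into the `info` blob
```

A query that needs only DP would then read the info_DP attribute and never touch the blob.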

Note that the END position of a VCF record is not stored in an attribute. The END position is already used as the column coordinate, so there is no need to store it again as an attribute.

The real_end attribute is used for the "anchor" cells, and stores the global END position of the variant record containing the anchor. Anchor cells and their implementation are discussed further in the ingestion and read algorithm sections.
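To make the relationship between end_pos and real_end concrete, here is a minimal sketch of one possible anchor placement. It is not the TileDB-VCF ingestion algorithm, which is described in the ingestion section. It assumes that an anchor is emitted every anchor_gap positions after the record's start, and that the "real" cell also carries real_end equal to its own END; both are assumptions made for illustration, as are the function names.

```python
def cells_for_record(sample, pos, end, anchor_gap=1000):
    """Return the cells for one record: anchors first, then the real cell."""
    cells = []
    anchor = pos + anchor_gap
    # Assumed placement: one anchor every `anchor_gap` positions inside the range.
    while anchor < end:
        cells.append({"sample": sample, "end_pos": anchor,
                      "pos": pos, "real_end": end})
        anchor += anchor_gap
    # The real cell sits at the record's actual END coordinate.
    cells.append({"sample": sample, "end_pos": end, "pos": pos, "real_end": end})
    return cells


def is_anchor(cell):
    """Under the assumptions above, anchors are cells whose coordinate is not the real END."""
    return cell["end_pos"] != cell["real_end"]


long_range = cells_for_record(sample=0, pos=100, end=3500)
print([c["end_pos"] for c in long_range])
print([is_anchor(c) for c in long_range])
```

With this placement, any position inside the range is at most anchor_gap positions before a cell belonging to the record, and real_end lets a reader recover the true END from whichever cell it finds.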

There are a few more details about the actual byte data stored in these attributes which will be described in the Byte Format section.

Array Parameters

During ingestion, there are several array-related parameters that the user can control. These are:

  • Row space tile extent (default 10)

  • Array data tile capacity (default 10,000)

  • The "anchor gap" size (default 1,000)

  • The list of INFO and FMT fields to store as explicit array attributes (default is none).

These parameters mostly impact read performance, although they may also impact the size of the persisted array.

Once chosen, these parameters cannot be changed. Therefore, some care should be taken to determine good values of these parameters before ingesting a large amount of data into an array.
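One way to picture these parameters is as an immutable configuration object created once per array. The class below is a hypothetical sketch, not a TileDB-VCF type; its field names follow the general metadata keys used in this document and its defaults are the ones listed above.

```python
from dataclasses import dataclass, FrozenInstanceError


@dataclass(frozen=True)
class ArrayParams:
    """Ingestion-time parameters; frozen because they cannot change later."""
    row_tile_extent: int = 10
    tile_capacity: int = 10_000
    anchor_gap: int = 1_000
    extra_attributes: tuple = ()   # e.g. ("info_DP", "fmt_GT"); default is none


params = ArrayParams(extra_attributes=("info_DP", "fmt_GT"))
print(params)

try:
    params.anchor_gap = 500        # rejected: the dataclass is frozen
except FrozenInstanceError:
    print("parameters are fixed once the array is created")
```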

Metadata

Accompanying the TileDB array storing the actual variant data is a small amount of metadata. There are two types of metadata stored alongside the array:

  • "General" metadata

  • VCF header for every sample

The information stored in the metadata is discussed below. The serialization/byte format of the metadata is discussed separately.

General metadata

The "general" metadata stores the following information:

  • tile_capacity: Array tile capacity value

  • row_tile_extent: Row tile extent value

  • anchor_gap: Anchor gap value

  • extra_attributes: List of INFO or FMT field names that are stored as explicit array attributes

  • all_samples: List of all sample names currently stored in the array

  • sample_ids: Mapping of sample name to row coordinate (used during exports to find the row corresponding to a named sample)

  • free_sample_id: The ID to use for the next sample ingested

  • contig_offsets: Mapping of contig name to offset in the global genomic space (used to convert genomic regions for specific contigs/chromosomes into their column coordinates)

  • contig_lengths: Mapping of contig name to contig length (used for converting column coordinates back to the proper contig name)

  • total_contig_length: Sum of lengths of all contigs

These metadata values are updated during sample ingestion and are used during the export phase. The metadata is stored as "array metadata" in the variant data TileDB array.
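The role of contig_offsets, contig_lengths, and total_contig_length can be sketched as follows. The contig lengths are made-up toy values, the helper functions are hypothetical, and positions are treated as 0-based offsets within a contig, which is an assumption rather than a statement about TileDB-VCF's convention.

```python
UINT32_MAX = 2**32 - 1

# Hypothetical toy contigs, in header order.
contig_lengths = {"chr1": 1000, "chr2": 800, "chr3": 500}


def build_offsets(lengths):
    """Assign each contig a starting offset in the global position domain."""
    offsets, total = {}, 0
    for name, length in lengths.items():
        offsets[name] = total
        total += length
    return offsets, total


def to_global(contig, pos, offsets):
    """Contig-local position -> column coordinate."""
    return offsets[contig] + pos


def to_contig(global_pos, offsets, lengths):
    """Column coordinate -> (contig name, contig-local position)."""
    for name, offset in offsets.items():
        if offset <= global_pos < offset + lengths[name]:
            return name, global_pos - offset
    raise ValueError(f"position {global_pos} is outside all contigs")


contig_offsets, total_contig_length = build_offsets(contig_lengths)
assert total_contig_length <= UINT32_MAX   # the domain must fit in uint32_t
print(contig_offsets, total_contig_length)
print(to_global("chr2", 5, contig_offsets))
print(to_contig(1005, contig_offsets, contig_lengths))
```

The same two mappings serve both directions: offsets turn a query region into column coordinates, and lengths turn a column coordinate back into a contig name on export.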

When ingesting samples, the sample header must be identical for all samples with respect to the contig mappings. That means all samples must have the exact same set of contigs listed in the VCF header. This is because the mapping of contig to global genomic position must be consistent across all samples in order for the export algorithm to work correctly.
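A check of this requirement could look like the sketch below, which extracts the ##contig lines from raw header text and compares them across samples. It is not the check TileDB-VCF performs: the function names are hypothetical, and the parser is deliberately simple (it assumes no quoted values containing commas inside a ##contig entry).

```python
def contigs_from_header(header_text):
    """Return [(contig ID, length or None), ...] in header order."""
    prefix = "##contig=<"
    contigs = []
    for line in header_text.splitlines():
        line = line.strip()
        if line.startswith(prefix) and line.endswith(">"):
            pairs = (kv.split("=", 1) for kv in line[len(prefix):-1].split(",")
                     if "=" in kv)
            fields = dict(pairs)
            length = int(fields["length"]) if "length" in fields else None
            contigs.append((fields.get("ID"), length))
    return contigs


def same_contigs(headers):
    """True if every header lists exactly the same contigs in the same order."""
    parsed = [contigs_from_header(h) for h in headers]
    return all(p == parsed[0] for p in parsed)


header_a = "##fileformat=VCFv4.2\n##contig=<ID=chr1,length=1000>\n##contig=<ID=chr2,length=800>\n"
header_b = "##fileformat=VCFv4.2\n##source=other\n##contig=<ID=chr1,length=1000>\n##contig=<ID=chr2,length=800>\n"
header_c = "##fileformat=VCFv4.2\n##contig=<ID=chr1,length=1000>\n"

print(same_contigs([header_a, header_b]))
print(same_contigs([header_a, header_c]))
```

Headers may differ in other lines, as header_a and header_b do, as long as the contig entries agree.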

VCF sample headers

In addition to the general metadata detailed above, we also store the original text of the VCF header for every sample in the array. This is for two reasons:

  • To ensure that the original VCF file can always be recovered for any given sample stored in the array. The original header is prepended to the VCF file during export.

  • To reconstruct an htslib header instance during reading, used for operations such as mapping a filter ID back to the filter string, etc.

Because the VCF header data can be quite large, and there can be many samples stored in the array, the header data is stored in a separate TileDB array alongside the data array itself. The array schema is quite simple:

| Parameter | Value |
| --- | --- |
| Array type | Dense |
| Rank | 1D |
| Tile extent | 10 |
| Dimension type | uint32_t |

And there is only a single attribute:

| Attribute Name | TileDB Datatype | Description |
| --- | --- | --- |
| header | var<char> | Original text of the VCF header |

The dense vector is indexed by sample ID, which allows for efficient retrieval of a specific sample's VCF header during export.
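The lookup path from a sample name to its header can be sketched with two in-memory stand-ins: a dictionary for the sample_ids metadata and a list for the dense header array. The sample names and header text are made up, and header_for is a hypothetical helper.

```python
# Stand-in for the `sample_ids` general metadata (name -> row coordinate).
sample_ids = {"sampleA": 0, "sampleB": 1}

# Stand-in for the 1D dense header array, indexed by sample ID.
headers = [
    "##fileformat=VCFv4.2\n##source=callerA\n",
    "##fileformat=VCFv4.2\n##source=callerB\n",
]


def header_for(sample_name):
    """Resolve a sample name to its ID, then index the dense vector."""
    return headers[sample_ids[sample_name]]


print(header_for("sampleB"))
```

Because the vector is dense and indexed directly by ID, the lookup needs no search once the name has been resolved.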

Putting It All Together

To summarize, we've described three main entities:

  • The variant data array (2D sparse)

  • The general metadata, stored in the variant data array as metadata

  • The VCF header array (1D dense)

Together, these three entities form what we term a "TileDB-VCF dataset." Expressed as a directory hierarchy, a TileDB-VCF dataset has the following structure:

<dataset_uri>/
|_ __tiledb_group.tdb
|_ data/
   |_ __array_schema.tdb
   |_ __meta/
      |_ <general-metadata-here>
   ... <other array directories and files>
|_ vcf_headers/
   |_ __array_schema.tdb
   ... <other array directories and files>

The root of the dataset, <dataset_uri>, is a TileDB group. The data member is the TileDB 2D sparse array storing the variant data. This array stores the general TileDB-VCF metadata as its array metadata in the folder data/__meta. The vcf_headers member is the TileDB 1D dense array containing the VCF header data.
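Given this layout, the locations of the dataset's members follow directly from the dataset URI. The helper below is a hypothetical convenience that only composes strings from the member names shown above; it does not open or validate anything, and the bucket name in the usage line is made up.

```python
def dataset_members(dataset_uri):
    """Derive member URIs from the root URI of a TileDB-VCF dataset."""
    root = dataset_uri.rstrip("/")
    return {
        "group": root,                          # TileDB group at the root
        "data_array": f"{root}/data",           # 2D sparse variant data array
        "data_metadata": f"{root}/data/__meta", # general metadata folder
        "header_array": f"{root}/vcf_headers",  # 1D dense VCF header array
    }


members = dataset_members("s3://my-bucket/my-dataset/")
for name, uri in members.items():
    print(f"{name}: {uri}")
```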