The key ideas behind storing a VCF dataset as a 2D sparse array are described in detail in the Storing Variants as Arrays section. At a high level, every variant in each VCF file becomes a cell in a 2D sparse array, with a unique coordinate pair (sample, end_pos), where sample is a sample id and end_pos corresponds to the END field in the VCF format. In other words, the first array axis corresponds to samples/VCF files, and the second axis to the global genomic position domain. The rest of the VCF fields become array attributes. Note that a gVCF range may generate multiple cells if it is too long: TileDB-VCF breaks long ranges and introduces "anchors", which duplicate the attribute values of the gVCF entry. This greatly improves read performance in TileDB-VCF.
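The anchor idea can be illustrated with a minimal Python sketch. It is not the TileDB-VCF implementation: the placement rule (one anchor every anchor_gap positions after the record start), the Cell type, and the function name are assumptions made for illustration, and real_end is filled in for every cell only to keep the sketch short.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cell:
    sample: int      # row coordinate: sample id
    end_pos: int     # column coordinate in the global genomic domain
    real_end: int    # END of the record this cell belongs to
    is_anchor: bool
    attrs: dict      # remaining VCF fields, shared verbatim by anchors


def cells_for_record(sample, start, end, attrs, anchor_gap=1000):
    """Return the anchor cells for one record followed by its own cell."""
    # Assumption: anchors sit every `anchor_gap` positions after `start`,
    # strictly before the record's own END.
    anchors = [
        Cell(sample, pos, end, True, attrs)
        for pos in range(start + anchor_gap, end, anchor_gap)
    ]
    return anchors + [Cell(sample, end, end, False, attrs)]


cells = cells_for_record(sample=0, start=100, end=3500, attrs={"filter": "PASS"})
print([(c.end_pos, c.is_anchor) for c in cells])
```

Under these assumptions a record shorter than the anchor gap produces a single cell, while a long gVCF range produces several cells that all carry the same attribute values.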
The sparse TileDB array schema for the variant data array is quite straightforward to understand if you are familiar with the concepts behind TileDB array schemas.
The high-level array schema parameters are:
The dimensions in the schema are:
A sample id, generated by TileDB-VCF for every ingested sample. TileDB maintains a mapping between sample ids and corresponding sample names and VCF file headers.
As mentioned before, the coordinates of the 2D array are sample ID/index along the first dimension (rows), and global genomic position along the second dimension (columns). Because the human genome only contains about 3 billion possible positions, a uint32_t domain type can represent all positions, and so we choose that as the dimension type of the array.
The tile extent (by default) of the first dimension is 10 (i.e. 10 samples per space tile). There is no tiling on the second dimension, which will be explained in the section on ingestion. The row tile extent is user-configurable on a per-array basis.
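As a quick check of the two claims above, the following Python sketch verifies that roughly 3 billion positions fit in an unsigned 32-bit coordinate and shows which row space tile a sample id falls into. The function names are hypothetical, and the sketch assumes that sample ids start at 0.

```python
UINT32_MAX = 2**32 - 1  # largest value a uint32_t coordinate can hold


def fits_uint32(num_positions):
    """True if positions 0 .. num_positions - 1 are all valid uint32 values."""
    return 0 < num_positions <= UINT32_MAX + 1


def row_space_tile(sample_id, row_tile_extent=10):
    """Index of the space tile (along the sample dimension) holding a sample."""
    # Assumption: sample ids start at 0, so tile k covers ids
    # k * extent .. (k + 1) * extent - 1.
    return sample_id // row_tile_extent


print(fits_uint32(3_000_000_000))
print([row_space_tile(s) for s in (0, 9, 10, 25)])
```

With the default extent of 10, samples 0 through 9 share one space tile, samples 10 through 19 the next, and so on.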
For most fields in a single VCF record (i.e. a row in a .vcf file), there is a corresponding attribute in the schema storing the value of that field. The attributes in the schema are:
CSV list of
Vector of integer IDs of entries in the
For "anchor" cells, the
One or more attributes storing specific VCF
One or more attributes storing specific VCF
Byte blob containing any
Byte blob containing any
Most of the attributes are fairly self-explanatory and map to individual fields of the VCF data.
The fmt_* attributes allow individual FMT VCF fields to be extracted into explicit array attributes. Following the typical reasoning behind column-oriented data stores, this can be beneficial if your read queries frequently access only some of the FMT fields, as no unrelated data then needs to be fetched from storage.
Any extra info or format fields not extracted as explicit array attributes are stored in the byte blob attributes fmt. The byte format of this blob is explained in the byte format section.
The END position of VCF records is not stored in an attribute. This is because the END position is used as the column coordinate, and there is no need to store it again as an attribute.
The real_end attribute is used for the "anchor" cells, and stores the global END position of the variant record containing the anchor. Anchor cells and their implementation are explained further in the ingestion and read algorithm sections.
There are a few more details about the actual byte data stored in these attributes which will be described in the Byte Format section.
During ingestion, there are several array-related parameters that the user can control. These are:
Row space tile extent (default 10)
Array data tile capacity (default 10,000)
The "anchor gap" size (default 1,000)
The list of FMT fields to store as explicit array attributes (default is none).
These parameters mostly impact read performance, although they may also impact the size of the persisted array.
Once chosen, these parameters cannot be changed. Therefore, some care should be taken to determine good values of these parameters before ingesting a large amount of data into an array.
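One way to picture these parameters is as an immutable configuration object that is fixed when the array is created. The Python sketch below is only an illustration: the class name is hypothetical, and the field names are borrowed from the general metadata keys (row_tile_extent, tile_capacity, anchor_gap, extra_attributes) rather than from any TileDB-VCF API.

```python
from dataclasses import dataclass, FrozenInstanceError


@dataclass(frozen=True)
class IngestionParams:
    row_tile_extent: int = 10       # samples per space tile
    tile_capacity: int = 10_000     # cells per data tile
    anchor_gap: int = 1_000         # the "anchor gap" size
    extra_attributes: tuple = ()    # FMT fields stored as explicit attributes


params = IngestionParams(extra_attributes=("fmt_GT", "fmt_DP"))

try:
    # Mirrors the rule that the values cannot change once chosen.
    params.anchor_gap = 500
except FrozenInstanceError:
    print("parameters are fixed once chosen")
```

The fmt_GT and fmt_DP names are example values only. Freezing the object reflects the constraint described above: picking different values means creating a new array and ingesting again.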
Accompanying the TileDB array storing the actual variant data is a small amount of metadata. There are two types of metadata stored alongside the array:
General metadata
VCF header for every sample
The information stored in the metadata is discussed below. The serialization/byte format of the metadata is discussed separately.
The "general" metadata stores the following information:
tile_capacity Array tile capacity value
row_tile_extent Row tile extent value
anchor_gap Anchor gap value
extra_attributes List of FMT field names that are stored as explicit array attributes.
all_samples List of all sample names currently stored in the array.
sample_ids Mapping of sample name to row coordinate (used during exports to find the row corresponding to a named sample)
free_sample_id The ID to use for the next sample ingested
contig_offsets Mapping of contig name to offset in the global genomic space (used to convert genomic regions for specific contigs/chromosomes into their column coordinates)
contig_lengths Mapping of contig name to contig length (used for converting column coordinates back to the proper contig name)
total_contig_length Sum of lengths of all contigs
These metadata values are updated during sample ingestion, and are used during the export phase. The metadata is stored as "array metadata" in the variant data TileDB array.
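The role of contig_offsets, contig_lengths, and total_contig_length can be sketched in a few lines of Python. This is a minimal illustration, not the TileDB-VCF code: the function names are hypothetical, the contig lengths are toy values, and positions are assumed to be 0-based within each contig.

```python
from bisect import bisect_right


def build_offsets(contig_lengths):
    """Lay contigs end to end; return (contig_offsets, total_contig_length)."""
    offsets, total = {}, 0
    for name, length in contig_lengths.items():
        offsets[name] = total
        total += length
    return offsets, total


def to_column(contig_offsets, contig, pos):
    """Convert a (contig, position) pair into a global column coordinate."""
    return contig_offsets[contig] + pos


def to_contig(contig_offsets, contig_lengths, column):
    """Convert a global column coordinate back into (contig, position)."""
    names = sorted(contig_offsets, key=contig_offsets.get)
    starts = [contig_offsets[n] for n in names]
    i = bisect_right(starts, column) - 1  # last contig starting at or before column
    if i < 0:
        raise ValueError("column is before the first contig")
    name, pos = names[i], column - starts[i]
    if pos >= contig_lengths[name]:
        raise ValueError("column is past the end of the last contig")
    return name, pos


contig_lengths = {"chr1": 1000, "chr2": 500, "chr3": 250}  # toy values
contig_offsets, total_contig_length = build_offsets(contig_lengths)
column = to_column(contig_offsets, "chr2", 42)
print(column, to_contig(contig_offsets, contig_lengths, column))
```

The offsets serve region queries (contig to column), while the lengths bound the reverse lookup used when exported cells are mapped back to contig names.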
In addition to the general metadata detailed above, we also store the original text of the VCF header for every sample in the array. This is for two reasons:
To ensure that the original VCF file can always be recovered for any given sample stored in the array. The original header is prepended to the VCF file during export.
To reconstruct an htslib header instance during reading, used for operations such as mapping a filter ID back to the filter string.
Because the VCF header data can be quite large, and there can be many samples stored in the array, the header data is stored in a separate TileDB array alongside the data array itself. The array schema is quite simple:
And there is only a single attribute:
Original text of the VCF header
The dense vector is indexed by sample ID, which allows for efficient retrieval of a specific sample's VCF header during export.
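The lookup path from sample name to header can be modeled in memory as follows. This Python sketch is an assumption-laden illustration: a plain list stands in for the 1D dense array, the class and method names are hypothetical, and sample ids are assumed to start at 0 and grow by one per ingested sample.

```python
class HeaderStore:
    """In-memory stand-in for sample_ids, free_sample_id and the header array."""

    def __init__(self):
        self.sample_ids = {}     # sample name -> sample id (row coordinate)
        self.free_sample_id = 0  # id to assign to the next ingested sample
        self.headers = []        # dense vector of header text, indexed by id

    def add_sample(self, name, header_text):
        if name in self.sample_ids:
            raise ValueError(f"sample {name!r} already ingested")
        sample_id = self.free_sample_id
        self.sample_ids[name] = sample_id
        self.headers.append(header_text)  # lands at index == sample_id
        self.free_sample_id += 1
        return sample_id

    def header_for(self, name):
        # One dictionary lookup plus one index into the dense vector.
        return self.headers[self.sample_ids[name]]


store = HeaderStore()
store.add_sample("sampleA", "##fileformat=VCFv4.2\n")
store.add_sample("sampleB", "##fileformat=VCFv4.3\n")
print(repr(store.header_for("sampleB")))
```

Because the vector is dense and indexed directly by sample id, fetching one sample's header does not require scanning the headers of other samples.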
To summarize, we've described three main entities:
The variant data array (2D sparse)
The general metadata, stored in the variant data array as metadata
The VCF header array (1D dense)
All together we term this a "TileDB-VCF dataset." Expressed as a directory hierarchy, a TileDB-VCF dataset has the following structure:
<dataset_uri>/
  |_ __tiledb_group.tdb
  |_ data/
      |_ __array_schema.tdb
      |_ __meta/
          |_ <general-metadata-here>
      ... <other array directories and files>
  |_ vcf_headers/
      |_ __array_schema.tdb
      ... <other array directories and files>
The root of the dataset, <dataset_uri>, is a TileDB group. The data member is the TileDB 2D sparse array storing the variant data. This array stores the general TileDB-VCF metadata as its array metadata in the folder data/__meta. The vcf_headers member is the TileDB 1D dense array containing the VCF header data.
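For scripts that need to locate the members of a dataset, the layout above can be captured in a small Python helper. The function name and dictionary keys are hypothetical, and the sketch only handles filesystem-style paths, since PurePosixPath collapses the double slash in URIs such as s3://.

```python
from pathlib import PurePosixPath


def dataset_members(dataset_uri):
    """Return the expected member paths of a TileDB-VCF dataset directory."""
    root = PurePosixPath(dataset_uri)
    return {
        "group_marker": root / "__tiledb_group.tdb",
        "data_array": root / "data",
        "general_metadata": root / "data" / "__meta",
        "vcf_headers_array": root / "vcf_headers",
    }


members = dataset_members("/tmp/my_dataset")
for name, path in members.items():
    print(f"{name}: {path}")
```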