TileDB-VCF is an open-source C++ library for efficient storage and retrieval of genomic variant call data based on TileDB Open Source.
TileDB was first used to build a variant store system in 2015, as part of a collaboration between Intel Labs, Intel Health and Life Sciences, MIT and the Broad Institute. The resulting software, GenomicsDB, became part of Broad's GATK 4.0. TileDB-VCF is an evolution of that work, built on an improved and always up-to-date version of TileDB Open Source, incorporating new algorithms, features and optimizations. TileDB-VCF is the product of a collaboration between TileDB, Inc. and Helix.
TileDB-VCF offers several important benefits:
Performance: TileDB-VCF is optimized for rapidly slicing variant records by genomic regions across arbitrary numbers of samples. Additional features for ingesting VCF files and performing genomic interval intersections are all implemented in C++ for maximum speed.
Compressibility: TileDB-VCF can efficiently store any number of samples in a compressed and lossless manner. Being a columnar format, TileDB-VCF can compress different VCF fields with different compressors, choosing the most appropriate one depending on the data types and nature of the data per field.
Optimized for cloud object stores: Built on TileDB Open Source, TileDB-VCF inherits all its features out of the box, such as the speed and optimizations on numerous storage backends, including cloud object stores such as AWS S3, Azure Blob Storage and Google Cloud Storage.
Updatability: TileDB-VCF allows you to rapidly add new samples to existing datasets, eliminating the so-called N+1 problem and scaling both storage and update time linearly with the number of new samples.
Interoperability: In addition to a command-line interface, TileDB-VCF provides C++ and Python APIs, as well as integrations with Spark and Dask. This allows the user to utilize a plethora of tools from the Data Science and Machine Learning ecosystems in a seamless and natural fashion, without having to convert data from one format to another or use unfamiliar APIs.
Open-source: TileDB Open Source and TileDB-VCF are both open source and developed under the permissive MIT license. We are passionate about supporting the scientific community with open-source software and welcome feedback and contributions to further improve our projects.
Battle-tested: TileDB-VCF has been used in production by very large customers to manage immense amounts of variant data, on the order of many hundreds of thousands of samples.
The documentation of TileDB-VCF includes information about the data model (and its mapping to 3D sparse arrays), installation instructions, How To guides and the API reference:
TileDB-VCF uses 3D sparse arrays to store genomic variant data. This section describes the implementation details of the underlying data model, including array schemas, dimensions, tiling order, attributes, and metadata. It is recommended that you read the following section to fully understand the mapping between the TileDB-VCF data and TileDB arrays and groups:
A TileDB-VCF dataset is composed of a group of two separate TileDB arrays:
a 3D sparse array for the actual genomic variants and associated fields/attributes
a 1D sparse array for the metadata stored in each single-sample VCF header
The dimensions in the schema are:
The coordinates of the 3D array are the contig along the first dimension, the chromosomal start position of the variant along the second dimension, and the sample name along the third dimension.
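To make the coordinate scheme concrete, here is a minimal pure-Python sketch. It is not TileDB-VCF code: each variant is modeled as a cell addressed by a (contig, start position, sample) tuple, and a region query is a filter over those coordinates. The cell values and the slice_region helper are hypothetical, and the sketch ignores variant end positions and anchors.

```python
from typing import NamedTuple


class Cell(NamedTuple):
    contig: str      # first dimension
    start_pos: int   # second dimension
    sample: str      # third dimension


# Hypothetical cells; a sparse array stores only coordinates that hold data.
cells = [
    Cell("chr1", 1500, "sampleB"),
    Cell("chr1", 100, "sampleA"),
    Cell("chr2", 50, "sampleA"),
    Cell("chr1", 100, "sampleB"),
]


def slice_region(cells, contig, start, end, samples=None):
    """Return cells on `contig` whose start position lies in [start, end].

    If `samples` is given, only cells for those sample names are kept.
    """
    hits = [
        c for c in cells
        if c.contig == contig
        and start <= c.start_pos <= end
        and (samples is None or c.sample in samples)
    ]
    # Sort by (contig, start_pos, sample), i.e. in dimension order.
    return sorted(hits)


result = slice_region(cells, "chr1", 1, 1000)
```

Here result holds the two chr1 cells at position 100, one per sample, which mirrors slicing a genomic region across all samples.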
For each field in a single-sample VCF record there is a corresponding attribute in the schema.
The info_* and fmt_* attributes allow individual INFO or FMT VCF fields to be extracted into explicit array attributes. This can be beneficial if your queries frequently access only a subset of the INFO or FMT fields, as no unrelated data then needs to be fetched from storage.
The choice of which fields to extract as explicit array attributes is user-configurable during array creation.
Any extra INFO or FMT fields not extracted as explicit array attributes are stored in the byte blob attributes, info and fmt.
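The split between explicit attributes and the catch-all attributes can be sketched as follows. This is an illustration under stated assumptions, not the library's implementation: the split_fields helper and the sample values are hypothetical, and the remainder is kept as a plain dict here, whereas TileDB-VCF stores it as a byte blob.

```python
def split_fields(fields, extracted, prefix):
    """Split one record's fields into explicit attributes and a remainder.

    `fields` maps field names to values, `extracted` is the set of field
    names chosen at array creation, and `prefix` is "info" or "fmt".
    """
    explicit = {f"{prefix}_{k}": v for k, v in fields.items() if k in extracted}
    remainder = {k: v for k, v in fields.items() if k not in extracted}
    return explicit, remainder


# Hypothetical INFO fields of a single record.
info = {"DP": 35, "AF": 0.5, "MQ": 60}
explicit, remainder = split_fields(info, extracted={"DP"}, prefix="info")
```

A query that only needs DP would then touch only the info_DP attribute, while AF and MQ stay together in the remainder.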
anchor_gap: Anchor gap value
extra_attributes: List of INFO or FMT field names that are stored as explicit array attributes
version: Array schema version
These metadata values are updated during array creation, and are used during the export phase. The metadata is stored as "array metadata" in the sparse data array.
When ingesting samples, the sample header must be identical for all samples with respect to the contig mappings. That means all samples must have the exact same set of contigs listed in the VCF header. This requirement will be relaxed in future versions.
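A pre-ingestion check for this requirement could look like the sketch below. It is a hedged illustration rather than part of TileDB-VCF: it assumes header lines of the standard form ##contig=<ID=...>, and the contigs and same_contigs helpers are hypothetical names.

```python
import re

# Matches VCF header lines such as: ##contig=<ID=chr1,length=248956422>
CONTIG_RE = re.compile(r"^##contig=<ID=([^,>]+)")


def contigs(header_text):
    """Return the contig IDs declared in a VCF header, in file order."""
    return [
        m.group(1)
        for line in header_text.splitlines()
        if (m := CONTIG_RE.match(line))
    ]


def same_contigs(headers):
    """True if every header declares exactly the same set of contigs."""
    sets = [set(contigs(h)) for h in headers]
    return all(s == sets[0] for s in sets)


h1 = "##fileformat=VCFv4.2\n##contig=<ID=chr1,length=100>\n##contig=<ID=chr2,length=200>"
h2 = "##fileformat=VCFv4.2\n##contig=<ID=chr2,length=200>\n##contig=<ID=chr1,length=100>"
h3 = "##fileformat=VCFv4.2\n##contig=<ID=chr1,length=100>"

ok = same_contigs([h1, h2])        # same set, different order
mismatch = same_contigs([h1, h3])  # h3 lacks chr2
```

The comparison is on sets, so header order does not matter in this sketch; whether TileDB-VCF itself is order-sensitive is not stated here.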
The vcf_headers array stores the original text of every ingested VCF header in order to:
ensure the original VCF file can be fully recovered for any given sample
reconstruct an htslib header instance when reading from the dataset, which is used for operations such as mapping a filter ID back to the filter string
To summarize, we've described three main entities:
The variant data array (3D sparse)
The general metadata, stored in the variant data array as metadata
The VCF header array (1D sparse)
All together we term this a "TileDB-VCF dataset." Expressed as a directory hierarchy, a TileDB-VCF dataset has the following structure:
The root of the dataset, <dataset_uri>, is a TileDB group. The data member is the TileDB 3D sparse array storing the variant data. This array stores the general TileDB-VCF metadata as its array metadata in the folder data/__meta. The vcf_headers member is the TileDB 1D sparse array containing the VCF header data.
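The members named above can be laid out with a short sketch. It only joins the path components mentioned in the text (data, data/__meta, vcf_headers) onto a dataset root; the dataset_members helper and the "my_dataset" root are hypothetical, and the sketch assumes a local path rather than an object-store URI.

```python
from pathlib import PurePosixPath


def dataset_members(dataset_uri):
    """Map each described member of a TileDB-VCF dataset to its path."""
    root = PurePosixPath(dataset_uri)
    return {
        "group": root,                            # TileDB group (dataset root)
        "data": root / "data",                    # 3D sparse variant array
        "data_metadata": root / "data" / "__meta",  # general metadata
        "vcf_headers": root / "vcf_headers",      # 1D sparse header array
    }


layout = dataset_members("my_dataset")
for role, path in layout.items():
    print(f"{role:>14}: {path}")
```

A real dataset directory contains additional TileDB-internal files that this sketch does not attempt to list.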
During array creation, there are several array-related parameters that the user can control. These are:
Array data tile capacity (default 10,000)
The "anchor gap" size (default 1,000)
The list of INFO and FMT fields to store as explicit array attributes (default is none).
Once chosen, these parameters cannot be changed.
During sample ingestion, the user can specify the:
Sample batch size (default 10)
The above parameters may impact read and write performance, as well as the size of the persisted array. Therefore, some care should be taken to determine good values for these parameters before ingesting a large amount of data into an array.
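One way to keep track of these choices in your own tooling is sketched below. The class and field names are hypothetical and are not part of any TileDB-VCF API; only the default values come from the lists above. The creation-time parameters are modeled as a frozen dataclass to reflect that they cannot be changed once chosen.

```python
from dataclasses import dataclass, FrozenInstanceError


@dataclass(frozen=True)
class CreationParams:
    """Parameters fixed at array creation (defaults from the text above)."""
    tile_capacity: int = 10_000
    anchor_gap: int = 1_000
    extra_attributes: tuple = ()  # no INFO/FMT fields extracted by default


@dataclass
class IngestionParams:
    """Parameters that can be chosen per ingestion run."""
    sample_batch_size: int = 10


params = CreationParams(extra_attributes=("fmt_DP",))

try:
    params.anchor_gap = 500  # rejected: the dataclass is frozen
    changed = True
except FrozenInstanceError:
    changed = False
```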
You can install TileDB-VCF in two ways:
Indexed files are required for ingestion. If your VCF/BCF files have not been indexed, you can use to do so:
You can ingest samples into an already created dataset as follows:
Just add a regular expression for the VCF file locations at the end of the store command:
Alternatively, provide a text file with the absolute locations of the VCF files, separated by a new line:
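Such a file could be prepared with a few lines of Python. This is a minimal sketch, assuming local files: the write_samples_file helper and the file names are hypothetical, and the empty files created here are stand-ins for real, indexed VCF files.

```python
import tempfile
from pathlib import Path


def write_samples_file(vcf_paths, out_path):
    """Write one absolute VCF path per line and return the written paths."""
    lines = [str(Path(p).resolve()) for p in vcf_paths]
    Path(out_path).write_text("\n".join(lines) + "\n")
    return lines


with tempfile.TemporaryDirectory() as tmp:
    # Empty placeholder files standing in for real VCFs.
    vcfs = [Path(tmp) / "sampleA.vcf.gz", Path(tmp) / "sampleB.vcf.gz"]
    for v in vcfs:
        v.touch()

    samples_file = Path(tmp) / "samples.txt"
    written = write_samples_file(vcfs, samples_file)
    content = samples_file.read_text()
```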
Incremental updates work in the same manner as the ingestion above; nothing special is needed. In addition, ingestion is thread- and process-safe and can therefore be performed in parallel.
The first step before ingesting any VCF samples is to create a dataset. This effectively creates a TileDB group and the appropriate empty arrays in it.
If you wish to turn some of the INFO and FMT fields into separate materialized attributes, you can do so as follows (names should be fmt_X or info_X for a field name X, which is case sensitive).
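The naming rule can be illustrated with a small helper. This is a sketch only: attribute_name is a hypothetical function, not part of TileDB-VCF, and it simply builds names of the form fmt_X or info_X while preserving the case of X.

```python
def attribute_name(kind, field):
    """Build an attribute name of the form fmt_X or info_X.

    `kind` must be "fmt" or "info"; `field` is used exactly as given,
    because field names are case sensitive.
    """
    if kind not in ("fmt", "info"):
        raise ValueError(f"kind must be 'fmt' or 'info', got {kind!r}")
    if not field:
        raise ValueError("field name must not be empty")
    return f"{kind}_{field}"


names = [attribute_name("fmt", "DP"), attribute_name("info", "AF")]
# "DP" and "dp" yield different attribute names.
distinct = attribute_name("fmt", "DP") != attribute_name("fmt", "dp")
```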
We are building the following extensions for Genomics: