TileDB-VCF is an open-source C++ library for efficient storage and retrieval of genomic variant call data based on TileDB Open Source.
TileDB was first used to build a variant store system in 2015, as part of a collaboration between Intel Labs, Intel Health and Life Sciences, MIT and the Broad Institute. The resulting software, GenomicsDB, became part of Broad's GATK 4.0. TileDB-VCF is an evolution of that work, built on an improved and always up-to-date version of TileDB Open Source, incorporating new algorithms, features and optimizations. TileDB-VCF is the product of a collaboration between TileDB, Inc. and Helix.
TileDB-VCF offers several important benefits:
Performance: TileDB-VCF is optimized for rapidly slicing variant records by genomic regions across arbitrary numbers of samples. Additional features for ingesting VCF files and performing genomic interval intersections are all implemented in C++ for maximum speed.
Compressibility: TileDB-VCF can efficiently store any number of samples in a compressed and lossless manner. Being a columnar format, TileDB-VCF can compress different VCF fields with different compressors, choosing the most appropriate one depending on the data types and nature of the data per field.
Optimized for cloud object stores: Built on TileDB Open Source, TileDB-VCF inherits all its features out of the box, such as the speed and optimizations on numerous storage backends, including cloud object stores such as AWS S3, Azure Blob Storage and Google Cloud Storage.
Updatability: TileDB-VCF allows you to rapidly add new samples to existing datasets, eliminating the so-called N+1 problem and scaling both storage and update time linearly with the number of new samples.
Interoperability: In addition to a command-line interface, TileDB-VCF provides C++ and Python APIs, as well as integrations with Spark and Dask. This allows the user to utilize a plethora of tools from the Data Science and Machine Learning ecosystems in a seamless and natural fashion, without having to convert data from one format to another or use unfamiliar APIs.
Open-source: TileDB Open Source and TileDB-VCF are both open source and developed under the permissive MIT license. We are passionate about supporting the scientific community with open-source software and welcome feedback and contributions to further improve our projects.
Battle-tested: TileDB-VCF has been used in production by very large customers to manage immense amounts of variant data, on the order of many hundreds of thousands of samples.
The documentation of TileDB-VCF includes information about the data model (and its mapping to 3D sparse arrays), installation instructions, How To guides, and the API reference.
You can install TileDB-VCF in two ways, described in the installation section below.
TileDB-VCF uses 3D sparse arrays to store genomic variant data. This section describes the technical implementation details about the underlying model, including array schemas, dimensions, tiling order, attributes, and metadata. It is recommended that you read the following section to fully understand the mapping between the TileDB-VCF data and TileDB arrays and groups:
A TileDB-VCF dataset is composed of a group of two separate TileDB arrays:
a 3D sparse array for the actual genomic variants and associated fields/attributes
a 1D sparse array for the metadata stored in each single-sample VCF header
The 3D sparse data array has the following schema parameters:

| Parameter | Value |
| --- | --- |
| Array type | Sparse |
| Rank | 3D |
| Cell order | Row-major |
| Tile order | Row-major |
The dimensions in the schema are:

| Dimension Name | TileDB Datatype | Corresponding VCF Field |
| --- | --- | --- |
| `contig` | `TILEDB_STRING_ASCII` | `CHR` |
| `start_pos` | `uint32_t` | VCF `POS`, plus TileDB anchors |
| `sample` | `TILEDB_STRING_ASCII` | Sample name |
The coordinates of the 3D array are the contig along the first dimension, the chromosomal location where the variant starts along the second dimension, and the sample name along the third dimension. For example, a variant called at position 12345 of chr1 in sample HG00096 is stored at coordinates `("chr1", 12345, "HG00096")`.
For each field in a single-sample VCF record there is a corresponding attribute in the schema.
| Attribute Name | TileDB Datatype | Description |
| --- | --- | --- |
| `end_pos` | `uint32_t` | VCF `END` position of VCF records |
| `qual` | `float` | VCF `QUAL` field |
| `alleles` | `var<char>` | CSV list of the `REF` and `ALT` VCF fields |
| `id` | `var<char>` | VCF `ID` field |
| `filter_ids` | `var<int32_t>` | Vector of integer IDs of entries in the `FILTER` VCF field |
| `real_start_pos` | `uint32_t` | VCF `POS` (no anchors) |
| `info` | `var<uint8_t>` | Byte blob containing any `INFO` fields that are not stored as explicit attributes |
| `fmt` | `var<uint8_t>` | Byte blob containing any `FMT` fields that are not stored as explicit attributes |
| `info_*` | `var<uint8_t>` | One or more attributes storing specific VCF `INFO` fields, e.g. `info_DP`, `info_MQ`, etc. |
| `fmt_*` | `var<uint8_t>` | One or more attributes storing specific VCF `FORMAT` fields, e.g. `fmt_GT`, `fmt_MIN_DP`, etc. |
| `info_TILEDB_IAF` | `var<float>` | Computed allele frequency |
The `info_*` and `fmt_*` attributes allow individual `INFO` or `FMT` VCF fields to be extracted into explicit array attributes. This can be beneficial if your queries frequently access only a subset of the `INFO` or `FMT` fields, as no unrelated data then needs to be fetched from storage.
The choice of which fields to extract as explicit array attributes is user-configurable during array creation.
Any extra `INFO` or `FORMAT` fields not extracted as explicit array attributes are stored in the byte blob attributes, `info` and `fmt`.
The dataset also stores the following general metadata:

| Metadata Field | Description |
| --- | --- |
| `anchor_gap` | Anchor gap value |
| `extra_attributes` | List of `INFO` or `FMT` field names that are stored as explicit array attributes |
| `version` | Array schema version |
These metadata values are updated during array creation, and are used during the export phase. The metadata is stored as "array metadata" in the sparse data array.
When ingesting samples, the sample header must be identical for all samples with respect to the contig mappings. That means all samples must have the exact same set of contigs listed in the VCF header. This requirement will be relaxed in future versions.
The `vcf_headers` array stores the original text of every ingested VCF header in order to:
- ensure the original VCF file can be fully recovered for any given sample
- reconstruct an `htslib` header instance when reading from the dataset, which is used for operations such as mapping a filter ID back to the filter string, etc.
The `vcf_headers` array has the following schema parameters:

| Parameter | Value |
| --- | --- |
| Array type | Sparse |
| Rank | 1D |
| Cell order | Row-major |
| Tile order | Row-major |
Its single dimension is:

| Dimension Name | TileDB Datatype | Description |
| --- | --- | --- |
| `sample` | `TILEDB_STRING_ASCII` | Sample name |

Its sole attribute is:

| Attribute Name | TileDB Datatype | Description |
| --- | --- | --- |
| `header` | `var<char>` | Original text of the VCF header |
To summarize, we've described three main entities:
- The variant data array (3D sparse)
- The general metadata, stored in the variant data array as array metadata
- The VCF header array (1D sparse)
All together we term this a "TileDB-VCF dataset." Expressed as a directory hierarchy, a TileDB-VCF dataset has the following structure:
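A sketch of the layout, based on the description below (internal bookkeeping files omitted; exact contents vary by TileDB version):

```text
<dataset_uri>/                 # TileDB group (dataset root)
├── data/                      # 3D sparse array: variant data
│   └── __meta/                # general TileDB-VCF metadata (array metadata)
└── vcf_headers/               # 1D sparse array: original VCF headers
```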
The root of the dataset, `<dataset_uri>`, is a TileDB group. The `data` member is the TileDB 3D sparse array storing the variant data; it stores the general TileDB-VCF metadata as its array metadata in the folder `data/__meta`. The `vcf_headers` member is the TileDB 1D sparse array containing the VCF header data.
During array creation, there are several array-related parameters that the user can control, as sketched in the example below. These are:
- Array data tile capacity (default 10,000)
- The "anchor gap" size (default 1,000)
- The list of `INFO` and `FMT` fields to store as explicit array attributes (default: none)
Once chosen, these parameters cannot be changed.
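A sketch using the CLI; the flag names reflect recent versions of tiledbvcf, so check `tiledbvcf create --help` on your installation:

```sh
# Hypothetical example: set all three creation parameters explicitly
tiledbvcf create \
  --uri my_dataset \
  --tile-capacity 10000 \
  --anchor-gap 1000 \
  --attributes info_DP,fmt_GT
```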
During sample ingestion, the user can additionally specify the sample batch size (default 10).
The above parameters may impact read and write performance, as well as the size of the persisted array. Therefore, some care should be taken to determine good values for these parameters before ingesting a large amount of data into an array.
Indexed files are required for ingestion. If your VCF/BCF files have not been indexed, you can use bcftools or tabix to do so:
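For example (both produce an index that htslib can use; file names are placeholders):

```sh
bcftools index my_sample.vcf.gz    # creates a .csi index
tabix -p vcf my_sample.vcf.gz      # creates a .tbi index
```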
You can ingest samples into an already created dataset as follows:
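A minimal sketch (the dataset URI and file names are placeholders):

```sh
tiledbvcf store --uri my_dataset sample1.vcf.gz sample2.vcf.gz
```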
Just add a wildcard expression for the VCF file locations at the end of the store command:
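For example (the shell expands the wildcard into the list of matching files):

```sh
tiledbvcf store --uri my_dataset data/*.vcf.gz
```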
Alternatively, provide a text file with the absolute locations of the VCF files, one per line:
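A sketch, assuming the samples-file flag of recent CLI versions:

```sh
# samples.txt contains one absolute VCF/BCF URI per line
tiledbvcf store --uri my_dataset --samples-file samples.txt
```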
Incremental updates work in the same manner as the ingestion above; nothing special is needed. In addition, ingestion is thread- and process-safe and can therefore be performed in parallel.
The first step before ingesting any VCF samples is to create a dataset. This effectively creates a TileDB group and the appropriate empty arrays in it.
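For example, via the CLI (the dataset URI is a placeholder):

```sh
tiledbvcf create --uri my_dataset
```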
If you wish to turn some of the `INFO` and `FMT` fields into separate materialized attributes, you can do so as follows (names should be `fmt_X` or `info_X` for a field name `X`; case-sensitive).
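A sketch; the flag name reflects recent CLI versions and the chosen fields are examples:

```sh
# Materialize the FORMAT GT and INFO DP fields as explicit attributes
tiledbvcf create --uri my_dataset --attributes fmt_GT,info_DP
```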
Before slicing the data, you may wish to get some information about your dataset, such as the sample names, the attributes you can query, etc.
You can get the sample names as follows:
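For example, using the CLI's list command, which prints one sample name per line (the dataset URI is a placeholder):

```sh
tiledbvcf list --uri my_dataset
```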
You can get the attributes as follows:
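A sketch using the Python API; the `attributes()` method and its `attr_type` argument reflect recent tiledbvcf-py versions, so check your installed version:

```python
import tiledbvcf

ds = tiledbvcf.Dataset("my_dataset", mode="r")
print(ds.attributes())                  # all queryable attributes
print(ds.attributes(attr_type="info"))  # materialized INFO fields only
```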
You can rapidly read from a TileDB-VCF dataset by providing three main parameters (all optional), as shown in the Python sketch after this list:
- A subset of the samples
- A subset of the attributes
- One or more genomic ranges, either as strings in the format `chr:pos_range` or via a BED file
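A minimal sketch using the Python API (dataset URI, region strings, and sample names are placeholders):

```python
import tiledbvcf

ds = tiledbvcf.Dataset("my_dataset", mode="r")

# Returns a pandas DataFrame with one row per matching record/sample pair
df = ds.read(
    attrs=["sample_name", "contig", "pos_start", "fmt_GT"],
    regions=["chr1:1000-20000", "chr2:500-1000"],
    samples=["HG00096", "HG00097"],
)
print(df.head())
```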
By default the high-level APIs (i.e., Python and Spark) will also build TileDB-VCF itself automatically, bundling the resulting shared libraries into the final packages. However, the APIs can also be built separately.
Please ensure the following dependencies are installed on your system before building TileDB-VCF:
- CMake >= 3.3
- A C++ compiler supporting C++20 (such as GCC 10 or newer)
- git
- HTSlib 1.8
If HTSlib is not installed on your system, TileDB-VCF will download and build a local copy automatically. However, in order for this to work, the dependencies of HTSlib must be installed beforehand; these typically include autoconf, automake, and the zlib, bzip2, and lzma development libraries.
A pre-compiled HTSlib library will be downloaded automatically for Windows builds.
To install just the TileDB-VCF library and CLI, execute:
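A sketch of the typical CMake flow, following the repository's README (target names may change between releases):

```sh
git clone https://github.com/TileDB-Inc/TileDB-VCF.git
cd TileDB-VCF/libtiledbvcf
mkdir build && cd build
cmake ..
make -j4
make install-libtiledbvcf
```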
By default this will build and install TileDB-VCF into `TileDB-VCF/dist`.
You can verify that the installation succeeded by checking the version via the CLI:
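Assuming the default install prefix from above (the version subcommand prints the library version):

```sh
./dist/bin/tiledbvcf version
```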
The high-level API build processes perform essentially the same steps as these, including installing the library and CLI binary into `TileDB-VCF/dist/bin`. So, if you build one of the APIs, the steps above will be executed automatically for you.
Installing the Python module also requires `conda`. See the conda documentation for instructions on how to install it.
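A sketch of the typical flow, assuming you have already cloned the repository and have a working conda installation (the environment name and Python version are placeholders):

```sh
cd TileDB-VCF/apis/python
conda create -n tiledbvcf-py python=3.10
conda activate tiledbvcf-py
pip install .
```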
The last step may take up to 10 minutes, depending on your system, as it automatically builds the TileDB-VCF library, the Python bindings, and the `tiledbvcf` Python package. The package and bundled native libraries get installed into the active `conda` environment.
You can optionally run the Python unit tests as follows:
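For example, assuming pytest is installed in the same environment:

```sh
cd TileDB-VCF/apis/python
pytest
```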
To test that the package was installed correctly, run:
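For example (the version attribute is an assumption; any successful import indicates a working install):

```sh
python -c "import tiledbvcf; print(tiledbvcf.version)"
```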
The last step may take up to 10 minutes, depending on your system, as it automatically builds the TileDB-VCF library and the Spark API.
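For reference, the build itself is roughly the following (the Gradle task name follows the project's README and may change):

```sh
cd TileDB-VCF/apis/spark
./gradlew assemble
```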
You can optionally run the Spark unit tests as follows:
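For example:

```sh
cd TileDB-VCF/apis/spark
./gradlew test
```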
To build an uber `.jar`, which includes all dependencies, run:
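For example, using the `jar` task referenced later in this guide:

```sh
cd TileDB-VCF/apis/spark
./gradlew jar
```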
This will place the `.jar` file in the `build/libs/` directory. The `.jar` file also contains the bundled native libraries.
Creating a cluster may take 10–15 minutes. When it is ready, find the public DNS of the master node (it will be shown on the cluster Summary tab). SSH into the master node using the username `hadoop`, and then perform the following setup steps:
Next, follow the steps above for building the Spark API with Gradle. Make sure to run the `./gradlew jar` step to produce the TileDB-VCF jar.
Then launch the Spark shell, specifying the `.jar` and any other desired Spark configuration, e.g.:
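For example (the jar path and memory settings are placeholders):

```sh
spark-shell \
  --jars /path/to/TileDB-VCF-Spark.jar \
  --driver-memory 8g \
  --executor-memory 8g
```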
The TileDB-VCF Python API contains several points of integration with Dask for parallel computing on the TileDB-VCF arrays. Describing how to set up a Dask cluster is out of the scope of this guide. However, to quickly test on a local machine, run:
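A minimal local sketch; `read_dask()` and its arguments reflect the Python API's Dask integration in recent versions, so check your installed version:

```python
import tiledbvcf
from dask.distributed import Client

client = Client(processes=False)  # small in-process Dask cluster for testing

ds = tiledbvcf.Dataset("my_dataset", mode="r")
# Returns a dask DataFrame; compute() materializes it as pandas
ddf = ds.read_dask(
    attrs=["sample_name", "pos_start", "fmt_GT"],
    regions=["chr1:1000-20000"],
)
print(ddf.compute())
```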
The Spark API is implemented as a Spark data source and requires Spark 2.4.
Spark cluster management is outside the scope of this guide. However, you can learn more about launching an EMR Spark cluster in the AWS EMR documentation.