TileDB-VCF is an open-source C++ library for efficient storage and retrieval of genomic variant call data based on TileDB Open Source.
TileDB was first used to build a variant store system in 2015, as part of a collaboration between Intel Labs, Intel Health and Life Sciences, MIT and the Broad Institute. The resulting software, GenomicsDB, became part of Broad's GATK 4.0. TileDB-VCF is an evolution of that work, built on an improved and always up-to-date version of TileDB Open Source, incorporating new algorithms, features and optimizations. TileDB-VCF is the product of a collaboration between TileDB, Inc. and Helix.
TileDB-VCF offers several important benefits:
Performance: TileDB-VCF is optimized for rapidly slicing variant records by genomic regions across arbitrary numbers of samples. Additional features for ingesting VCF files and performing genomic interval intersections are all implemented in C++ for maximum speed.
Compressibility: TileDB-VCF can efficiently store any number of samples in a compressed and lossless manner. Being a columnar format, TileDB-VCF can compress different VCF fields with different compressors, choosing the most appropriate one depending on the data types and nature of the data per field.
Optimized for cloud object stores: Built on TileDB Open Source, TileDB-VCF inherits all its features out of the box, such as the speed and optimizations on numerous storage backends, including cloud object stores such as AWS S3, Azure Blob Storage and Google Cloud Storage.
Updatability: TileDB-VCF allows you to rapidly add new samples to existing datasets, eliminating the so-called N+1 problem and scaling both storage and update time linearly with the number of new samples.
Interoperability: In addition to a command-line interface, TileDB-VCF provides C++ and Python APIs, as well as integrations with Spark and Dask. This allows the user to utilize a plethora of tools from the Data Science and Machine Learning ecosystems in a seamless and natural fashion, without having to convert data from one format to another or use unfamiliar APIs.
Open-source: TileDB Open Source and TileDB-VCF are both open source and developed under the permissive MIT license. We are passionate about supporting the scientific community with open-source software and welcome feedback and contributions to further improve our projects.
Battle-tested: TileDB-VCF has been used in production by very large customers to manage immense amounts of variant data, on the order of hundreds of thousands of samples.
The documentation of TileDB-VCF includes information about the data model (and its mapping to 3D sparse arrays), installation instructions, How To guides and the API reference.
Indexed files are required for ingestion. If your VCF/BCF files have not been indexed, you can use a tool such as bcftools to do so:
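For example, using bcftools (the file name below is illustrative; the index is written next to the input file):

```bash
bcftools index sample1.vcf.gz
```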
You can ingest samples into an already created dataset as follows. Either add the VCF file locations (e.g., a wildcard expression) at the end of the store command, or provide a text file with the absolute locations of the VCF files, one per line:
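A minimal sketch of both forms, assuming a dataset named my_dataset and illustrative file paths:

```bash
# Pass the VCF locations (or a wildcard) directly to the store command
tiledbvcf store --uri my_dataset sample1.vcf.gz sample2.vcf.gz

# Or list one absolute VCF location per line in a text file
tiledbvcf store --uri my_dataset --samples-file /data/vcfs/samples.txt
```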
Incremental updates work in the same manner as the ingestion above; nothing special is needed. In addition, the ingestion is thread- and process-safe and can therefore be performed in parallel.
Before slicing the data, you may wish to get some information about your dataset, such as the sample names, the attributes you can query, etc.
You can get the sample names as follows:
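For example, using the CLI's list command (the dataset URI is illustrative):

```bash
tiledbvcf list --uri my_dataset
```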
You can get the attributes as follows:
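A sketch using the Python API's attributes() method described in the API reference below (the dataset URI is illustrative):

```python
import tiledbvcf

ds = tiledbvcf.Dataset("my_dataset", mode="r")
print(ds.attributes())                     # all queryable attributes
print(ds.attributes(attr_type="builtin"))  # only the built-in attributes
```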
You can rapidly read from a TileDB-VCF dataset by providing three main parameters (all optional), as sketched in the example below:
A subset of the samples
A subset of the attributes
One or more genomic ranges, either as strings in the format chr:pos_range or via a BED file
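A minimal Python sketch; the sample names, region strings, and attributes below are illustrative (fmt_GT assumes that field was materialized at dataset creation):

```python
import tiledbvcf

ds = tiledbvcf.Dataset("my_dataset", mode="r")
df = ds.read(
    attrs=["sample_name", "contig", "pos_start", "pos_end", "fmt_GT"],
    samples=["sampleA", "sampleB"],
    regions=["chr1:1000-20000", "chr2:500-1000"],
)
```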
The first step before ingesting any VCF samples is to create a dataset. This effectively creates a TileDB group and the appropriate empty arrays in it.
If you wish to turn some of the INFO and FMT fields into separate materialized attributes, you can do so as follows (names should be fmt_X or info_X for a field name X, case sensitive).
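For example, with the CLI (the dataset URI is illustrative, and fmt_GT/info_DP assume those fields exist in your VCFs):

```bash
tiledbvcf create --uri my_dataset --attributes fmt_GT,info_DP
```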
TileDB Open Source provides native support for reading from and writing to cloud object stores like AWS S3, Google Cloud Storage, and Microsoft Azure Blob Store. This guide will cover some considerations for using TileDB-VCF with these services. The examples will focus exclusively on S3, which is the most widely used, but note any of the aforementioned services can be substituted, as well as on-premise services like MinIO that provide S3-compatible APIs.
The process of creating a TileDB-VCF dataset on S3 is nearly identical to creating a local dataset; the only difference is that an s3:// URI is passed to the --uri argument rather than a local file path.
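For example (the bucket name is illustrative):

```bash
tiledbvcf create --uri s3://my-bucket/my_dataset
```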
This also works when querying a TileDB-VCF dataset located on S3.
VCF files located on S3 can be ingested directly into a TileDB-VCF dataset using one of two approaches.
The first approach is the easiest: simply pass the tiledbvcf store command a list of S3 URIs and TileDB-VCF takes care of the rest:
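A sketch with illustrative S3 URIs:

```bash
tiledbvcf store --uri s3://my-bucket/my_dataset \
  s3://my-bucket/vcfs/sample1.vcf.gz \
  s3://my-bucket/vcfs/sample2.vcf.gz
```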
In this approach, remote VCF index files (which are relatively tiny) are downloaded locally, allowing TileDB-VCF to retrieve chunks of variant data from the remote VCF files without having to download them in full. By default, index files are downloaded to your current working directory; however, you can choose to store them in a different location (e.g., a temporary directory) using the --scratch-dir argument.
The second approach is to download batches of VCF files in their entirety before ingestion, which may slightly improve ingestion performance. This approach requires providing TileDB-VCF with scratch disk space using the --scratch-mb and --scratch-dir arguments.
The number of VCF files downloaded at a time is determined by the --sample-batch-size parameter, which defaults to 10. Downloading and ingestion happen asynchronously, so, for example, batch 3 is downloaded while batch 2 is being ingested. As a result, you must configure enough scratch space to store at least two batches of samples (i.e., 20 samples with the default batch size of 10).
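A sketch of this second approach, with illustrative values (50 GB of scratch space in a local temporary directory):

```bash
tiledbvcf store --uri s3://my-bucket/my_dataset \
  --scratch-mb 51200 \
  --scratch-dir /tmp/tiledbvcf-scratch \
  --samples-file s3-samples.txt
```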
For TileDB to access a remote storage bucket you must be properly authenticated on the machine running TileDB. For S3, this means having access to the appropriate AWS access key ID and secret access key. This typically happens in one of three ways:
If the AWS Command Line Interface (CLI) is installed on your machine, running aws configure will store your credentials in a local profile that TileDB can access. You can verify the CLI has been previously configured by running:
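For example:

```bash
aws s3 ls
```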
If properly configured, this will output a list of the S3 buckets you (and thus TileDB) can access.
You can pass your AWS access key ID and secret access key to TileDB-VCF directly via the --tiledb-config argument, which expects a comma-separated string:
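A sketch using TileDB's S3 credential parameters (replace the angle-bracket placeholders with your own values):

```bash
tiledbvcf store --uri s3://my-bucket/my_dataset \
  --tiledb-config vfs.s3.aws_access_key_id=<ID>,vfs.s3.aws_secret_access_key=<KEY> \
  sample1.vcf.gz
```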
Your AWS credentials can also be passed to TileDB by defining the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables.
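For example:

```bash
export AWS_ACCESS_KEY_ID=<ID>
export AWS_SECRET_ACCESS_KEY=<KEY>
```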
This section includes useful guides about the usage of TileDB-VCF:
By default the high-level APIs (i.e., Python and Spark) will also build TileDB-VCF itself automatically, bundling the resulting shared libraries into the final packages. However, the APIs can also be built separately.
Please ensure the following dependencies are installed on your system before building TileDB-VCF:
CMake >= 3.3
C++ compiler supporting C++20 (such as gcc 10 or newer)
git
HTSlib 1.8
If HTSlib is not installed on your system, TileDB-VCF will download and build a local copy automatically. However, in order for this to work the following dependencies of HTSlib must be installed beforehand:
A pre-compiled HTSlib library will be downloaded automatically for Windows builds.
To install just the TileDB-VCF library and CLI, execute:
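A sketch of a typical out-of-source CMake build; the repository layout and the install target name are based on the current TileDB-VCF repository and may differ slightly between releases:

```bash
git clone https://github.com/TileDB-Inc/TileDB-VCF.git
cd TileDB-VCF/libtiledbvcf
mkdir build && cd build
cmake ..
make -j4
make install-libtiledbvcf
```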
By default this will build and install TileDB-VCF into TileDB-VCF/dist.
You can verify that the installation succeeded by checking the version via the CLI:
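For example (the exact flag is an assumption here and may vary between releases):

```bash
tiledbvcf --version
```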
The high-level API build processes perform essentially the same steps, including installing the library and CLI binary into TileDB-VCF/dist/bin. So, if you build one of the APIs, the steps above will be executed automatically for you.
Installing the Python module also requires conda. See the conda documentation for instructions on how to install it.
The last step may take up to 10 minutes, depending on your system, as it automatically builds the TileDB-VCF library, the Python bindings, and the tiledbvcf Python package. The package and bundled native libraries get installed into the active conda environment.
You can optionally run the Python unit tests as follows:
To test that the package was installed correctly, run:
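A minimal check, assuming the installed distribution is named tiledbvcf:

```bash
python -c "import tiledbvcf; import importlib.metadata as md; print(md.version('tiledbvcf'))"
```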
The last step may take up to 10 minutes, depending on your system, as it automatically builds the TileDB-VCF library and the Spark API.
You can optionally run the Spark unit tests as follows:
To build an uber .jar, which includes all dependencies, run:
This will place the .jar file in the build/libs/ directory. The .jar file also contains the bundled native libraries.
Creating a cluster may take 10–15 minutes. When it is ready, find the public DNS of the master node (it will be shown on the cluster Summary tab). SSH into the master node using the username hadoop, and then perform the following setup steps:
Next, follow the steps above for building the Spark API using Gradle. Make sure to run the ./gradlew jar step to produce the TileDB-VCF jar.
Then launch the Spark shell, specifying the .jar and any other desired Spark configuration, e.g.:
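For example (the jar path and file name are illustrative and depend on the version you built):

```bash
spark-shell --jars /path/to/TileDB-VCF/apis/spark/build/libs/TileDB-VCF-Spark.jar
```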
This is the API reference for TileDB-VCF:
The TileDB-VCF Python API contains several points of integration with Dask for parallel computing on the TileDB-VCF arrays. Describing how to set up a Dask cluster is out of the scope of this guide. However, to quickly test on a local machine, run:
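A minimal local test using Dask's distributed scheduler (this is plain Dask, not TileDB-VCF-specific):

```python
from dask.distributed import Client

# Starts a local scheduler and worker processes on this machine
client = Client()
print(client)
```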
The Spark API is implemented as a DataSourceV2 data source and requires Spark 2.4.
Spark cluster management is outside the scope of this guide; however, you can learn more about launching an EMR Spark cluster in the AWS documentation.
You can easily install TileDB-VCF via conda or use one of our Docker images.
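A sketch of the conda route; the package name tiledbvcf-py and the channels below are our best guess and may change between releases:

```bash
conda install -c conda-forge -c bioconda tiledbvcf-py
```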
Pre-built Docker images are available on Docker Hub:
tiledbvcf-cli for the CLI
tiledbvcf-py for the Python package
The following tags are available:
latest: the latest stable release (recommended)
dev: the development version
v0.x.x: a specific version
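A sketch of pulling and running the CLI image, assuming the images are published under the tiledb organization on Docker Hub:

```bash
docker pull tiledb/tiledbvcf-cli:latest
docker run --rm tiledb/tiledbvcf-cli --help
```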
You can use the TileDB-VCF CLI to losslessly export an ingested dataset back to VCF format for downstream analyses.
While these exports are lossless in terms of the actual data stored, they may not be identical to the original files. For example, fields within the INFO and FORMAT columns may appear in a slightly different order in the exported files.
To recreate all of the original (single-sample) VCF files, simply run the export command and set --output-format to v, for VCF.
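For example (the dataset URI and output directory are illustrative):

```bash
tiledbvcf export --uri my_dataset --output-format v --output-dir exported-vcfs/
```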
If bcftools is available on your system, you can use it to easily examine any of the exported files:
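For example (the exported file name is illustrative):

```bash
bcftools view exported-vcfs/sampleA.vcf | head
```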
The same mechanics covered earlier for reading (filtering records by sample and genomic region) also apply when exporting VCF files.
TileDB-VCF's Spark API offers a DataSourceV2 data source to read TileDB-VCF datasets into a Spark dataframe. To begin, launch a Spark shell with the TileDB-VCF jar.
Depending on the size of the dataset and query results, you may need to increase the configured memory from the defaults:
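A sketch with larger driver and executor memory (the jar path and memory values are illustrative):

```bash
spark-shell \
  --jars /path/to/TileDB-VCF-Spark.jar \
  --driver-memory 8g \
  --executor-memory 8g
```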
While the Spark API offers much of the same functionality provided by the CLI, the main considerations when using the Spark API are typically dataframe partitioning and memory overhead.
There are two ways to partition a TileDB-VCF dataframe, which can be used separately or together:
Partitioning over samples (.option("sample_partitions", N))
Partitioning over genomic regions (.option("range_partitions", M))
Conceptually, these correspond to partitioning over rows and columns, respectively, in the underlying TileDB array. For example, if you are reading a subset of 200 samples in a dataset, and specify 10 sample partitions, Spark will create 10 jobs, each of which will be responsible for handling the export from 20 of the 200 selected samples.
Similarly, if the provided BED file contains 1,000,000 regions, and you specify 10 region partitions, Spark will create 10 jobs, each of which will be responsible for handling the export of 100,000 of the 1,000,000 regions (across all samples).
These two partitioning parameters can be composed to form rectangular regions of work to distribute across the available Spark executors.
The CLI offers the same partitioning feature via the --sample-partition and --region-partition flags.
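For example, exporting only the first of ten sample partitions (the dataset URI is illustrative):

```bash
tiledbvcf export --uri my_dataset --output-format v --sample-partition 0:10
```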
Because TileDB-VCF is implemented as a native library, the Spark API makes use of both JVM on-heap and off-heap memory. This can make it challenging to correctly tune the memory consumption of each Spark job.
The .option("memory", mb)
option is used as a best-effort memory budget that any particular job is allowed to consume, across on-heap and off-heap allocations. Increasing the available memory with this option can result in large performance improvements, provided the executors are configured with sufficient memory to avoid OOM failures.
To export a few genomic regions from a dataset (which must be accessible by the Spark jobs; here we assume it is located on S3):
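A Scala sketch of such a query; the data source name (io.tiledb.vcf) and the ranges option name are assumptions based on the options shown elsewhere in this guide, and the URIs and regions are illustrative:

```scala
val df = spark.read
  .format("io.tiledb.vcf")                            // assumed data source name
  .option("uri", "s3://my-bucket/my_dataset")         // dataset location
  .option("ranges", "chr1:1000-20000,chr7:500-1000")  // two genomic regions
  .option("range_partitions", 2)                      // one Spark job per region
  .load()

df.show()
```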
Because there are only two regions specified, and two region partitions, each of the two Spark jobs started will read one of the given regions.
We can also place a list of sample names and a list of genomic regions to read in explicit text files, and use those during reading:
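A Scala sketch of that variant; the samplefile and bedfile option names are assumptions, while memory and sample_partitions are described above (all URIs are illustrative):

```scala
val df = spark.read
  .format("io.tiledb.vcf")                              // assumed data source name
  .option("uri", "s3://my-bucket/my_dataset")
  .option("samplefile", "s3://my-bucket/samples.txt")   // sample names, one per line
  .option("bedfile", "s3://my-bucket/regions.bed")      // genomic regions in BED format
  .option("memory", 8192)                               // 8 GB memory budget
  .option("sample_partitions", 10)                      // partition the selected samples
  .load()
```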
Here, we've also increased the memory budget to 8GB, and added partitioning over samples.
TileDB-VCF uses 3D sparse arrays to store genomic variant data. This section describes the technical implementation details about the underlying model, including array schemas, dimensions, tiling order, attributes, and metadata. It is recommended that you read the following section to fully understand the mapping between the TileDB-VCF data and TileDB arrays and groups:
A TileDB-VCF dataset is composed of a group of two separate TileDB arrays:
a 3D sparse array for the actual genomic variants and associated fields/attributes
a 1D sparse array for the metadata stored in each single-sample VCF header
The dimensions in the schema are:
The coordinates of the 3D array are the contig along the first dimension, the chromosomal start position of the variant along the second dimension, and the sample name along the third dimension.
For each field in a single-sample VCF record there is a corresponding attribute in the schema.
The info_* and fmt_* attributes allow individual INFO or FMT VCF fields to be extracted into explicit array attributes. This can be beneficial if your queries frequently access only a subset of the INFO or FMT fields, as no unrelated data then needs to be fetched from storage.
The choice of which fields to extract as explicit array attributes is user-configurable during array creation.
Any extra INFO or FORMAT fields not extracted as explicit array attributes are stored in the byte blob attributes info and fmt.
The following general metadata is stored in the dataset:

| Metadata field | Description |
| --- | --- |
| anchor_gap | Anchor gap value |
| extra_attributes | List of INFO or FMT field names that are stored as explicit array attributes |
| version | Array schema version |

These metadata values are updated during array creation, and are used during the export phase. The metadata is stored as "array metadata" in the sparse data array.
When ingesting samples, the sample header must be identical for all samples with respect to the contig mappings. That means all samples must have the exact same set of contigs listed in the VCF header. This requirement will be relaxed in future versions.
The vcf_headers array stores the original text of every ingested VCF header in order to:
ensure the original VCF file can be fully recovered for any given sample
reconstruct an htslib header instance when reading from the dataset, which is used for operations such as mapping a filter ID back to the filter string
To summarize, we've described three main entities:
The variant data array (3D sparse)
The general metadata, stored in the variant data array as metadata
The VCF header array (1D sparse)
All together we term this a "TileDB-VCF dataset." Expressed as a directory hierarchy, a TileDB-VCF dataset has the following structure:
The root of the dataset, <dataset_uri>, is a TileDB group. The data member is the TileDB 3D sparse array storing the variant data; this array stores the general TileDB-VCF metadata as its array metadata in folder data/__meta. The vcf_headers member is the TileDB 1D sparse array containing the VCF header data.
During array creation, there are several array-related parameters that the user can control. These are:
Array data tile capacity (default 10,000)
The "anchor gap" size (default 1,000)
The list of INFO and FMT fields to store as explicit array attributes (default is none)
Once chosen, these parameters cannot be changed.
During sample ingestion, the user can specify the sample batch size (default 10).
The above parameters may impact read and write performance, as well as the size of the persisted array. Therefore, some care should be taken to determine good values for these parameters before ingesting a large amount of data into an array.
Dimensions of the 3D variant data array:

| Dimension Name | TileDB Datatype | Corresponding VCF Field |
| --- | --- | --- |
| contig | TILEDB_STRING_ASCII | CHR |
| start_pos | uint32_t | VCF POS, plus TileDB anchors |
| sample | TILEDB_STRING_ASCII | Sample name |

Attributes of the 3D variant data array:

| Attribute Name | TileDB Datatype | Description |
| --- | --- | --- |
| end_pos | uint32_t | VCF END position of VCF records |
| qual | float | VCF QUAL field |
| alleles | var<char> | CSV list of REF and ALT VCF fields |
| id | var<char> | VCF ID field |
| filter_ids | var<int32_t> | Vector of integer IDs of entries in the FILTER VCF field |
| real_start_pos | uint32_t | VCF POS (no anchors) |
| info | var<uint8_t> | Byte blob containing any INFO fields that are not stored as explicit attributes |
| fmt | var<uint8_t> | Byte blob containing any FMT fields that are not stored as explicit attributes |
| info_* | var<uint8_t> | One or more attributes storing specific VCF INFO fields, e.g. info_DP, info_MQ, etc. |
| fmt_* | var<uint8_t> | One or more attributes storing specific VCF FORMAT fields, e.g. fmt_GT, fmt_MIN_DP, etc. |
| info_TILEDB_IAF | var<float> | Computed allele frequency |
Schema parameters of the 1D VCF header array:

| Parameter | Value |
| --- | --- |
| Array type | Sparse |
| Rank | 1D |
| Cell order | Row-major |
| Tile order | Row-major |

Dimensions of the 1D VCF header array:

| Dimension Name | TileDB Datatype | Description |
| --- | --- | --- |
| sample | TILEDB_STRING_ASCII | Sample name |

Attributes of the 1D VCF header array:

| Attribute Name | TileDB Datatype | Description |
| --- | --- | --- |
| header | var<char> | Original text of the VCF header |
The tiledbvcf CLI provides the following commands:

| Command | Description |
| --- | --- |
| create | Create a new dataset |
| store | Store the specified VCF files in a dataset |
| export | Export data from a dataset |
| list | List all sample names present in a dataset |
| stat | Print high-level statistics about a dataset |
| utils | Utility functions for a dataset |
Create an empty TileDB-VCF dataset.
| Flag | Description |
| --- | --- |
| -u, --uri | TileDB dataset URI. |
| -a, --attributes | INFO or FORMAT field names (comma-delimited) to store as separate attributes. Names should be fmt_X or info_X for a field name X (case sensitive). |
| -c, --tile-capacity | Tile capacity to use for the array schema [default 10000]. |
| -g, --anchor-gap | Anchor gap size to use [default 1000]. |
| --tiledb-config | CSV string of the format 'param1=val1,param2=val2...' specifying optional TileDB configuration parameter settings. |
| --checksum | Checksum to use for dataset validation on reads and writes [default "sha256"]. |
| -n, --no-duplicates | Do not allow records with duplicate end positions to be written to the array. |
Ingests registered samples into a TileDB-VCF dataset.
| Flag | Description |
| --- | --- |
| -u, --uri | TileDB dataset URI. |
| -t, --threads | Number of threads [default 16]. |
| -p, --s3-part-size | [S3 only] Part size to use for writes (MB) [default 50]. |
| -d, --scratch-dir | Directory used for local storage of downloaded remote samples. |
| -s, --scratch-mb | Amount of local storage (in MB) allocated for downloading remote VCF files prior to ingestion [default 0]. You must configure enough scratch space to hold at least two sample batches (2 × --sample-batch-size, which defaults to 10, i.e. at least 20 samples). See the data model section for details. |
| -n, --max-record-buff | Max number of BCF records to buffer per file [default 50000]. |
| -k, --thread-task-size | Max length (number of columns) of an ingestion task. Affects load balancing of ingestion work across threads, and total memory consumption [default 5000000]. |
| -b, --mem-budget-mb | The total memory budget (MB) used when submitting TileDB queries [default 1024]. |
| -v, --verbose | Enable verbose output. |
| --tiledb-config | CSV string of the format 'param1=val1,param2=val2...' specifying optional TileDB configuration parameter settings. |
| -f, --samples-file | File with one VCF path to be ingested per line. Each line can also include an explicit index path, in the format <vcf-uri><TAB><index-uri>. |
| --remove-sample-file | If specified, the samples file (-f argument) is deleted after successful ingestion. |
| -e, --sample-batch-size | Number of samples per batch for ingestion [default 10]. |
| --stats | Enable TileDB stats. |
| --stats-vcf-header-array | Enable TileDB stats for VCF header array usage. |
| --resume | Resume incomplete ingestion of a sample batch. |
Exports data from a TileDB-VCF dataset.
| Flag | Description |
| --- | --- |
| -u, --uri | TileDB dataset URI. |
| -O, --output-format | Export format. Options are: b: bcf (compressed); u: bcf; z: vcf.gz; v: vcf; t: TSV [default b]. |
| -o, --output-path | [TSV export only] The name of the output TSV file. |
| -t, --tsv-fields | [TSV export only] An ordered CSV list of fields to export in the TSV. A field name can be one of SAMPLE, ID, REF, ALT, QUAL, POS, CHR, FILTER. Additionally, INFO fields can be specified with an I: prefix and FMT fields with an S: prefix. To export the intersecting query region for each row in the output, use the field names Q:POS, Q:END, or Q:LINE. |
| -r, --regions | CSV list of regions to export in the format chr:min-max. |
| -R, --regions-file | File containing regions (BED format). |
| --sorted | Do not sort regions or regions file if they are pre-sorted. |
| -n, --limit | Only export the first N intersecting records. |
| -d, --output-dir | Directory used for local output of exported samples. |
| --sample-partition | Partitions the list of samples to be exported and causes this export to process only a specific partition of them. Specify in the format I:N, where I is the partition index and N is the total number of partitions. Useful for batch exports. |
| --region-partition | Partitions the list of regions to be exported and causes this export to process only a specific partition of them. Specify in the format I:N, where I is the partition index and N is the total number of partitions. Useful for batch exports. |
| --upload-dir | If set, all output files from the export process will be copied to the given directory (or S3 prefix) upon completion. |
| --tiledb-config | CSV string of the format 'param1=val1,param2=val2...' specifying optional TileDB configuration parameter settings. |
| -v, --verbose | Enable verbose output. |
| -c, --count-only | Don't write output files; only print the count of the resulting number of intersecting records. |
| -b, --mem-budget-mb | The memory budget (MB) used when submitting TileDB queries [default 2048]. |
| --mem-budget-buffer-percentage | The percentage of the memory budget to use for TileDB query buffers [default 25]. |
| --mem-budget-tile-cache-percentage | The percentage of the memory budget to use for the TileDB tile cache [default 10]. |
| -f, --samples-file | File with one VCF path to be registered per line. Each line can also include an explicit index path, in the format <vcf-uri><TAB><index-uri>. |
| --stats | Enable TileDB stats. |
| --stats-vcf-header-array | Enable TileDB stats for VCF header array usage. |
| --disable-check-samples | Disable the check that all requested samples exist in the dataset before executing the query (by default, an error is raised if any requested sample is not in the dataset). |
| --disable-progress-estimation | Disable progress estimation in verbose mode. Progress estimation can sometimes cause a performance impact. |
Lists all sample names present in a TileDB-VCF dataset.
| Flag | Description |
| --- | --- |
| -u, --uri | TileDB dataset URI. |
| --tiledb-config | CSV string of the format 'param1=val1,param2=val2...' specifying optional TileDB configuration parameter settings. |
Prints high-level statistics about a TileDB-VCF dataset.
| Flag | Description |
| --- | --- |
| -u, --uri | TileDB dataset URI. |
| --tiledb-config | CSV string of the format 'param1=val1,param2=val2...' specifying optional TileDB configuration parameter settings. |
Utils for working with a TileDB-VCF dataset, such as consolidating and vacuuming fragments or fragment metadata.

| Flag | Description |
| --- | --- |
| -u, --uri | TileDB dataset URI. |
| --tiledb-config | CSV string of the format 'param1=val1,param2=val2...' specifying optional TileDB configuration parameter settings. |
tiledbvcf.dataset
This is the main Python module.
Dataset
Representation of the grouped TileDB arrays that constitute a TileDB-VCF dataset, which includes a sparse 3D array containing the actual variant data and a sparse 1D array containing various sample metadata and the VCF header lines. Read more about the data model here.
uri: URI of the TileDB-VCF dataset
mode: (default 'r') Open the array object in read ('r') or write ('w') mode
cfg: TileDB-VCF configuration (optional)
stats: (bool) Enable or disable TileDB stats
verbose: (bool) Enable or disable TileDB-VCF verbose output
create_dataset()
Create a new TileDB-VCF dataset.
extra_attrs: (list of str) List of extra attributes to materialize from the FMT or INFO fields. Names should be fmt_X or info_X for a field name X (case sensitive).
tile_capacity: (int) Tile capacity to use for the array schema (default 10000)
anchor_gap: (int) Length of gaps between inserted anchor records, in bases (default 1000)
checksum_type: (str) Optional override of the checksum type for the new dataset (valid values are 'sha256', 'md5' or None)
allow_duplicates: (bool) Controls whether records with duplicate end positions can be written to the dataset
ingest_samples()
Ingest samples into an existing TileDB-VCF dataset.
sample_uris: (list of str) List of VCF/BCF sample URIs to ingest
threads: (int) Number of threads used for ingestion
thread_task_size: (int) Max length (number of columns) of an ingestion task (affects load balancing of ingestion work across threads and total memory consumption)
memory_budget_mb: (int) Max size (MB) of TileDB buffers before flushing (default 1024)
scratch_space_path: (str) Directory used for local storage of downloaded remote samples
scratch_space_size: (int) Amount of local storage (MB) that can be used for downloading remote samples
sample_batch_size: (int) Number of samples per batch for ingestion (default 10)
record_limit: Limit the number of VCF records read into memory per file (default 50000)
resume: (bool) Whether to check for and attempt to resume a partially completed ingestion
read() / read_arrow()
Reads data from a TileDB-VCF dataset into a Pandas DataFrame (with read()) or a PyArrow Array (with read_arrow()).
attrs: (list of str) List of attributes to extract. Can include attributes from the VCF INFO and FORMAT fields (prefixed with info_ and fmt_, respectively) as well as any of the builtin attributes:
sample_name
id
contig
alleles
filters
pos_start
pos_end
qual
query_bed_end
query_bed_start
query_bed_line
samples: (list of str) List of sample names to be read
regions: (list of str) List of genomic regions to be read
samples_file: (str) URI of a file containing sample names to be read, one per line
bed_file: (str) URI of a BED file of genomic regions to be read
skip_check_samples: (bool) Skip the check that all requested samples exist in the array
disable_progress_estimation: (bool) Skip progress estimation in verbose mode (estimating progress can have performance or memory impacts in some cases)
For large datasets, a call to read() may not be able to fit all results in memory. In that case, the returned dataframe will contain as many results as possible; to retrieve the rest of the results, use the continue_read() function.
You can also use the Python generator version, read_iter().
Returns: a Pandas DataFrame containing the results.
read_completed()
A read is considered complete if the resulting dataframe contained all results.
Returns: (bool) True if the previous read operation was complete.
count()
Counts data in a TileDB-VCF dataset.
samples: (list of str) List of sample names to include in the count
regions: (list of str) List of genomic regions to include in the count
Returns: the number of intersecting records in the dataset
attributes()
List queryable attributes available in the VCF dataset
attr_type: (str) The subset of attributes to retrieve; "info" or "fmt" will only retrieve attributes ingested from the VCF INFO and FORMAT fields, respectively, "builtin" retrieves the static attributes defined in TileDB-VCF's schema, and "all" (the default) returns all queryable attributes
Returns: a list of strings representing the attribute names
tiledbvcf.ReadConfig
Set various configuration parameters.
Parameters:
limit: Max number of records (rows) to read
region_partition: Region partition tuple (idx, num_partitions)
sample_partition: Sample partition tuple (idx, num_partitions)
sort_regions: Whether or not to sort the regions to be read (default True)
memory_budget_mb: Memory budget (MB) for buffer and internal allocations (default 2048)
buffer_percentage: Percentage of memory to dedicate to TileDB query buffers (default 25)
tiledb_tile_cache_percentage: Percentage of memory to dedicate to the TileDB tile cache (default 10)
tiledb_config: List of strings in the format "option=value" (see the TileDB documentation for the full list of configuration parameters)
tiledbvcf.dask
This module is for the TileDB-VCF integration with Dask.
read_dask()
Reads data from a TileDB-VCF dataset into a Dask DataFrame.
attrs: (list of str) List of attribute names to be read
region_partitions: (int) Number of partitions over regions
sample_partitions: (int) Number of partitions over samples
samples: (list of str) List of sample names to be read
regions: (list of str) List of genomic regions to be read
samples_file: (str) URI of a file containing sample names to be read, one per line
bed_file: (str) URI of a BED file of genomic regions to be read
Partitioning proceeds by a straightforward block distribution, parameterized by the total number of partitions and the partition index that a given read operation is responsible for.
Both region and sample partitioning can be used together.
Returns: a Dask DataFrame with the results
map_dask()
Maps a function on a Dask DataFrame obtained by reading from the dataset.
fnc: (function) Function applied to each partition
attrs: (list of str) List of attribute names to be read
region_partitions: (int) Number of partitions over regions
sample_partitions: (int) Number of partitions over samples
samples: (list of str) List of sample names to be read
regions: (list of str) List of genomic regions to be read
samples_file: (str) URI of a file containing sample names to be read, one per line
bed_file: (str) URI of a BED file of genomic regions to be read
May be more efficient in some cases than read_dask() followed by a regular Dask map operation.
Returns: a Dask DataFrame with the results
Schema parameters of the 3D variant data array:

| Parameter | Value |
| --- | --- |
| Array type | Sparse |
| Rank | 3D |
| Cell order | Row-major |
| Tile order | Row-major |
Unlike TileDB-VCF's CLI, which exports directly to disk, results for queries performed using Python are read into memory. Therefore, when querying even moderately sized genomic datasets, the amount of available memory must be taken into consideration.
This guide demonstrates several of the TileDB-VCF features for overcoming memory limitations when querying large datasets.
One strategy for accommodating large queries is to simply increase the amount of memory available to tiledbvcf. By default, tiledbvcf allocates 2GB of memory for queries; this value can be adjusted using the memory_budget_mb parameter. For the purposes of this tutorial, the budget will be decreased to demonstrate how tiledbvcf can perform genome-scale queries even in a memory-constrained environment.
For queries that encompass many genomic regions, you can simply provide an external BED file. In this example, you will query for any variants located in the promoter regions of known genes on chromosomes 1-4.
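A sketch of both points: a deliberately small memory budget set through ReadConfig, and a query driven by a BED file (the dataset URI, budget, and BED path are illustrative):

```python
import tiledbvcf

# Reduce the memory budget from the 2GB default to 512MB
cfg = tiledbvcf.ReadConfig(memory_budget_mb=512)
ds = tiledbvcf.Dataset("my_dataset", mode="r", cfg=cfg)

df = ds.read(
    attrs=["sample_name", "contig", "pos_start", "pos_end"],
    bed_file="promoters-chr1-4.bed",
)
```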
After performing a query, you can use read_completed() to verify whether or not all results were successfully returned.
In this case, it returned False, indicating the requested data was too large to fit into the allocated memory, so tiledbvcf retrieved as many records as possible in this first batch. The remaining records can be retrieved using continue_read(). Here, we've set up our code to accommodate the possibility that the full set of results is split across multiple batches.
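Continuing the sketch above, the batched-read pattern looks like this:

```python
import pandas as pd

batches = [df]                      # first batch returned by read()
while not ds.read_completed():      # keep going until all results are in
    batches.append(ds.continue_read())
full_df = pd.concat(batches, ignore_index=True)
```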
Here is the final dataframe, which includes 3,808,687 records:
A Python generator version of the read method is also provided. This pattern provides a powerful interface for batch processing variant data.
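A sketch of the generator pattern (the attributes and BED path are illustrative):

```python
for batch in ds.read_iter(attrs=["sample_name", "pos_start"],
                          bed_file="promoters-chr1-4.bed"):
    process(batch)   # process() is a placeholder for your own per-batch logic
```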
The tiledbvcf Python package includes integration with Dask to enable distributing large queries across node clusters.
You can use the tiledbvcf package's Dask integration to partition read operations across regions and samples. The partitioning semantics are identical to those used by the CLI and Spark.
The result is a Dask dataframe (rather than a Pandas dataframe). We're using a local machine for simplicity but the API works on any Dask cluster.
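A sketch, assuming read_dask() is exposed on the Dataset object as described in the API reference (partition counts and paths are illustrative):

```python
ddf = ds.read_dask(
    attrs=["sample_name", "contig", "pos_start", "pos_end"],
    bed_file="very-large-bedfile.bed",
    region_partitions=8,
    sample_partitions=2,
)
print(ddf.compute())   # materialize the Dask dataframe as a Pandas dataframe
```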
If you plan to filter the results in a Dask dataframe, it may be more efficient to use map_dask() rather than read_dask(). The map_dask() function takes an additional parameter, fnc, allowing you to provide a filtering function that is applied immediately after performing the read, but before inserting the result of the partition into the Dask dataframe.
In the following example, any variants overlapping regions in very-large-bedfile.bed are filtered out if their start position falls within the first 25 kb of the chromosome.
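A sketch of that example, again assuming map_dask() is exposed on the Dataset object; the filter keeps only records whose start position is at or beyond 25 kb:

```python
def drop_first_25kb(df):
    # Applied to each partition immediately after it is read
    return df[df["pos_start"] >= 25000]

ddf = ds.map_dask(
    drop_first_25kb,
    attrs=["sample_name", "contig", "pos_start", "pos_end"],
    bed_file="very-large-bedfile.bed",
    region_partitions=8,
)
```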
This approach can be more efficient than using read_dask() with a separate filtering step because it avoids the possibility that partitions require multiple read operations due to memory constraints.
The pseudocode describing the read_partition() algorithm (i.e., the code responsible for reading the partition on a Dask worker) is:
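A Python-style sketch of that algorithm (illustrative, not the actual implementation):

```python
import pandas as pd

def read_partition(ds, partition_config):
    # Read as much of the partition as fits in the memory budget
    df = ds.read(**partition_config)
    # If the partition did not fit, keep reading and accumulating batches
    while not ds.read_completed():
        df = pd.concat([df, ds.continue_read()])
    return df
```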
When using map_dask() instead, the pseudocode becomes:
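A corresponding sketch (continuing the one above) where the user-provided filter is applied to every batch before it accumulates in memory:

```python
def read_partition_mapped(ds, partition_config, filter_fnc):
    # Filtering each batch as soon as it is read shrinks what must be held in memory
    df = filter_fnc(ds.read(**partition_config))
    while not ds.read_completed():
        df = pd.concat([df, filter_fnc(ds.continue_read())])
    return df
```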
You can see that if the provided filter_fnc() reduces the size of the data substantially, using map_dask() can reduce the likelihood that the Dask workers will run out of memory and avoid the need to perform multiple reads.