The first step before ingesting any VCF samples is to create a dataset. This effectively creates a TileDB group and the appropriate empty arrays in it.
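For example, creating a dataset with the CLI looks something like this (my_dataset is a placeholder URI):

```bash
# Create a new, empty TileDB-VCF dataset (a TileDB group containing the underlying arrays)
tiledbvcf create --uri my_dataset
```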
If you wish to turn some of the INFO and FMT fields into separate materialized attributes, you can do so as follows (names should be fmt_X or info_X for a field name X, case sensitive).
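A sketch, assuming the create command accepts an --attributes flag with a comma-separated list of fields to materialize (check tiledbvcf create --help for the exact flag name):

```bash
# Materialize the DP info field and the GT format field as standalone attributes
tiledbvcf create --uri my_dataset --attributes info_DP,fmt_GT
```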
TileDB Open Source provides native support for reading from and writing to cloud object stores like AWS S3, Google Cloud Storage, and Microsoft Azure Blob Storage. This guide covers some considerations for using TileDB-VCF with these services. The examples focus exclusively on S3, which is the most widely used, but any of the aforementioned services can be substituted, as can on-premises services like MinIO that provide S3-compatible APIs.
The process of creating a TileDB-VCF dataset on S3 is nearly identical to creating a local dataset. The only difference is that an s3:// URI is passed to the --uri argument rather than a local file path.
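For example (the bucket and dataset names are placeholders):

```bash
# Create the dataset directly on S3; only the URI changes
tiledbvcf create --uri s3://my-bucket/my_dataset
```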
This also works when querying a TileDB-VCF dataset located on S3.
VCF files located on S3 can be ingested directly into a TileDB-VCF dataset using one of two approaches.
The first approach is the easiest: you simply pass the tiledbvcf store command a list of S3 URIs and TileDB-VCF takes care of the rest:
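(The bucket and file names below are placeholders.)

```bash
# Ingest two samples directly from S3; only the small index files are fetched locally
tiledbvcf store --uri s3://my-bucket/my_dataset \
  s3://my-bucket/vcfs/sample1.vcf.gz \
  s3://my-bucket/vcfs/sample2.vcf.gz
```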
In this approach, remote VCF index files (which are relatively tiny) are downloaded locally, allowing TileDB-VCF to retrieve chunks of variant data from the remote VCF files without having to download them in full. By default, index files are downloaded to your current working directory; however, you can choose to store them in a different location (e.g., a temporary directory) using the --scratch-dir argument.
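For example (the scratch path is a placeholder):

```bash
# Download the VCF index files to a temporary directory instead of the working directory
tiledbvcf store --uri s3://my-bucket/my_dataset \
  --scratch-dir /tmp/tiledbvcf-scratch \
  s3://my-bucket/vcfs/sample1.vcf.gz
```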
The second approach is to download batches of VCF files in their entirety before ingestion, which may slightly improve ingestion performance. This approach requires allocating scratch disk space to TileDB-VCF using the --scratch-mb and --scratch-dir arguments.
The number of VCF files downloaded at a time is determined by the --sample-batch-size parameter, which defaults to 10. Downloading and ingestion happen asynchronously, so, for example, batch 3 is downloaded while batch 2 is being ingested. As a result, you must configure enough scratch space to store at least 20 samples, assuming a batch size of 10.
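A sketch of the second approach (the sizes, paths, and file names are illustrative):

```bash
# Allocate 10 GB of scratch space and download samples in batches of 10 before ingesting
tiledbvcf store --uri s3://my-bucket/my_dataset \
  --scratch-mb 10240 \
  --scratch-dir /tmp/tiledbvcf-scratch \
  --sample-batch-size 10 \
  s3://my-bucket/vcfs/sample1.vcf.gz s3://my-bucket/vcfs/sample2.vcf.gz
```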
For TileDB to access a remote storage bucket, you must be properly authenticated on the machine running TileDB. For S3, this means having access to the appropriate AWS access key ID and secret access key. This typically happens in one of three ways:
If the AWS Command Line Interface (CLI) is installed on your machine, running aws configure will store your credentials in a local profile that TileDB can access. You can verify the CLI has been previously configured by running:
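(No TileDB-specific flags are needed here; this is the standard AWS CLI listing command.)

```bash
# Prints the S3 buckets your configured credentials can access
aws s3 ls
```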
If properly configured, this will output a list of the S3 buckets you (and thus TileDB) can access.
You can pass your AWS access key ID and secret access key to TileDB-VCF directly via the --tiledb-config argument, which expects a comma-separated string:
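(The placeholders below stand in for your real credentials; avoid hard-coding credentials in shared scripts.)

```bash
# Pass the TileDB S3 credential parameters as a comma-separated key=value string
tiledbvcf store --uri s3://my-bucket/my_dataset \
  --tiledb-config "vfs.s3.aws_access_key_id=<your-access-key-id>,vfs.s3.aws_secret_access_key=<your-secret-access-key>" \
  s3://my-bucket/vcfs/sample1.vcf.gz
```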
Your AWS credentials can also be passed to TileDB by defining the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables.
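For example:

```bash
export AWS_ACCESS_KEY_ID="<your-access-key-id>"
export AWS_SECRET_ACCESS_KEY="<your-secret-access-key>"
```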
Before slicing the data, you may wish to get some information about your dataset, such as the sample names, the attributes you can query, etc.
You can get the sample names as follows:
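(A sketch using the Python API; the samples() helper on Dataset is an assumption, and my_dataset is a placeholder URI.)

```python
import tiledbvcf

ds = tiledbvcf.Dataset("my_dataset", mode="r")
print(ds.samples())  # assumed helper returning the list of ingested sample names
```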
You can get the attributes as follows:
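(Again a sketch; the attributes() helper is an assumption.)

```python
import tiledbvcf

ds = tiledbvcf.Dataset("my_dataset", mode="r")
print(ds.attributes())  # assumed helper listing the attributes available for queries
```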
You can rapidly read from a TileDB-VCF dataset by providing three main parameters (all optional), as shown in the example after this list:
- A subset of the samples
- A subset of the attributes
- One or more genomic ranges
  - Either as strings in the format chr:pos_range
  - Or via a BED file
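A sketch with the Python API (the attribute names, sample names, and regions are illustrative):

```python
import tiledbvcf

ds = tiledbvcf.Dataset("my_dataset", mode="r")
df = ds.read(
    attrs=["sample_name", "pos_start", "fmt_GT"],  # subset of attributes
    samples=["sample1", "sample2"],                # subset of samples
    regions=["chr1:1000-20000", "chr2:500-1000"],  # genomic ranges as chr:pos_range strings
)
```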
TileDB-VCF's Spark API offers a DataSourceV2 data source to read TileDB-VCF datasets into a Spark dataframe. To begin, launch a Spark shell with:
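(The jar path below is a placeholder for wherever the TileDB-VCF Spark jar lives on your system; these sketches use PySpark, but the Scala spark-shell works the same way.)

```bash
pyspark --jars /path/to/tiledb-vcf-spark.jar
```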
Depending on the size of the dataset and query results, you may need to increase the configured memory from the defaults:
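(The memory values below are illustrative; size them to your cluster and query.)

```bash
pyspark --jars /path/to/tiledb-vcf-spark.jar \
  --driver-memory 8g \
  --executor-memory 16g
```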
While the Spark API offers much of the same functionality provided by the CLI, the main considerations when using the Spark API are typically dataframe partitioning and memory overhead.
There are two ways to partition a TileDB-VCF dataframe, which can be used separately or together:
- Partitioning over samples (.option("sample_partitions", N)).
- Partitioning over genomic regions (.option("range_partitions", M)).
Conceptually, these correspond to partitioning over rows and columns, respectively, in the underlying TileDB array. For example, if you are reading a subset of 200 samples in a dataset, and specify 10 sample partitions, Spark will create 10 jobs, each of which will be responsible for handling the export from 20 of the 200 selected samples.
Similarly, if the provided BED file contains 1,000,000 regions, and you specify 10 region partitions, Spark will create 10 jobs, each of which will be responsible for handling the export of 100,000 of the 1,000,000 regions (across all samples).
These two partitioning parameters can be composed to form rectangular regions of work to distribute across the available Spark executors.
The CLI offers the same partitioning feature via the --sample-partition and --region-partition flags.
Because TileDB-VCF is implemented as a native library, the Spark API makes use of both JVM on-heap and off-heap memory. This can make it challenging to correctly tune the memory consumption of each Spark job.
The .option("memory", mb)
option is used as a best-effort memory budget that any particular job is allowed to consume, across on-heap and off-heap allocations. Increasing the available memory with this option can result in large performance improvements, provided the executors are configured with sufficient memory to avoid OOM failures.
To export a few genomic regions from a dataset (which must be accessible by the Spark jobs; here we assume it is located on S3):
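(A PySpark sketch: the io.tiledb.vcf data source name and the uri/ranges option names are assumptions; range_partitions is the option described above, and the dataset URI is a placeholder.)

```python
# Read two genomic regions with two region partitions, so each Spark job handles one region
df = (
    spark.read.format("io.tiledb.vcf")                  # assumed data source name
    .option("uri", "s3://my-bucket/my_dataset")         # assumed option for the dataset URI
    .option("ranges", "chr1:1000-20000,chr2:500-1000")  # assumed option for the two regions
    .option("range_partitions", 2)
    .load()
)
df.show()
```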
Because there are only two regions specified, and two region partitions, each of the two Spark jobs started will read one of the given regions.
We can also place a list of sample names and a list of genomic regions to read in explicit text files, and use those during reading:
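(Another sketch: the samplefile and bedfile option names are assumptions, and the file URIs are placeholders; memory and sample_partitions are the options described above.)

```python
df = (
    spark.read.format("io.tiledb.vcf")                        # assumed data source name
    .option("uri", "s3://my-bucket/my_dataset")
    .option("samplefile", "s3://my-bucket/sample-names.txt")  # assumed option: text file of sample names
    .option("bedfile", "s3://my-bucket/regions.bed")          # assumed option: BED file of regions
    .option("memory", 8192)                                   # 8 GB memory budget per job
    .option("range_partitions", 10)
    .option("sample_partitions", 4)
    .load()
)
df.count()
```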
Here, we've also increased the memory budget to 8GB, and added partitioning over samples.
You can use the TileDB-VCF CLI to losslessly export an ingested dataset back to VCF format for downstream analyses.
While these exports are lossless in terms of the actual data stored, they may not be identical to the original files. For example, fields within the INFO and FORMAT columns may appear in a slightly different order in the exported files.
To recreate all of the original (single-sample) VCF files, simply run the export command and set --output-format to v, for VCF.
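For example (the output-directory flag name is an assumption; check tiledbvcf export --help for the exact name):

```bash
# Export every ingested sample back to single-sample VCF files
tiledbvcf export --uri my_dataset \
  --output-format v \
  --output-dir exported-vcfs   # assumed flag for the destination directory
```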
If bcftools is available on your system, you can use it to easily examine any of the exported files:
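(The exported file name below is a placeholder.)

```bash
# Print the header and first few records of one exported VCF
bcftools view exported-vcfs/sample1.vcf | head -n 40
```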
The same mechanics for filtering records by sample and genomic region also apply to exporting VCF files.
Unlike TileDB-VCF's CLI, which exports directly to disk, results for queries performed using Python are read into memory. Therefore, when querying even moderately sized genomic datasets, the amount of available memory must be taken into consideration.
This guide demonstrates several of the TileDB-VCF features for overcoming memory limitations when querying large datasets.
One strategy for accommodating large queries is to simply increase the amount of memory available to tiledbvcf. By default, tiledbvcf allocates 2GB of memory for queries. However, this value can be adjusted using the memory_budget_mb parameter. For the purposes of this tutorial, the budget will be decreased to demonstrate how tiledbvcf can perform genome-scale queries even in a memory-constrained environment.
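A sketch of lowering the budget via the Python API's read configuration (the URI and budget value are illustrative):

```python
import tiledbvcf

# Restrict the read budget to 512 MB to mimic a memory-constrained environment
cfg = tiledbvcf.ReadConfig(memory_budget_mb=512)
ds = tiledbvcf.Dataset("my_dataset", mode="r", cfg=cfg)
```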
For queries that encompass many genomic regions, you can simply provide an external BED file. In this example, you will query for any variants located in the promoter regions of known genes on chromosomes 1-4.
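A sketch (the BED file name and attribute list are placeholders):

```python
# Query all variants intersecting the regions listed in the BED file
df = ds.read(
    attrs=["sample_name", "contig", "pos_start", "fmt_GT"],
    bed_file="gene-promoters-chr1-4.bed",
)
```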
After performing a query, you can use read_completed() to verify whether or not all results were successfully returned.
In this case, it returned False, indicating the requested data was too large to fit into the allocated memory, so tiledbvcf retrieved as many records as possible in this first batch. The remaining records can be retrieved using continue_read(). Here, we've set up our code to accommodate the possibility that the full set of results is split across multiple batches, as in the sketch below.
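(The df and ds objects carry over from the query above.)

```python
import pandas as pd

batches = [df]  # first batch returned by ds.read(...)
while not ds.read_completed():
    # Fetch the next batch of records that did not fit within the memory budget
    batches.append(ds.continue_read())
df = pd.concat(batches, ignore_index=True)
```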
Here is the final dataframe, which includes 3,808,687 records:
A Python generator version of the read method is also provided. This pattern provides a powerful interface for batch processing variant data.
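A sketch, assuming the generator is exposed as read_iter() with the same arguments as read():

```python
# Process the results one batch at a time instead of concatenating them in memory
for batch in ds.read_iter(attrs=["sample_name", "pos_start", "fmt_GT"],
                          bed_file="gene-promoters-chr1-4.bed"):
    print(len(batch))  # each batch is a dataframe that fits within the memory budget
```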
This section includes useful guides on using TileDB-VCF:
Indexed files are required for ingestion. If your VCF/BCF files have not been indexed, you can use bcftools to do so:
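(The file name below is a placeholder.)

```bash
# Create a CSI index alongside the compressed VCF
bcftools index sample1.vcf.gz
```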
You can ingest samples into an already created dataset as follows:
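(The dataset URI and file names are placeholders.)

```bash
tiledbvcf store --uri my_dataset sample1.vcf.gz sample2.vcf.gz
```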
Just add a regular expression for the VCF file locations at the end of the store command:
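(This sketch relies on the shell expanding the pattern; the directory is a placeholder.)

```bash
tiledbvcf store --uri my_dataset ./vcfs/*.vcf.gz
```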
Alternatively, provide a text file with the absolute locations of the VCF files, one per line:
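(The --samples-file flag name is an assumption; check tiledbvcf store --help for the exact name.)

```bash
# samples.txt contains one absolute VCF path per line
tiledbvcf store --uri my_dataset --samples-file samples.txt
```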
Incremental updates work in the same manner as the ingestion above; nothing special is needed. In addition, ingestion is thread- and process-safe and can therefore be performed in parallel.
The tiledbvcf Python package includes integration with Dask to enable distributing large queries across a cluster of nodes.
You can use the tiledbvcf package's Dask integration to partition read operations across regions and samples. The partitioning semantics are identical to those used by the CLI and Spark.
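A sketch (the region_partitions and sample_partitions argument names mirror the CLI and Spark options but are assumptions, and the dataset URI and BED file are placeholders):

```python
import tiledbvcf
from dask.distributed import Client

client = Client()  # local Dask cluster for simplicity
ds = tiledbvcf.Dataset("my_dataset", mode="r")

# Returns a Dask dataframe whose partitions are read lazily by the workers
ddf = ds.read_dask(
    attrs=["sample_name", "pos_start", "fmt_GT"],
    bed_file="very-large-bedfile.bed",
    region_partitions=10,
    sample_partitions=2,
)
```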
The result is a Dask dataframe (rather than a Pandas dataframe). We're using a local machine for simplicity but the API works on any Dask cluster.
If you plan to filter the results in a Dask dataframe, it may be more efficient to use map_dask() rather than read_dask(). The map_dask() function takes an additional parameter, fnc, allowing you to provide a filtering function that is applied immediately after performing the read but before inserting the partition's results into the Dask dataframe.
In the following example, any variants overlapping regions in very-large-bedfile.bed are filtered out if their start position overlaps the first 25kb of the chromosome.
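A sketch (the map_dask() arguments other than fnc are assumptions, mirroring the read_dask() call above):

```python
def drop_chromosome_start(df):
    # Keep only records whose start position lies beyond the first 25 kb of the chromosome
    return df[df.pos_start > 25_000]

ddf = ds.map_dask(
    drop_chromosome_start,                        # fnc: applied to each batch before it enters the Dask dataframe
    attrs=["sample_name", "pos_start", "fmt_GT"],
    bed_file="very-large-bedfile.bed",
    region_partitions=10,
)
```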
This approach can be more efficient than using read_dask() with a separate filtering step because it avoids the possibility that partitions require multiple read operations due to memory constraints.
The pseudocode describing the read_partition() algorithm (i.e., the code responsible for reading the partition on a Dask worker) is:
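(The following is illustrative pseudocode, not the library's actual internals.)

```python
import pandas as pd

def read_partition(ds, partition):
    # Read the first batch for this partition
    df = ds.read(**partition)
    # If the partition did not fit in the memory budget, keep reading and concatenating
    while not ds.read_completed():
        df = pd.concat([df, ds.continue_read()], ignore_index=True)
    return df
```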
When using map_dask() instead, the pseudocode becomes:
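(Again illustrative pseudocode only.)

```python
import pandas as pd

def read_partition_mapped(ds, partition, filter_fnc):
    # Apply the filter to every batch as soon as it is read, before concatenating
    df = filter_fnc(ds.read(**partition))
    while not ds.read_completed():
        df = pd.concat([df, filter_fnc(ds.continue_read())], ignore_index=True)
    return df
```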
You can see that if the provided filter_fnc() reduces the size of the data substantially, using map_dask() can reduce the likelihood that the Dask workers will run out of memory and avoid the need to perform multiple reads.