Unlike TileDB-VCF's CLI, which exports directly to disk, results for queries performed using Python are read into memory. Therefore, when querying even moderately sized genomic datasets, the amount of available memory must be taken into consideration.
This guide demonstrates several of the TileDB-VCF features for overcoming memory limitations when querying large datasets.
One strategy for accommodating large queries is to simply increase the amount of memory available to tiledbvcf. By default, tiledbvcf allocates 2 GB of memory for queries. However, this value can be adjusted using the memory_budget_mb parameter. For the purposes of this tutorial, the budget will be decreased to demonstrate how tiledbvcf is able to perform genome-scale queries even in a memory-constrained environment.
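As a minimal sketch of adjusting the budget, the helper below opens a dataset for reading with an explicit memory_budget_mb value. It assumes the tiledbvcf Python package is installed; open_with_budget is a hypothetical helper name, and the URI and the 256 MB figure in the commented usage are placeholders rather than values from this guide.

```python
def open_with_budget(uri, memory_budget_mb):
    """Open a TileDB-VCF dataset in read mode with a custom memory budget."""
    # Imported lazily so this file can be loaded without tiledbvcf installed.
    import tiledbvcf

    # ReadConfig holds read-time settings; memory_budget_mb is given in megabytes.
    cfg = tiledbvcf.ReadConfig(memory_budget_mb=memory_budget_mb)
    return tiledbvcf.Dataset(uri, mode="r", cfg=cfg)


# Example usage (placeholder URI and budget, not taken from this guide):
# ds = open_with_budget("my_vcf_dataset", memory_budget_mb=256)
```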
For queries that encompass many genomic regions, you can simply provide an external BED file. In this example, you will query for any variants located in the promoter region of a known gene on chromosomes 1-4.
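A query of this kind might look like the sketch below, which passes a BED file path to the dataset's read method. The helper name read_bed_regions, the file name in the commented usage, and the chosen attribute list are assumptions for illustration; substitute the attributes your analysis needs.

```python
def read_bed_regions(ds, bed_path,
                     attrs=("sample_name", "contig", "pos_start", "pos_end")):
    """Read the requested attributes for every region listed in a BED file.

    `ds` is expected to be an open tiledbvcf.Dataset in read mode.
    The default attribute list is an assumption, not taken from this guide.
    """
    # The result may be only the first batch if the memory budget is exceeded.
    return ds.read(attrs=list(attrs), bed_file=bed_path)


# Example usage (hypothetical file name):
# df = read_bed_regions(ds, "promoters.bed")
```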
After performing a query, you can use read_completed() to verify whether or not all results were successfully returned.
In this case, it returned False, indicating the requested data was too large to fit into the allocated memory, so tiledbvcf retrieved as many records as possible in this first batch. The remaining records can be retrieved using continue_read(). Here, we've set up our code to accommodate the possibility that the full set of results is split across multiple batches.
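One way to structure such a loop is sketched below: issue the first read, then keep calling continue_read() until read_completed() reports True. The helper name read_all_batches is hypothetical, and it returns the batches as a list so the caller decides how to combine them.

```python
def read_all_batches(ds, **read_kwargs):
    """Collect every batch of an incomplete read into a list.

    `ds` is expected to be an open tiledbvcf.Dataset in read mode;
    `read_kwargs` are forwarded unchanged to ds.read().
    """
    # The first call returns as many records as fit in the memory budget.
    batches = [ds.read(**read_kwargs)]

    # Keep fetching until the dataset reports that nothing is left.
    while not ds.read_completed():
        batches.append(ds.continue_read())

    return batches


# Example usage (assumes pandas is installed and `ds` is already open):
# import pandas as pd
# df = pd.concat(read_all_batches(ds, bed_file="promoters.bed"), ignore_index=True)
```

Note that concatenating every batch still requires enough memory to hold the full result, so this approach suits cases where only the per-query buffer, not the final dataframe, is the limiting factor.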
Here is the final dataframe, which includes 3,808,687 records:
A Python generator version of the read method is also provided. This pattern provides a powerful interface for batch processing variant data.
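As an illustration of the generator pattern, the sketch below uses the dataset's read_iter method to tally records one batch at a time, so no more than a single batch is held in memory. The helper name count_records is hypothetical, and counting rows stands in for whatever per-batch processing you need.

```python
def count_records(ds, attrs, bed_path):
    """Count records across all batches without keeping them in memory.

    `ds` is expected to be an open tiledbvcf.Dataset in read mode.
    """
    total = 0
    # read_iter yields one batch (a dataframe) at a time.
    for batch in ds.read_iter(attrs=attrs, bed_file=bed_path):
        # Replace this line with filtering, aggregation, or export as needed.
        total += len(batch)
    return total


# Example usage (hypothetical file name):
# n = count_records(ds, ["sample_name", "pos_start"], "promoters.bed")
```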