Python

Ingestion

The TileDB-VCF Python API includes an experimental interface for ingesting samples. This API is not yet fully stable and is subject to change:

Python
import tiledbvcf
ds = tiledbvcf.TileDBVCFDataset('my-dataset', mode='w')
samples = ['{}.bcf'.format(i) for i in range(1, 6)]
ds.ingest_samples(samples)
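
A common pattern is to build the list of sample paths programmatically before ingesting. The sketch below assumes a hypothetical local directory bcf-files/ containing the BCF files to ingest; only the ingest_samples call from the example above is used:

Python
import glob

import tiledbvcf

# Hypothetical directory of BCF files; adjust the glob pattern to your data layout.
samples = sorted(glob.glob('bcf-files/*.bcf'))

# Open the dataset for writing and ingest the files, as in the example above.
ds = tiledbvcf.TileDBVCFDataset('my-dataset', mode='w')
ds.ingest_samples(samples)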

Reading

The Python API provides very similar functionality to the Spark API. By default, data is read and returned as a Pandas dataframe, but integration with Dask is also available, producing data as a Dask dataframe.

Read into Pandas Dataframe

To read data from two regions and three samples:

Python
import tiledbvcf
ds = tiledbvcf.TileDBVCFDataset('my-dataset', mode='r')
df = ds.read(attrs=['sample_name', 'pos_start', 'pos_end'],
             regions=['chr1:1000-2000', 'chr2:500-501'],
             samples=['sampleA', 'sampleB', 'sampleD'])

The read operation returns df, a Pandas dataframe with three columns containing the sample name, start position, and end position of each result record.
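
Because df is a standard Pandas dataframe, the usual Pandas operations apply to the results. A small sketch, assuming the read above has already run:

Python
# Inspect the first few result records and count records per sample.
print(df.head())
print(df.groupby('sample_name').size())

# Derive a record length column from the start and end positions
# (illustrative only; positions are as returned by TileDB-VCF).
df['length'] = df.pos_end - df.pos_start + 1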

Handling Large Reads

For very large reads, the Python API also allows batches of samples to be read separately, enabling incremental (out-of-core) processing when the requested data is too large to fit in main memory.

Python
import tiledbvcf
ds = tiledbvcf.TileDBVCFDataset('my-large-dataset', mode='r')
df = ds.read(attrs=['sample_name', 'pos_start', 'pos_end'],
             bed_file='very-large-bedfile.bed')
# Run your function to process the first dataframe:
your_processing_function(df)
while not ds.read_completed():
    df = ds.continue_read()
    your_processing_function(df)

This example reads using a large BED file (containing many regions) across all samples in the dataset. The results are returned in batches, where each batch is a new dataframe object that can be processed independently.
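
If you need a single combined dataframe rather than processing each batch independently, the batches can be concatenated with Pandas. This is a sketch of that pattern, assuming the combined result still fits in memory:

Python
import pandas as pd
import tiledbvcf

ds = tiledbvcf.TileDBVCFDataset('my-large-dataset', mode='r')

# Collect every batch, then concatenate them into one dataframe at the end.
batches = [ds.read(attrs=['sample_name', 'pos_start', 'pos_end'],
                   bed_file='very-large-bedfile.bed')]
while not ds.read_completed():
    batches.append(ds.continue_read())
df = pd.concat(batches, ignore_index=True)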

There is also a corresponding API that uses a Python generator to accomplish the same task:

Python
import tiledbvcf
ds = tiledbvcf.TileDBVCFDataset('my-large-dataset', mode='r')
for df in ds.read_iter(attrs=['sample_name', 'pos_start', 'pos_end'],
                       bed_file='very-large-bedfile.bed'):
    your_processing_function(df)

Read into Dask Dataframe

In this example we use the TileDB-VCF Python API with Dask to partition the read operations across regions and samples. The partitioning semantics are identical to those of the Spark API. The result is a Dask dataframe instead of the usual Pandas dataframe. We focus on a local machine for simplicity, but the API works on any Dask cluster.

Python
import tiledbvcf
import dask
ds = tiledbvcf.TileDBVCFDataset('my-large-dataset', mode='r')
dask_df = ds.read_dask(attrs=['sample_name', 'pos_start', 'pos_end'],
                       bed_file='very-large-bedfile.bed',
                       region_partitions=10,
                       sample_partitions=2)

This example performs the same "very large" read as the previous example, but using Dask to partition the work across the available Dask workers.
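
Running this on a local machine requires a Dask scheduler; one option is a local cluster from dask.distributed. The sketch below starts a local cluster (the worker and thread counts are illustrative) and then materializes a per-sample record count; nothing is read until compute() executes the Dask graph:

Python
from dask.distributed import Client

import tiledbvcf

# Start a local Dask cluster; the worker/thread counts here are illustrative.
client = Client(n_workers=2, threads_per_worker=4)

ds = tiledbvcf.TileDBVCFDataset('my-large-dataset', mode='r')
dask_df = ds.read_dask(attrs=['sample_name', 'pos_start', 'pos_end'],
                       bed_file='very-large-bedfile.bed',
                       region_partitions=10,
                       sample_partitions=2)

# The read work is distributed across the Dask workers when the graph runs.
counts = dask_df.groupby('sample_name').size().compute()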

If you plan to filter the results in the Dask dataframe, it can in some cases be more efficient to use the map_dask API instead of the more basic read_dask. The map_dask function takes an additional parameter, a function that performs the filtering, and applies it to each partition immediately after the read, before the partition's results are combined into the Dask dataframe.

The following simple example keeps only the results whose pos_start attribute is less than 25000:

Python
import tiledbvcf
ds = tiledbvcf.TileDBVCFDataset('my-large-dataset', mode='r')
dask_df = ds.map_dask(lambda df: df[df.pos_start < 25000],
                      attrs=['sample_name', 'pos_start', 'pos_end'],
                      bed_file='very-large-bedfile.bed',
                      region_partitions=10,
                      sample_partitions=2)

The reason this API can be more efficient than read_dask followed by a separate filtering step is that a partition may require multiple read operations if a result is incomplete. The pseudocode describing the read_partition() algorithm (i.e. the code a Dask worker runs to read its partition) is:

Pseudocode
ds = tiledbvcf.TileDBVCFDataset(uri, mode='r')
overall_df = ds.read(attrs, samples, regions, ...)
while not ds.read_completed():
    df = ds.continue_read()
    overall_df = overall_df.append(df)

When using the map_dask API instead, the pseudocode becomes:

Pseudocode
ds = tiledbvcf.TileDBVCFDataset(uri, mode='r')
overall_df = filter_fnc(ds.read(attrs, samples, regions, ...))
while not ds.read_completed():
    df = filter_fnc(ds.continue_read())
    overall_df = overall_df.append(df)

If the provided filter_fnc substantially reduces the size of the data, using map_dask can make the difference between the Dask workers running out of memory and running to completion.
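
For example, a filter function that both subsets records and drops a column that is not needed downstream shrinks each partition before it is combined into the Dask dataframe. This sketch uses only the attributes from the earlier examples; the length threshold is purely illustrative:

Python
import tiledbvcf

def short_records_only(df):
    # Keep only short records (an illustrative criterion), then drop pos_end
    # since downstream processing in this sketch no longer needs it.
    df = df[(df.pos_end - df.pos_start) < 50]
    return df.drop(columns=['pos_end'])

ds = tiledbvcf.TileDBVCFDataset('my-large-dataset', mode='r')
dask_df = ds.map_dask(short_records_only,
                      attrs=['sample_name', 'pos_start', 'pos_end'],
                      bed_file='very-large-bedfile.bed',
                      region_partitions=10,
                      sample_partitions=2)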