Python
tiledbvcf.dataset
This is the main Python module.
Dataset
Representation of the grouped TileDB arrays that constitute a TileDB-VCF dataset, which includes a sparse 3D array containing the actual variant data and a sparse 1D array containing various sample metadata and the VCF header lines. Read more about the data model here.
Arguments
- uri: (str) URI of the TileDB-VCF dataset
- mode: (str) Open the dataset in read ('r') or write ('w') mode (default 'r')
- cfg: TileDB-VCF configuration (optional)
- stats: (bool) Enable or disable TileDB stats
- verbose: (bool) Enable or disable TileDB-VCF verbose output
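For example, opening a dataset for reading might look like this (a minimal sketch; `my-vcf-dataset` is a placeholder URI and the dataset is assumed to already exist):

```python
import tiledbvcf

# Open an existing TileDB-VCF dataset in read mode.
# "my-vcf-dataset" is a placeholder; local paths and remote (e.g. S3) URIs both work.
ds = tiledbvcf.Dataset(uri="my-vcf-dataset", mode="r")
```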
create_dataset()
Create a new TileDB-VCF dataset.
Arguments
- extra_attrs: (list of str) List of extra attributes to materialize from the FMT or INFO fields. Names should be fmt_X or info_X for a field named X (case sensitive).
- tile_capacity: (int) Tile capacity to use for the array schema (default 10000)
- anchor_gap: (int) Length of gaps between inserted anchor records, in bases (default 1000)
- checksum_type: (str) Optional override of the checksum type used when creating the new dataset (valid values are 'sha256', 'md5', or None)
- allow_duplicates: (bool) Controls whether records with duplicate end positions can be written to the dataset
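As a sketch, creating a dataset with two extra materialized attributes might look like the following (the URI is a placeholder, and the `fmt_GT`/`info_DP` field names are illustrative; any FORMAT or INFO field present in your VCFs can be named):

```python
import tiledbvcf

# Open a not-yet-existing dataset in write mode, then create it.
ds = tiledbvcf.Dataset(uri="my-vcf-dataset", mode="w")
ds.create_dataset(
    extra_attrs=["fmt_GT", "info_DP"],  # materialize FORMAT:GT and INFO:DP
    tile_capacity=10000,
    anchor_gap=1000,
)
```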
ingest_samples()
Ingest samples into an existing TileDB-VCF dataset.
Arguments
- sample_uris: (list of str) List of VCF/BCF sample URIs to ingest
- threads: (int) Number of threads used for ingestion
- thread_task_size: (int) Max length (number of columns) of an ingestion task (affects load balancing of ingestion work across threads and total memory consumption)
- memory_budget_mb: (int) Max size (MB) of TileDB buffers before flushing (default 1024)
- scratch_space_path: (str) Directory used for local storage of downloaded remote samples
- scratch_space_size: (int) Amount of local storage (MB) that can be used for downloading remote samples
- sample_batch_size: (int) Number of samples per batch for ingestion (default 10)
- record_limit: (int) Limit on the number of VCF records read into memory per file (default 50000)
- resume: (bool) Whether to check for and attempt to resume a partially completed ingestion
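A minimal ingestion sketch (the dataset URI and sample file names are placeholders; files may be local or remote):

```python
import tiledbvcf

# Open an existing dataset in write mode and ingest two samples.
ds = tiledbvcf.Dataset(uri="my-vcf-dataset", mode="w")
ds.ingest_samples(
    sample_uris=["sample1.bcf", "sample2.bcf"],  # placeholder file names
    threads=4,
    memory_budget_mb=1024,
    resume=True,  # pick up where a partially completed ingestion left off
)
```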
read() / read_arrow()
Reads data from a TileDB-VCF dataset into a Pandas Dataframe (with read()
) or a PyArrow Array (with read_arrow()
).
Arguments
- attrs: (list of str) List of attributes to extract. Can include attributes from the VCF INFO and FORMAT fields (prefixed with info_ and fmt_, respectively) as well as any of the built-in attributes: sample_name, id, contig, alleles, filters, pos_start, pos_end, qual, query_bed_start, query_bed_end, query_bed_line
- samples: (list of str) List of sample names to be read
- regions: (list of str) List of genomic regions to be read
- samples_file: (str) URI of a file containing sample names to be read, one per line
- bed_file: (str) URI of a BED file of genomic regions to be read
- skip_check_samples: (bool) Whether to skip checking that the requested samples exist in the array
- disable_progress_estimation: (bool) Whether to skip estimating progress in verbose mode (estimating progress can have performance or memory impacts in some cases)
Details
For large datasets, a call to read() may not be able to fit all results in memory. In that case, the returned dataframe will contain as many results as possible; to retrieve the remaining results, use the continue_read() function. You can also use the Python generator version, read_iter().
Returns: Pandas DataFrame containing results.
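The incomplete-read loop described above can be sketched as follows (the dataset URI and region are placeholders):

```python
import pandas as pd
import tiledbvcf

ds = tiledbvcf.Dataset(uri="my-vcf-dataset", mode="r")

# Each call returns as many results as fit in the memory budget;
# keep calling continue_read() until read_completed() reports True.
pieces = [ds.read(
    attrs=["sample_name", "contig", "pos_start", "alleles"],
    regions=["chr1:1-1000000"],
)]
while not ds.read_completed():
    pieces.append(ds.continue_read())
df = pd.concat(pieces, ignore_index=True)
```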
read_completed()
Details
A read is considered complete if the resulting dataframe contained all results.
Returns: (bool) True if the previous read operation was complete
count()
Counts data in a TileDB-VCF dataset.
Arguments
- samples: (list of str) List of sample names to include in the count
- regions: (list of str) List of genomic regions to include in the count
Details
Returns: Number of intersecting records in the dataset
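For example, counting the records that intersect a region for two samples (the URI, sample names, and region are placeholders):

```python
import tiledbvcf

ds = tiledbvcf.Dataset(uri="my-vcf-dataset", mode="r")
n = ds.count(
    samples=["sampleA", "sampleB"],  # placeholder sample names
    regions=["chr1:1-1000000"],
)
```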
attributes()
List queryable attributes available in the VCF dataset
Arguments
- attr_type: (str) The subset of attributes to retrieve: "info" or "fmt" will only retrieve attributes ingested from the VCF INFO and FORMAT fields, respectively; "builtin" retrieves the static attributes defined in TileDB-VCF's schema; "all" (the default) returns all queryable attributes
Details
Returns: a list of strings representing the attribute names
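For example (the dataset URI is a placeholder):

```python
import tiledbvcf

ds = tiledbvcf.Dataset(uri="my-vcf-dataset", mode="r")
info_attrs = ds.attributes(attr_type="info")  # only INFO-derived attributes
all_attrs = ds.attributes()                   # every queryable attribute
```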
tiledbvcf.ReadConfig
Set various configuration parameters.
Parameters
- limit: Max number of records (rows) to read
- region_partition: Region partition tuple (idx, num_partitions)
- sample_partition: Sample partition tuple (idx, num_partitions)
- sort_regions: Whether or not to sort the regions to be read (default True)
- memory_budget_mb: Memory budget (MB) for buffer and internal allocations (default 2048)
- buffer_percentage: Percentage of memory to dedicate to TileDB query buffers (default 25)
- tiledb_tile_cache_percentage: Percentage of memory to dedicate to the TileDB tile cache (default 10)
- tiledb_config: List of strings in the format "option=value" (see the TileDB documentation for the full list of configuration parameters)
tiledbvcf.dask
This module is for the TileDB-VCF integration with Dask.
read_dask()
Reads data from a TileDB-VCF dataset into a Dask DataFrame
.
Arguments
- attrs: (list of str) List of attribute names to be read
- region_partitions: (int) Number of partitions over regions
- sample_partitions: (int) Number of partitions over samples
- samples: (list of str) List of sample names to be read
- regions: (list of str) List of genomic regions to be read
- samples_file: (str) URI of a file containing sample names to be read, one per line
- bed_file: (str) URI of a BED file of genomic regions to be read
Details
Partitioning proceeds by a straightforward block distribution, parameterized by the total number of partitions and the index of the partition that a given read operation is responsible for.
Both region and sample partitioning can be used together.
Returns: Dask DataFrame with results
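Assuming read_dask() is exposed as a method on an open Dataset (as the tiledbvcf.dask integration is wired in the Python package), a partitioned read might look like this (the URIs are placeholders):

```python
import tiledbvcf

ds = tiledbvcf.Dataset(uri="my-vcf-dataset", mode="r")
ddf = ds.read_dask(
    attrs=["sample_name", "contig", "pos_start"],
    bed_file="regions.bed",  # placeholder BED file URI
    region_partitions=4,
    sample_partitions=2,
)
df = ddf.compute()  # materialize the Dask DataFrame when needed
```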
map_dask()
Maps a function on a Dask DataFrame
obtained by reading from the dataset.
Arguments
- fnc: (function) Function applied to each partition
- attrs: (list of str) List of attribute names to be read
- region_partitions: (int) Number of partitions over regions
- sample_partitions: (int) Number of partitions over samples
- samples: (list of str) List of sample names to be read
- regions: (list of str) List of genomic regions to be read
- samples_file: (str) URI of a file containing sample names to be read, one per line
- bed_file: (str) URI of a BED file of genomic regions to be read
Details
May be more efficient in some cases than read_dask()
followed by a regular Dask map operation.
Returns: Dask DataFrame with results
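A sketch of map_dask(), filtering each partition before results are combined (the dataset URI is a placeholder and the qual threshold is arbitrary):

```python
import tiledbvcf

def high_qual(df):
    # Applied independently to each partition's pandas DataFrame.
    return df[df["qual"] >= 30]

ds = tiledbvcf.Dataset(uri="my-vcf-dataset", mode="r")
ddf = ds.map_dask(
    high_qual,
    attrs=["sample_name", "pos_start", "qual"],
    region_partitions=4,
)
```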