tiledbvcf.dataset
This is the main Python module.
Dataset
Representation of the grouped TileDB arrays that constitute a TileDB-VCF dataset, which includes a sparse 3D array containing the actual variant data and a sparse 1D array containing various sample metadata and the VCF header lines. Read more about the data model here.
uri
: URI of TileDB-VCF dataset
mode
: (default 'r') Open the array object in read 'r' or write 'w' mode
cfg
: TileDB-VCF configuration (optional)
stats
: (bool) Enable or disable TileDB stats
verbose
: (bool) Enable or disable TileDB-VCF verbose output
create_dataset()
Create a new TileDB-VCF dataset.
extra_attrs
: (list of str attrs) List of extra attributes to materialize from the FMT or INFO field. Names should be fmt_X or info_X for a field name X (case sensitive).
tile_capacity
: (int) Tile capacity to use for the array schema (default 10000)
anchor_gap
: (int) Length of gaps between inserted anchor records in bases (default 1000)
checksum_type
: (str checksum) Optional override of the checksum type used when creating a new dataset (valid values are 'sha256', 'md5' or None)
allow_duplicates
: (bool) Controls whether records with duplicate end positions can be written to the dataset
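The naming rule for extra_attrs can be illustrated with a small helper. This is only a sketch: extra_attr_names() and check_checksum_type() are hypothetical names introduced here, not part of tiledbvcf. The resulting list is the kind of value one would pass as extra_attrs.

```python
# Hypothetical helpers (not part of tiledbvcf) illustrating the naming rule
# for extra_attrs and the values listed for checksum_type.
VALID_CHECKSUMS = ("sha256", "md5", None)


def extra_attr_names(info_fields=(), fmt_fields=()):
    """Build names such as 'info_DP' or 'fmt_GT'; field case is preserved."""
    names = [f"info_{field}" for field in info_fields]
    names += [f"fmt_{field}" for field in fmt_fields]
    return names


def check_checksum_type(value):
    """Return value unchanged if it is one of the documented checksum types."""
    if value not in VALID_CHECKSUMS:
        raise ValueError(f"unsupported checksum type: {value!r}")
    return value


attrs = extra_attr_names(info_fields=["DP", "AF"], fmt_fields=["GT"])
print(attrs)
```

Because names are case sensitive, fmt_GT and fmt_gt refer to different fields.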
ingest_samples()
Ingest samples into an existing TileDB-VCF dataset.
sample_uris
: (list of str samples) CSV list of VCF/BCF sample URIs to ingest
threads
: (int) Set the number of threads used for ingestion
thread_task_size
: (int) Set the max length (# columns) of an ingestion task (affects load balancing of ingestion work across threads and total memory consumption)
memory_budget_mb
: (int) Set the max size (MB) of TileDB buffers before flushing (default 1024)
scratch_space_path
: (str) Directory used for local storage of downloaded remote samples
scratch_space_size
: (int) Amount of local storage that can be used for downloading remote samples (MB)
sample_batch_size
: (int) Number of samples per batch for ingestion (default 10)
record_limit
: Limit the number of VCF records read into memory per file (default 50000)
resume
: (bool) Whether to check and attempt to resume a partially completed ingestion
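To make sample_batch_size concrete, the sketch below splits a list of sample URIs into consecutive batches. It assumes simple fixed-size batching, which may not match the library's internal behavior. The batches() helper and the URIs are hypothetical.

```python
# Illustrative only: assumes consecutive fixed-size batches. The library's
# internal batching may differ.
def batches(sample_uris, sample_batch_size=10):
    """Yield consecutive slices of at most sample_batch_size URIs."""
    for start in range(0, len(sample_uris), sample_batch_size):
        yield sample_uris[start:start + sample_batch_size]


# Hypothetical sample URIs.
uris = [f"s3://my-bucket/sample{i}.bcf" for i in range(25)]
sizes = [len(batch) for batch in batches(uris)]
print(sizes)  # [10, 10, 5]
```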
read() / read_arrow()
Reads data from a TileDB-VCF dataset into a Pandas Dataframe (with read()) or a PyArrow Array (with read_arrow()).
attrs
: (list of str attrs) List of attributes to extract. Can include attributes from the VCF INFO and FORMAT fields (prefixed with info_ and fmt_, respectively) as well as any of the builtin attributes:
sample_name
id
contig
alleles
filters
pos_start
pos_end
qual
query_bed_end
query_bed_start
query_bed_line
samples
: (list of str samples) CSV list of sample names to be read
regions
: (list of str regions) CSV list of genomic regions to be read
samples_file
: (str filesystem location) URI of file containing sample names to be read, one per line
bed_file
: (str filesystem location) URI of a BED file of genomic regions to be read
skip_check_samples
: (bool) Whether to skip checking that the requested samples exist in the array
disable_progress_estimation
: (bool) Whether to skip progress estimation in verbose mode. Estimating progress can affect performance or memory use in some cases.
For large datasets, a call to read() may not be able to fit all results in memory. In that case, the returned dataframe will contain as many results as possible; to retrieve the rest, call continue_read().
You can also use the Python generator version, read_iter().
Returns: Pandas DataFrame containing results.
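The incomplete-read pattern can be sketched as a loop over read(), read_completed() and continue_read(). The helper read_in_chunks() is hypothetical. FakeDataset is a stand-in so the sketch runs without a real dataset, and it returns plain lists where a real dataset would return DataFrames.

```python
def read_in_chunks(ds, **read_kwargs):
    """Yield the result of read(), then of continue_read() until complete."""
    yield ds.read(**read_kwargs)
    while not ds.read_completed():
        yield ds.continue_read()


class FakeDataset:
    """Stand-in for demonstration only; NOT part of tiledbvcf."""

    def __init__(self, chunks):
        self._chunks = list(chunks)
        self._next = 0

    def read(self, **kwargs):
        # Start a new read and return the first partial result.
        self._next = 1
        return self._chunks[0]

    def continue_read(self):
        # Return the next partial result.
        chunk = self._chunks[self._next]
        self._next += 1
        return chunk

    def read_completed(self):
        return self._next >= len(self._chunks)


fake = FakeDataset([[1, 2], [3, 4], [5]])
chunks = list(read_in_chunks(fake, attrs=["pos_start"]))
print(chunks)
```

With a real dataset, each yielded chunk would be a DataFrame, so the pieces could be processed one at a time or combined afterwards if they fit in memory.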
read_completed()
A read is considered complete if the resulting dataframe contained all results.
Returns: (bool) True if the previous read operation was complete
count()
Counts data in a TileDB-VCF dataset.
samples
: (list of str samples) CSV list of sample names to include in the count
regions
: (list of str regions) CSV list of genomic regions to include in the count
Returns: Number of intersecting records in the dataset
attributes()
List queryable attributes available in the VCF dataset
attr_type
: (list of str attributes) The subset of attributes to retrieve; "info" or "fmt" will only retrieve attributes ingested from the VCF INFO and FORMAT fields, respectively, "builtin" retrieves the static attributes defined in TileDB-VCF's schema, "all" (the default) returns all queryable attributes
Returns: a list of strings representing the attribute names
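The three attribute groups can be told apart by name alone. The sketch below classifies an attribute name using the info_ and fmt_ prefixes and the builtin attribute names listed for read(). The function attr_kind() is hypothetical and does not query a dataset.

```python
# Builtin attribute names as listed in the read() documentation.
BUILTIN_ATTRS = {
    "sample_name", "id", "contig", "alleles", "filters", "pos_start",
    "pos_end", "qual", "query_bed_end", "query_bed_start", "query_bed_line",
}


def attr_kind(name):
    """Classify an attribute name as 'info', 'fmt' or 'builtin'."""
    if name.startswith("info_"):
        return "info"
    if name.startswith("fmt_"):
        return "fmt"
    if name in BUILTIN_ATTRS:
        return "builtin"
    raise ValueError(f"unrecognized attribute name: {name!r}")


kinds = [attr_kind(n) for n in ["info_DP", "fmt_GT", "pos_start"]]
print(kinds)  # ['info', 'fmt', 'builtin']
```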
tiledbvcf.ReadConfig
Set various configuration parameters.
Parameters
limit
: Max number of records (rows) to read
region_partition
: Region partition tuple (idx, num_partitions)
sample_partition
: Samples partition tuple (idx, num_partitions)
sort_regions
: Whether or not to sort the regions to be read (default True)
memory_budget_mb
: Memory budget (MB) for buffer and internal allocations (default 2048)
buffer_percentage
: Percentage of memory to dedicate to TileDB Query Buffers (default 25)
tiledb_tile_cache_percentage
: Percentage of memory to dedicate to TileDB Tile Cache (default 10)
tiledb_config
: List of strings in the format "option=value" (see here for the full list of TileDB configuration parameters)
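Two of these parameters lend themselves to a short worked example: the "option=value" string format of tiledb_config, and what the default percentages amount to in MB. This sketch assumes each percentage is taken directly from memory_budget_mb, so the library's exact accounting may differ. Both helper names are hypothetical.

```python
def to_tiledb_config(options):
    """Turn a dict of TileDB options into 'option=value' strings."""
    return [f"{key}={value}" for key, value in options.items()]


def memory_split_mb(memory_budget_mb=2048, buffer_percentage=25,
                    tiledb_tile_cache_percentage=10):
    """Return (query buffer MB, tile cache MB), assuming plain percentages."""
    buffers = memory_budget_mb * buffer_percentage / 100
    tile_cache = memory_budget_mb * tiledb_tile_cache_percentage / 100
    return buffers, tile_cache


config_strings = to_tiledb_config({"vfs.s3.region": "us-east-1"})
print(config_strings)     # ['vfs.s3.region=us-east-1']
print(memory_split_mb())  # (512.0, 204.8)
```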
tiledbvcf.dask
This module is for the TileDB-VCF integration with Dask.
read_dask()
Reads data from a TileDB-VCF dataset into a Dask DataFrame.
attrs
: (list of str attrs) List of attribute names to be read
region_partitions
: (int partitions) Number of partitions over regions
sample_partitions
: (int partitions) Number of partitions over samples
samples
: (list of str samples) CSV list of sample names to be read
regions
: (list of str regions) CSV list of genomic regions to be read
samples_file
: (str filesystem location) URI of file containing sample names to be read, one per line
bed_file
: (str filesystem location) URI of a BED file of genomic regions to be read
Partitioning uses a straightforward block distribution, parameterized by the total number of partitions and the index of the partition that a given read operation is responsible for.
Region and sample partitioning can be used together.
Returns: Dask DataFrame with results
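A block distribution can be sketched as follows. This is one common scheme, in which earlier partitions absorb the remainder. It is an assumption for illustration: TileDB-VCF's exact handling of uneven splits may differ, and block_partition() is a hypothetical helper.

```python
def block_partition(items, idx, num_partitions):
    """Return the contiguous block of items owned by partition idx (0-based)."""
    if not 0 <= idx < num_partitions:
        raise ValueError("idx must satisfy 0 <= idx < num_partitions")
    base, extra = divmod(len(items), num_partitions)
    # The first `extra` partitions each take one additional item.
    start = idx * base + min(idx, extra)
    stop = start + base + (1 if idx < extra else 0)
    return items[start:stop]


regions = ["chr1:1-100", "chr1:101-200", "chr2:1-100", "chr2:101-200", "chr3:1-100"]
parts = [block_partition(regions, i, 2) for i in range(2)]
print(parts)
```

Each read operation only needs its own (idx, num_partitions) pair to determine its share, so partitions can be processed independently.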
map_dask()
Maps a function on a Dask DataFrame obtained by reading from the dataset.
fnc
: (function) Function applied to each partition
attrs
: (list of str attrs) List of attribute names to be read
region_partitions
: (int partitions) Number of partitions over regions
sample_partitions
: (int partitions) Number of partitions over samples
samples
: (list of str samples) CSV list of sample names to be read
regions
: (list of str regions) CSV list of genomic regions to be read
samples_file
: (str filesystem location) URI of file containing sample names to be read, one per line
bed_file
: (str filesystem location) URI of a BED file of genomic regions to be read
May be more efficient in some cases than read_dask() followed by a regular Dask map operation.
Returns: Dask DataFrame with results
This is the API reference for TileDB-VCF:
create
: Create a new dataset
store
: Store specified VCF files in a dataset
export
: Export data from a dataset
list
: List all sample names present in a dataset
stat
: Print high-level statistics about a dataset
utils
: Utility functions for a dataset
Create an empty TileDB-VCF dataset.
Ingests registered samples into a TileDB-VCF dataset.
Exports data from a TileDB-VCF dataset.
Lists all sample names present in a TileDB-VCF dataset.
Prints high-level statistics about a TileDB-VCF dataset.
Utils for working with a TileDB-VCF dataset, such as consolidating and vacuuming fragments or fragment metadata.
Flag
Description
-u, --uri
: TileDB dataset URI.
-a, --attributes
: Info or format field names (comma-delimited) to store as separate attributes. Names should be fmt_X or info_X for a field name X (case sensitive).
-c, --tile-capacity
: Tile capacity to use for the array schema [default 10000].
-g, --anchor-gap
: Anchor gap size to use [default 1000].
--tiledb-config
: CSV string of the format 'param1=val1,param2=val2...' specifying optional TileDB configuration parameter settings.
--checksum
: Checksum to use for dataset validation on read and writes [default "sha256"].
-n, --no-duplicates
: Do not allow records with duplicate end positions to be written to the array.
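The flags above can be assembled into a command line programmatically. The sketch below only builds and prints the argument list and does not run anything. It assumes the executable is named tiledbvcf and that create is its subcommand. create_command() is a hypothetical helper.

```python
import shlex


def create_command(uri, attributes=(), tile_capacity=None, anchor_gap=None,
                   no_duplicates=False):
    """Build an argv list for the create command from the documented flags."""
    # Assumption: the CLI executable is called "tiledbvcf".
    argv = ["tiledbvcf", "create", "--uri", uri]
    if attributes:
        argv += ["--attributes", ",".join(attributes)]
    if tile_capacity is not None:
        argv += ["--tile-capacity", str(tile_capacity)]
    if anchor_gap is not None:
        argv += ["--anchor-gap", str(anchor_gap)]
    if no_duplicates:
        argv.append("--no-duplicates")
    return argv


argv = create_command("my_dataset", attributes=["fmt_GT", "info_DP"], anchor_gap=1000)
print(shlex.join(argv))
```

Passing the list to subprocess.run(argv, check=True) would execute it. Keeping arguments as a list avoids shell quoting problems.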
Flag
Description
-u, --uri
: TileDB dataset URI.
-t, --threads
: Number of threads [default 16].
-p, --s3-part-size
: [S3 only] Part size to use for writes (MB) [default 50].
-d, --scratch-dir
: Directory used for local storage of downloaded remote samples.
-s, --scratch-mb
: Amount of local storage (in MB) allocated for downloading remote VCF files prior to ingestion [default 0]. You must configure enough scratch space to hold at least 20 samples. In general, you need space for 2 × the sample batch size (sample_batch_size, which by default is 10). You can read more about the data model here.
-n, --max-record-buff
: Max number of BCF records to buffer per file [default 50000].
-k, --thread-task-size
: Max length (# columns) of an ingestion task. Affects load balancing of ingestion work across threads, and total memory consumption [default 5000000].
-b, --mem-budget-mb
: The total memory budget (MB) used when submitting TileDB queries [default 1024].
-v, --verbose
: Enable verbose output.
--tiledb-config
: CSV string of the format 'param1=val1,param2=val2...' specifying optional TileDB configuration parameter settings.
-f, --samples-file
: File with 1 VCF path to be ingested per line. The format can also include an explicit index path on each line, in the format <vcf-uri><TAB><index-uri>.
--remove-sample-file
: If specified, the samples file (-f argument) is deleted after successful ingestion.
-e, --sample-batch-size
: Number of samples per batch for ingestion [default 10].
--stats
: Enable TileDB stats.
--stats-vcf-header-array
: Enable TileDB stats for VCF header array usage.
--resume
: Resume incomplete ingestion of sample batch.
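Two details from the flags above are easy to get wrong: the tab-separated samples-file format, and the scratch space needed for 2 × the sample batch size. The sketch below illustrates both. The helper names, URIs and the per-sample size are hypothetical, and the estimate assumes every sample is at most largest_sample_mb in size.

```python
def samples_file_text(samples):
    """Render (vcf_uri, index_uri or None) pairs, one sample per line."""
    lines = []
    for vcf_uri, index_uri in samples:
        # An explicit index path is separated from the VCF path by a TAB.
        lines.append(vcf_uri if index_uri is None else f"{vcf_uri}\t{index_uri}")
    return "\n".join(lines) + "\n"


def min_scratch_mb(largest_sample_mb, sample_batch_size=10):
    """Scratch space (MB) to hold 2 x sample_batch_size samples."""
    return 2 * sample_batch_size * largest_sample_mb


text = samples_file_text([
    ("s3://my-bucket/a.bcf", "s3://my-bucket/a.bcf.csi"),  # hypothetical URIs
    ("s3://my-bucket/b.bcf", None),
])
print(text, end="")
print(min_scratch_mb(150))  # 3000
```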
Flag
Description
-u, --uri
: TileDB dataset URI.
-O, --output-format
: Export format. Options are: b: bcf (compressed); u: bcf; z: vcf.gz; v: vcf; t: TSV. [default b].
-o, --output-path
: [TSV export only] The name of the output TSV file.
-t, --tsv-fields
: [TSV export only] An ordered CSV list of fields to export in the TSV. A field name can be one of SAMPLE, ID, REF, ALT, QUAL, POS, CHR, FILTER. Additionally, INFO fields can be specified by I and FMT fields with S. To export the intersecting query region for each row in the output, use the field names Q:POS, Q:END, or Q:LINE.
-r, --regions
: CSV list of regions to export in the format chr:min-max.
-R, --regions-file
: File containing regions (BED format).
--sorted
: Do not sort regions or regions file if they are pre-sorted.
-n, --limit
: Only export the first N intersecting records.
-d, --output-dir
: Directory used for local output of exported samples.
--sample-partition
: Partitions the list of samples to be exported and causes this export to export only a specific partition of them. Specify in the format I:N where I is the partition index and N is the total number of partitions. Useful for batch exports.
--region-partition
: Partitions the list of regions to be exported and causes this export to export only a specific partition of them. Specify in the format I:N where I is the partition index and N is the total number of partitions. Useful for batch exports.
--upload-dir
: If set, all output file(s) from the export process will be copied to the given directory (or S3 prefix) upon completion.
--tiledb-config
: CSV string of the format 'param1=val1,param2=val2...' specifying optional TileDB configuration parameter settings.
-v, --verbose
: Enable verbose output.
-c, --count-only
: Don't write output files, only print the count of the resulting number of intersecting records.
-b, --mem-budget-mb
: The memory budget (MB) used when submitting TileDB queries [default 2048].
--mem-budget-buffer-percentage
: The percentage of the memory budget to use for TileDB query buffers [default 25].
--mem-budget-tile-cache-percentage
: The percentage of the memory budget to use for TileDB tile cache [default 10].
-f, --samples-file
: File with 1 VCF path to be registered per line. The format can also include an explicit index path on each line, in the format <vcf-uri><TAB><index-uri>.
--stats
: Enable TileDB stats.
--stats-vcf-header-array
: Enable TileDB stats for VCF header array usage.
--disable-check-samples
: Disable validating, before executing the query, that the requested samples exist in the dataset and erroring if any of them does not.
--disable-progress-estimation
: Disable progress estimation in verbose mode. Progress estimation can sometimes cause a performance impact.
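The chr:min-max region format and the I:N partition format can be validated before invoking an export. The parsers below are a hypothetical sketch. Treating the partition index I as 0-based is an assumption, since the flag description does not state the base.

```python
def parse_region(text):
    """Parse 'chr:min-max' into (contig, start, end)."""
    # Split on the last ':' so contig names containing ':' still parse.
    contig, _, span = text.rpartition(":")
    start, _, end = span.partition("-")
    if not contig or not start or not end:
        raise ValueError(f"expected chr:min-max, got {text!r}")
    return contig, int(start), int(end)


def parse_partition(text):
    """Parse 'I:N' into (index, total); the index is assumed to be 0-based."""
    idx_text, _, total_text = text.partition(":")
    idx, total = int(idx_text), int(total_text)
    if not 0 <= idx < total:
        raise ValueError(f"partition index out of range in {text!r}")
    return idx, total


region = parse_region("chr1:100-2000")
partition = parse_partition("2:8")
print(region, partition)
```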
Flag
Description
-u, --uri
: TileDB dataset URI.
--tiledb-config
: CSV string of the format 'param1=val1,param2=val2...' specifying optional TileDB configuration parameter settings.
Flag
Description
-u, --uri
: TileDB dataset URI.
--tiledb-config
: CSV string of the format 'param1=val1,param2=val2...' specifying optional TileDB configuration parameter settings.
Flag
Description
-u, --uri
: TileDB dataset URI.
--tiledb-config
: CSV string of the format 'param1=val1,param2=val2...' specifying optional TileDB configuration parameter settings.