Before slicing the data, you may wish to get some information about your dataset, such as the sample names and the attributes you can query.
You can get the sample names as follows, using either the Python API or the CLI:
import tiledbvcf
uri = "my_vcf_dataset"
ds = tiledbvcf.Dataset(uri, mode="r")  # open in "Read" mode
ds.samples()
tiledbvcf list -u my_vcf_dataset
You can get the attributes as follows, using either the Python API or the CLI:
import tiledbvcf
uri = "my_vcf_dataset"
ds = tiledbvcf.Dataset(uri, mode="r")  # open in "Read" mode
ds.attributes()  # will print all queryable attributes
ds.attributes(attr_type="builtin")  # will print all materialized attributes
tiledbvcf stat -u my_vcf_dataset
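Once you have the sample and attribute lists, one way to use them is to check a planned query before running it. The following is a minimal sketch and not part of TileDB-VCF: `missing_items` is a hypothetical helper, and the hard-coded lists are illustrative stand-ins for the values that `ds.samples()` and `ds.attributes()` are assumed to return as lists of names.

```python
def missing_items(requested, available):
    """Return the requested names that are absent from available, in request order."""
    available_set = set(available)
    return [name for name in requested if name not in available_set]


# Illustrative stand-ins; in practice use ds.samples() and ds.attributes().
available_samples = ["HG00096", "HG00097"]
available_attrs = ["alleles", "pos_start", "pos_end"]

# Names we would like to query (hypothetical values).
unknown_samples = missing_items(["HG00096", "NA12878"], available_samples)
unknown_attrs = missing_items(["alleles", "pos_end"], available_attrs)

if unknown_samples or unknown_attrs:
    print("Not in dataset:", unknown_samples + unknown_attrs)
```

Checking names up front keeps a typo in a sample or attribute name from surfacing only after a read has been issued.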
Reading
You can rapidly read from a TileDB-VCF dataset by providing three main parameters (all optional):
A subset of the samples
A subset of the attributes
One or more genomic ranges
Either as strings in format chr:pos_range
Or via a BED file
import tiledbvcf
uri = "my_vcf_dataset"
ds = tiledbvcf.Dataset(uri, mode="r")  # open in "Read" mode
ds.read(
    attrs=["alleles", "pos_start", "pos_end"],
    regions=["1:113409605-113475691", "1:113500000-113600000"],
    # or pass regions as follows:
    # bed_file = <bed_filename>
    samples=["HG0099", "HG00100"],
    # Set allele frequency filter with
    # set_af_filter="<0.5"
)
tiledbvcf export \
--uri my_vcf_dataset \
--output-format t \
--tsv-fields ALT,Q:POS,Q:END \
--sample-names HG0099,HG00100 \
--regions 1:113409605-113475691,1:113500000-113600000
# or pass the regions in a BED file as follows:
# --regions-file <bed_filename>
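The region strings used above follow the pattern chr:pos_range, for example 1:113409605-113475691. If your ranges live in another structure, such as a list of tuples, a small helper can build and check these strings before they are passed as `regions`. This is a sketch under stated assumptions: `format_region` and `parse_region` are hypothetical helpers rather than TileDB-VCF functions, and the validation rules (positive start, end not before start) are assumptions, not documented library behavior.

```python
def format_region(chrom, start, end):
    """Build a 'chr:start-end' string from a contig name and two positions."""
    if start < 1 or end < start:
        raise ValueError(f"invalid range {start}-{end} on {chrom}")
    return f"{chrom}:{start}-{end}"


def parse_region(region):
    """Split a 'chr:start-end' string into (chrom, start, end)."""
    # Split on the last colon, since some contig names contain colons.
    chrom, _, span = region.rpartition(":")
    start_text, _, end_text = span.partition("-")
    if not chrom or not start_text or not end_text:
        raise ValueError(f"not in chr:start-end format: {region!r}")
    return chrom, int(start_text), int(end_text)


# Hypothetical input: ranges held as (contig, start, end) tuples.
ranges = [("1", 113409605, 113475691), ("1", 113500000, 113600000)]
regions = [format_region(*r) for r in ranges]
```

The resulting `regions` list has the same shape as the one passed to `ds.read()` in the example above.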