Read from the Dataset

Basic Utils

Before slicing the data, you may want to inspect the dataset, for example to list its sample names and the attributes you can query.

You can get the sample names as follows:

import tiledbvcf

uri = "my_vcf_dataset"
ds = tiledbvcf.Dataset(uri, mode="r")  # open in read mode
ds.samples()  # returns the list of sample names
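
Once you have the sample names, a quick check that the samples you plan to query are present can save a failed or empty read later. The following is a minimal sketch: `find_missing_samples` is a hypothetical helper, not part of tiledbvcf, and the sample names are illustrative. In practice `available` would be the list returned by `ds.samples()`.

```python
def find_missing_samples(requested, available):
    """Return the requested sample names that are absent from `available`, in order."""
    available_set = set(available)  # set lookup keeps the check fast for large cohorts
    return [name for name in requested if name not in available_set]


# A literal list stands in for ds.samples() so the sketch runs without a dataset.
available = ["HG00096", "HG00097", "HG00099"]
missing = find_missing_samples(["HG00099", "HG00100"], available)
print(missing)
```

An empty result means every requested sample exists in the dataset.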

You can get the attributes as follows:

import tiledbvcf

uri = "my_vcf_dataset"
ds = tiledbvcf.Dataset(uri, mode="r")  # open in read mode
ds.attributes()                     # returns all queryable attributes
ds.attributes(attr_type="builtin")  # returns only the materialized (built-in) attributes
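
The full attribute list can be long, so it may help to group it before choosing what to read. The sketch below assumes that attributes derived from VCF INFO and FORMAT fields carry the prefixes `info_` and `fmt_`, and treats everything else as built-in. That naming convention is an assumption here, so verify it against the output of `ds.attributes()` on your own dataset. `group_attributes` is a hypothetical helper, and the attribute names in the example are illustrative.

```python
def group_attributes(names):
    """Partition attribute names into info, fmt, and builtin groups by prefix."""
    groups = {"info": [], "fmt": [], "builtin": []}
    for name in names:
        if name.startswith("info_"):    # assumed prefix for INFO fields
            groups["info"].append(name)
        elif name.startswith("fmt_"):   # assumed prefix for FORMAT fields
            groups["fmt"].append(name)
        else:
            groups["builtin"].append(name)
    return groups


# A literal list stands in for ds.attributes() so the sketch runs without a dataset.
example = ["sample_name", "pos_start", "pos_end", "alleles", "info_AF", "fmt_GT"]
grouped = group_attributes(example)
print(grouped["builtin"])
```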

Reading

You can rapidly read from a TileDB-VCF dataset by providing three main parameters (all optional):

  1. A subset of the samples

  2. A subset of the attributes

  3. One or more genomic ranges

    1. Either as strings in the format chr:pos_range, for example 1:113409605-113475691

    2. Or via a BED file

import tiledbvcf

uri = "my_vcf_dataset"
ds = tiledbvcf.Dataset(uri, mode="r")  # open in read mode
df = ds.read(  # the result is returned as a pandas DataFrame
    attrs=["alleles", "pos_start", "pos_end"],
    regions=["1:113409605-113475691", "1:113500000-113600000"],
    # or pass the regions as a BED file instead:
    # bed_file=<bed_filename>,
    samples=["HG0099", "HG00100"],
    # optionally filter by allele frequency:
    # set_af_filter="<0.5",
)
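
The two ways of passing genomic ranges use different coordinate conventions, which is an easy source of off-by-one errors. BED files use 0-based starts with an exclusive end, while the sketch below assumes that region strings are 1-based and inclusive (check this against the TileDB-VCF documentation for your version). `region_string` and `write_bed` are hypothetical helpers, not part of tiledbvcf. They build both forms from the same list of intervals.

```python
import os
import tempfile


def region_string(contig, start, end):
    """Format a 1-based inclusive interval as 'contig:start-end'."""
    if start < 1 or end < start:
        raise ValueError(f"invalid interval: {contig}:{start}-{end}")
    return f"{contig}:{start}-{end}"


def write_bed(path, intervals):
    """Write 1-based inclusive intervals as BED lines (0-based start, exclusive end)."""
    with open(path, "w") as fh:
        for contig, start, end in intervals:
            # Shift only the start: the inclusive 1-based end equals the exclusive 0-based end.
            fh.write(f"{contig}\t{start - 1}\t{end}\n")


intervals = [("1", 113409605, 113475691), ("1", 113500000, 113600000)]

# Form 1: a list of strings suitable for the regions parameter.
regions = [region_string(*interval) for interval in intervals]

# Form 2: a BED file whose path could be passed as bed_file.
bed_path = os.path.join(tempfile.mkdtemp(), "regions.bed")
write_bed(bed_path, intervals)
print(regions)
```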
