TileDB-VCF supports two basic operations:
Ingestion: Import a collection of VCF/BCF files into a TileDB-VCF dataset, using either the TileDB-VCF CLI or the Python API. We also describe how to scale ingestion with AWS Batch.
Reading: Read from a TileDB-VCF dataset into a pandas or Dask dataframe (Python) or a Spark dataframe, or export to individual VCF/BCF files.
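As a rough illustration of the two operations through the Python API, here is a minimal sketch. It assumes the `tiledbvcf` package is installed and that `Dataset`, `create_dataset`, `ingest_samples`, and `read` behave as in recent releases; the helper names `ingest_vcfs` and `read_variants`, the dataset URI, and the file names are hypothetical and only serve the example.

```python
def ingest_vcfs(dataset_uri, vcf_uris):
    """Create a new TileDB-VCF dataset and ingest VCF/BCF files into it."""
    # Imported inside the function so this sketch can be loaded even where
    # tiledbvcf is not installed (an assumption made for illustration only).
    import tiledbvcf

    ds = tiledbvcf.Dataset(dataset_uri, mode="w")
    ds.create_dataset()
    ds.ingest_samples(sample_uris=list(vcf_uris))


def read_variants(dataset_uri, samples, regions, attrs):
    """Read a slice of a TileDB-VCF dataset into a pandas dataframe."""
    import tiledbvcf

    ds = tiledbvcf.Dataset(dataset_uri, mode="r")
    return ds.read(attrs=attrs, samples=samples, regions=regions)


# Hypothetical usage; the URI, file names, and sample names are placeholders:
# ingest_vcfs("my_vcf_dataset", ["sample1.bcf", "sample2.bcf"])
# df = read_variants(
#     "my_vcf_dataset",
#     samples=["sample1"],
#     regions=["chr1:1-100000"],
#     attrs=["sample_name", "pos_start", "alleles"],
# )
```

Check the API reference of your installed version before relying on these signatures, since argument names and defaults may differ between releases.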
After ingesting a VCF/BCF collection into a TileDB-VCF dataset, you can, for simplicity, think of the dataset as a dataframe with columns such as alleles. TileDB-VCF lets you read or export by providing:
Any set of sample names (optional, the default is all samples)
Any set of genomic position ranges (optional, the default is the entire genomic domain)
Any set of columns/attributes (optional, the default is all columns)
The result will be either a pandas/Dask/Spark dataframe (for reads) or a set of VCF/BCF files (for exports) including only the columns and entries satisfying your query inputs.
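To make the defaults of these three optional inputs concrete, here is a small conceptual model in plain Python. It is not the TileDB-VCF API: the `query` and `overlaps` functions, the toy records, and the field names (`sample_name`, `contig`, `pos_start`, `pos_end`) are assumptions made for illustration, as is the choice to treat a record as matching when it overlaps a requested range.

```python
def overlaps(record, region):
    """Return True if the record's interval overlaps (contig, start, end)."""
    contig, start, end = region
    return (
        record["contig"] == contig
        and record["pos_start"] <= end
        and record["pos_end"] >= start
    )


def query(records, samples=None, regions=None, attrs=None):
    """Filter toy records; each argument left as None means 'no restriction'."""
    result = []
    for rec in records:
        # Default: all samples.
        if samples is not None and rec["sample_name"] not in samples:
            continue
        # Default: the entire genomic domain.
        if regions is not None and not any(overlaps(rec, r) for r in regions):
            continue
        # Default: all columns.
        keys = list(rec) if attrs is None else attrs
        result.append({k: rec[k] for k in keys})
    return result


# Toy data standing in for an ingested dataset (values are made up).
RECORDS = [
    {"sample_name": "S1", "contig": "chr1", "pos_start": 100, "pos_end": 100,
     "alleles": ["A", "G"]},
    {"sample_name": "S1", "contig": "chr2", "pos_start": 500, "pos_end": 502,
     "alleles": ["CTT", "C"]},
    {"sample_name": "S2", "contig": "chr1", "pos_start": 150, "pos_end": 150,
     "alleles": ["T", "C"]},
]

everything = query(RECORDS)  # no inputs: all samples, positions, and columns
subset = query(
    RECORDS,
    samples={"S1"},
    regions=[("chr1", 1, 200)],
    attrs=["sample_name", "alleles"],
)
```

Calling `query` with no arguments returns every record with every column, while `subset` keeps only the S1 record on chr1 and only the two requested columns.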
For more details on the columns/attributes of a TileDB-VCF dataset, see the Data Model page.