TileDB-VCF supports two basic operations:

  • Ingestion: Import a collection of VCF/BCF files into a TileDB-VCF dataset, using either the TileDB-VCF CLI or the Python API. We also describe how to scale ingestion with AWS Batch.

  • Reading: Read from a TileDB-VCF dataset into a pandas or Dask dataframe (Python) or a Spark dataframe, or export the data into individual VCF/BCF files.

After you ingest a VCF/BCF collection into a TileDB-VCF dataset, you can think of the dataset, for simplicity, as a dataframe with columns such as sample_name, pos, end, and alleles. To read or export, you can provide any combination of the following:

  • Any set of sample names (optional, the default is all samples)

  • Any set of genomic position ranges (optional, the default is the entire genomic domain)

  • Any set of columns/attributes (optional, the default is all columns)

The result is either a pandas/Dask/Spark dataframe (for reads) or a set of VCF/BCF files (for exports), containing only the columns and entries that match your query inputs.
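
To make the query semantics above concrete, here is a minimal plain-Python sketch of the "dataset as a dataframe" mental model. It does not use the TileDB-VCF API: the `query` function, the `RECORDS` rows, and the representation of a region as an inclusive `(start, stop)` pair are all hypothetical, chosen only for illustration. It also assumes that a row matches a region when its `[pos, end]` interval overlaps that region, and it ignores contigs for brevity.

```python
# Toy stand-in for a TileDB-VCF dataset: one dict per row of the
# conceptual dataframe. All values here are made up for illustration.
RECORDS = [
    {"sample_name": "s1", "pos": 100, "end": 100, "alleles": ["A", "G"]},
    {"sample_name": "s1", "pos": 250, "end": 260, "alleles": ["T", "<NON_REF>"]},
    {"sample_name": "s2", "pos": 100, "end": 100, "alleles": ["A", "C"]},
    {"sample_name": "s3", "pos": 500, "end": 500, "alleles": ["G", "GA"]},
]


def query(records, samples=None, regions=None, columns=None):
    """Filter rows by sample, position range, and column (hypothetical helper).

    Each argument is optional; None means "no restriction", mirroring the
    defaults of all samples, the entire genomic domain, and all columns.
    """
    result = []
    for rec in records:
        # Keep only the requested samples.
        if samples is not None and rec["sample_name"] not in samples:
            continue
        # Assumption: a row matches if [pos, end] overlaps any inclusive region.
        if regions is not None and not any(
            rec["pos"] <= stop and rec["end"] >= start for start, stop in regions
        ):
            continue
        # Project onto the requested columns.
        keep = columns if columns is not None else list(rec)
        result.append({col: rec[col] for col in keep})
    return result


# Two samples, one position range, two columns.
subset = query(
    RECORDS,
    samples={"s1", "s2"},
    regions=[(90, 255)],
    columns=["sample_name", "pos"],
)
print(subset)
```

Calling `query(RECORDS)` with no arguments returns every row with every column, which corresponds to the defaults listed above. In an actual read, TileDB-VCF performs this selection for you and returns a dataframe rather than a list of dicts.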

For more details on the columns/attributes of a TileDB-VCF dataset, see the Data Model page.