CLI

Ingestion

The primary interface for creating TileDB-VCF datasets and ingesting samples is the command-line interface (CLI) tool.

As explained in the Ingestion Algorithm, there are three phases to ingestion: create, register, and store. These phases each correspond to a different command mode:

  • ./tiledbvcf create

    Create an empty TileDB-VCF dataset

  • ./tiledbvcf register

    Register new samples, required before their actual ingestion

  • ./tiledbvcf store

    Ingest the new samples

Executing these commands without any further arguments will display the help page for each mode. Next we'll look at a few examples.

New Dataset Creation

In this example we will create a new dataset and ingest 10 samples into it. We'll assume that the samples are named 1.bcf through 10.bcf and that they reside in the same directory where you are executing tiledbvcf.

First, if the samples have not been indexed, index them using e.g. bcftools:

$ for f in *.bcf; do bcftools index -c "$f"; done

Samples must be indexed before they can be ingested.
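When some samples already have an index, re-indexing everything is wasted work. The following is a minimal sketch of a helper that indexes only the files lacking an index. The function name index_missing is hypothetical, and the sketch assumes that bcftools is on the PATH and that bcftools index -c writes its CSI index next to the input as <file>.csi.

```shell
# Index every .bcf file in a directory that does not yet have a .csi index.
# Assumption: `bcftools index -c FILE` writes the index to FILE.csi.
index_missing() {
  dir=${1:-.}
  for f in "$dir"/*.bcf; do
    [ -e "$f" ] || continue                      # the glob matched nothing
    [ -e "$f.csi" ] || bcftools index -c "$f"    # skip files already indexed
  done
}
```

Calling index_missing with no argument processes the current directory.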

Next, create the empty dataset. We'll leave all parameters at their defaults, except we will specify that the dataset should store the GT and MIN_DP VCF fields as separate attributes:

$ tiledbvcf create --uri my-dataset --attributes fmt_GT,fmt_MIN_DP

This creates the empty dataset my-dataset in the current directory. If there is already a directory named my-dataset, this command will produce an error.
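Because create fails when the target directory already exists, a script that may be re-run can check for the directory first. This is only a sketch: the function name create_if_missing is hypothetical, it applies to local paths only, and it assumes that an existing directory at that path really is the dataset.

```shell
# Create the dataset only when no directory exists at the given local path.
# Extra arguments (for example --attributes ...) are passed through to create.
create_if_missing() {
  uri=$1
  shift
  if [ -d "$uri" ]; then
    echo "directory $uri already exists, skipping create" >&2
    return 0
  fi
  tiledbvcf create --uri "$uri" "$@"
}
```

For the example above, the call would be create_if_missing my-dataset --attributes fmt_GT,fmt_MIN_DP.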

Next, register the samples. Because there are only a few, we can pass them directly as a command line argument:

$ tiledbvcf register --uri my-dataset *.bcf

Lastly, ingest the samples.

$ tiledbvcf store --uri my-dataset *.bcf

When ingesting a large number of samples, it is more convenient to list their filenames in a separate text file and reference that file during registration and ingestion:

$ cat samples.txt
1.bcf
2.bcf
...
10.bcf
$ tiledbvcf register --uri my-dataset --samples-file samples.txt
$ tiledbvcf store --uri my-dataset --samples-file samples.txt

In general the filenames listed in samples.txt can be full paths to anywhere on the filesystem, or S3 URIs.
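For a directory holding many samples, the samples file can be generated rather than written by hand. The sketch below uses a hypothetical helper, write_samples_file, that writes one .bcf path per line; passing an absolute directory yields full paths in the output.

```shell
# Write the path of every .bcf file in a directory to a samples file,
# one path per line. Index files such as 1.bcf.csi do not match the glob.
write_samples_file() {
  dir=$1
  out=$2
  : > "$out"                          # start with an empty file
  for f in "$dir"/*.bcf; do
    [ -e "$f" ] || continue           # the glob matched nothing
    printf '%s\n' "$f" >> "$out"
  done
}
```

For example, write_samples_file "$PWD" samples.txt lists the samples in the current directory by full path.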

Ingest to an Existing Dataset

In this example we'll ingest 5 new samples into the dataset from the previous example. Suppose they are named new1.bcf through new5.bcf. This can be done quite simply:

$ tiledbvcf register --uri my-dataset new*.bcf
$ tiledbvcf store --uri my-dataset new*.bcf

The only difference here is the absence of a create step.
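Since registration must precede storage, the two commands can be wrapped so that store never runs after a failed register. This is a minimal sketch; the function name ingest_samples is hypothetical, and it uses only the register and store invocations shown above.

```shell
# Register and then store the given sample files into an existing dataset.
# Returns early if registration fails, so store is not attempted.
ingest_samples() {
  uri=$1
  shift
  tiledbvcf register --uri "$uri" "$@" || return 1
  tiledbvcf store --uri "$uri" "$@"
}
```

The example above then becomes ingest_samples my-dataset new*.bcf.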

Ingest From S3

Commonly you will want to ingest samples that reside on S3. TileDB-VCF handles ingesting samples from S3 transparently. The only difference is that some local disk space must be allocated for TileDB-VCF to use during ingestion.

Suppose we want to ingest 5 more samples to the dataset created in the previous examples, and they are located on S3. We'll assume the samples are located at the URIs s3://my-bucket/1.bcf through s3://my-bucket/5.bcf. We can simply pass these URIs to TileDB-VCF, along with two parameters specifying the scratch space that can be used:

$ cat samples.txt
s3://my-bucket/1.bcf
s3://my-bucket/2.bcf
...
s3://my-bucket/5.bcf
$ tiledbvcf register \
--uri my-dataset \
--scratch-dir /tmp \
--scratch-mb 1024 \
--samples-file samples.txt
$ tiledbvcf store \
--uri my-dataset \
--scratch-dir /tmp \
--scratch-mb 1024 \
--samples-file samples.txt

Here we have specified that 1GB (1024MB) of disk space in the /tmp directory can be used by the ingestion process.

When ingesting samples from S3, you must configure enough disk scratch space to hold at least 20 samples (in general, 2 * row_tile_extent samples).
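The sizing rule can be turned into a rough estimate for --scratch-mb. The sketch below is an assumption-laden illustration, not the tool's own calculation: the function name scratch_mb_estimate is hypothetical, it sizes every sample by the largest one, and its default row tile extent of 10 is inferred from the "20 samples" figure above.

```shell
# Estimate a --scratch-mb value with room for 2 * row_tile_extent samples.
# $1: size of the largest sample in MB
# $2: row tile extent of the dataset (defaults to 10, an inferred value)
scratch_mb_estimate() {
  largest_sample_mb=$1
  row_tile_extent=${2:-10}
  echo $(( 2 * row_tile_extent * largest_sample_mb ))
}
```

The result can be passed directly, as in --scratch-mb "$(scratch_mb_estimate 50)".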

Export

The TileDB-VCF export process takes a list of sample names and a set of genomic regions, and outputs all VCF/BCF records in the dataset that belong to the specified samples and intersect any of the given regions.

The tiledbvcf export command is used to export data using the command line. It can produce output in several forms:

  • VCF (or BCF), producing one .bcf output file per exported sample

  • TSV, producing a single tab-separated text file containing all intersecting records across the exported samples

  • Count-only, producing no output file, only a count of the total number of intersecting records across the exported samples

The following examples demonstrate basic usage of the CLI for export; see the tiledbvcf export --help page for more information.

Export to BCF

In this example we will export several genomic regions from the my-dataset dataset created in the above ingestion examples:

$ tiledbvcf export \
--uri my-dataset \
--regions chr1:1000-2000,chr2:500-501 \
--samples sampleA,sampleB,sampleD

This will produce three output files, sampleA.bcf, sampleB.bcf and sampleD.bcf, each containing the records for that sample that intersect either chr1:1000-2000 or chr2:500-501. If a sample did not have any intersecting records, the output file is still created, but it will contain 0 records.

If no regions are specified, each sample is exported in its entirety. For example, to recover the original BCFs for two samples:

$ tiledbvcf export --uri my-dataset --samples sampleA,sampleC

This will produce two output BCFs, one per specified sample, each containing that sample's original records and BCF header.

The files will not be completely identical with the original BCFs due to small differences such as the inclusion of END fields for every record, when in the original BCF the END field may have been elided. Ordering/sorting of the INFO/FMT fields may also differ.

Export to TSV

Next we'll export in TSV form, selecting a few columns of interest:

$ tiledbvcf export \
--uri my-dataset \
--regions chr1:1000-2000,chr2:500-501 \
--samples sampleA,sampleB,sampleD \
--output-format t \
--tsv-fields SAMPLE,POS,I:END,Q:POS,Q:END \
--output-path exported.txt

This will produce a single output file exported.txt containing a row per record (from all selected samples) with several columns:

  • SAMPLE: The sample name containing the record

  • POS: The starting position of the intersecting record

  • I:END: The END INFO column, i.e. the end position of the intersecting record

  • Q:POS: The starting position of the query region that intersected the record

  • Q:END: The ending position of the query region that intersected the record

The Q: fields identify which query region matched each record. In this example, they hold the start and end positions of either chr1:1000-2000 or chr2:500-501.

The tiledbvcf export --help page lists the other available columns for TSV export.
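Because the TSV output is plain tab-separated text, it can be summarized with standard tools. The following sketch counts exported records per sample. The function name count_by_sample is hypothetical, and the sketch assumes that SAMPLE is the first column and that the first line of the file is a header row; drop the NR > 1 condition if your output has no header.

```shell
# Count records per sample in a tab-separated export file.
# Assumptions: column 1 is SAMPLE and line 1 is a header row.
count_by_sample() {
  awk -F '\t' 'NR > 1 { n[$1]++ }
               END { for (s in n) printf "%s\t%d\n", s, n[s] }' "$1" | sort
}
```

For the export above, the call would be count_by_sample exported.txt.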