Both sample ingestion and reading/export with TileDB-VCF are inherently parallelizable operations. The Spark and Dask integrations offer parallelizable in-memory reads. With the AWS Batch integration, TileDB-VCF can perform multi-node ingestion and export of data to/from BCF files.
The AWS Batch integration consists of two scripts that wrap the execution of the `tiledbvcf` command-line client. The `tiledbvcf` CLI offers the same export partitioning arguments as Spark and Dask, namely the ability to distribute work across the samples and/or regions being exported. For ingestion, the list of samples to be ingested can be split into multiple disjoint lists, with each ingestion process invoked on a different list of samples.
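As an illustration of the sample-splitting idea, a list of sample URIs can be divided into disjoint, roughly equal batches before handing one batch to each ingestion process. This is only a sketch (the helper name and round-robin scheme are illustrative, not part of the TileDB-VCF API):

```python
# Sketch: split a list of sample URIs into `num_jobs` disjoint batches.
# Round-robin assignment keeps batch sizes within one sample of each other.

def partition_samples(samples, num_jobs):
    """Return num_jobs disjoint lists that together cover all samples."""
    batches = [[] for _ in range(num_jobs)]
    for i, sample in enumerate(samples):
        batches[i % num_jobs].append(sample)
    return batches

samples = [f"s3://my-bucket/vcfs/sample{i}.bcf" for i in range(25)]
batches = partition_samples(samples, num_jobs=4)
```

Each batch could then be written to its own file and passed to a separate `tiledbvcf store` invocation.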
TileDB-VCF provides two wrapper scripts (`batch-ingest.py` for ingestion and `batch-export.py` for export) that interface with an existing AWS Batch environment, divvy up the work, and start the Batch jobs.
To run these scripts, first create an environment by navigating to the `/path/to/TileDB-VCF/apis/aws-batch` directory and executing:

```bash
python3 -m venv venv && source venv/bin/activate
pip install -r requirements.txt
```

This will set up a Python virtual environment with the requirements needed to run the wrapper scripts.
The most basic usage of AWS Batch for distributed ingestion specifies a destination dataset (which will be created if it does not yet exist), a file containing a list of sample URIs to be ingested, and the number of Batch jobs to enqueue.
To ingest a list of samples:

```bash
python batch-ingest.py \
  --dataset-uri s3://my-bucket/my-dataset \
  --samples samples.txt \
  --metadata-s3 my-metadata-bucket \
  --job-queue tiledbvcf-job-queue \
  --job-definition tiledbvcf-job-defn \
  --num-jobs 10 \
  --wait
```
This will split the lines in `samples.txt` into 10 batches, upload those batches (stored in separate files) to the given "metadata" bucket `my-metadata-bucket`, and enqueue 10 Batch jobs in the given queue, each responsible for ingesting one of the sample batches. The `--wait` argument causes the `batch-ingest.py` script to wait until all jobs have finished (with success or failure) before exiting.
If you recall from the description of the create/register/store phases of ingestion, the registration phase must be executed serially. Therefore the above command will actually enqueue 12 jobs total: 1 job performing the create step, 1 job performing the register step for all given samples, and 10 jobs performing the store steps. The register job depends on the create job, and the store jobs depend on the register job. Only the store jobs execute in parallel.
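The resulting dependency chain can be pictured as a small job graph, where the register job depends on the create job and every store job depends on the register job. The sketch below is purely illustrative (the job names and dictionary structure are assumptions, not the actual payloads `batch-ingest.py` submits to AWS Batch):

```python
# Illustrative job graph for a 10-partition ingestion.
# Mirrors the ordering described above; not real AWS Batch payloads.

NUM_STORE_JOBS = 10

jobs = [{"name": "create", "depends_on": []}]
jobs.append({"name": "register", "depends_on": ["create"]})
for i in range(NUM_STORE_JOBS):
    jobs.append({"name": f"store-{i}", "depends_on": ["register"]})

# 1 create job + 1 register job + 10 store jobs = 12 jobs total.
total_jobs = len(jobs)
# Only the store jobs share a dependency and can run in parallel.
parallel_jobs = [j["name"] for j in jobs if j["depends_on"] == ["register"]]
```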
With AWS Batch, it is perfectly allowable to enqueue more jobs than you have compute resources for; the extra jobs will simply be dequeued as nodes become available to execute them.
Exporting with AWS Batch uses the `batch-export.py` script, which takes many of the same arguments as `batch-ingest.py`.
To export a list of samples (whose names are listed in the file `s3://my-bucket/sample-names.txt`) using a BED file:

```bash
python batch-export.py \
  --dataset-uri s3://my-bucket/my-dataset \
  --dest-s3 s3://my-bucket/exported-bcfs/ \
  --samples s3://my-bucket/sample-names.txt \
  --bed s3://my-bucket/to-export.bed \
  --job-queue tiledbvcf-job-queue \
  --job-definition tiledbvcf-job-defn \
  --num-jobs 10 \
  --wait
```
The exported BCFs (one per sample) will be uploaded to the prefix `s3://my-bucket/exported-bcfs/` upon completion of each job.
Because one BCF is produced per exported sample, the `--num-jobs` parameter determines the partitioning over the samples being exported. If there are 100 sample names listed in `sample-names.txt`, then specifying 10 jobs results in each job exporting 10 samples, using the specified BED file.
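The partition arithmetic can be sketched as follows. This assumes contiguous, ceiling-division chunks; the exact splitting scheme used by `batch-export.py` is an assumption here:

```python
import math

# Illustrative partitioning of 100 sample names across 10 export jobs.
# Assumes contiguous ceiling-division chunks (an assumption, not the
# documented behavior of batch-export.py).
sample_names = [f"sample{i:03d}" for i in range(100)]
num_jobs = 10

chunk = math.ceil(len(sample_names) / num_jobs)
partitions = [sample_names[i:i + chunk]
              for i in range(0, len(sample_names), chunk)]
```

With 100 samples and 10 jobs, each partition contains exactly 10 sample names; an uneven count would leave the final partition smaller than the rest.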