AWS Batch

Both sample ingestion and reading/export with TileDB-VCF are inherently parallelizable operations. The Spark and Dask integrations offer parallelizable in-memory reads. With the AWS Batch integration, TileDB-VCF can perform multi-node ingestion and export of data to/from BCF files.

The AWS Batch integration consists of two scripts that wrap the execution of the tiledbvcf command-line client. The tiledbvcf CLI offers the same export partitioning arguments as Spark and Dask, namely the ability to distribute the export across samples and/or genomic regions. For ingestion, the full list of samples can be split into multiple disjoint lists, with each ingestion process invoked on a different list (as sketched below).
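
As a rough illustration of that work division (not the scripts' actual implementation), splitting a sample list into disjoint batches might look like the following Python sketch; the round-robin assignment and the samples.txt file name are assumptions.

# Minimal sketch of dividing a sample list into N disjoint batches, one per
# ingestion job. The round-robin scheme here is an assumption; batch-ingest.py
# may divide the work differently.
def split_into_batches(samples, num_jobs):
    batches = [[] for _ in range(num_jobs)]
    for i, sample in enumerate(samples):
        batches[i % num_jobs].append(sample)
    return [b for b in batches if b]

with open("samples.txt") as f:
    samples = [line.strip() for line in f if line.strip()]

for i, batch in enumerate(split_into_batches(samples, 10)):
    print(f"batch {i}: {len(batch)} samples")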

TileDB-VCF provides two wrapper scripts (batch-ingest.py for ingestion and batch-export.py for export) that interface with an existing AWS Batch environment, divide up the work, and start the Batch jobs.

To run these scripts, first set up an environment: navigate to the /path/to/TileDB-VCF/apis/aws-batch directory and execute:

python3 -m venv venv && source venv/bin/activate
pip install -r requirements.txt

This sets up a Python virtual environment with the dependencies needed to run the batch-{ingest,export}.py scripts.

Ingestion

The most basic usage of AWS Batch for distributed ingestion specifies a destination dataset (which will be created if it does not yet exist), a file containing a list of sample URIs to be ingested, and the number of Batch jobs to enqueue.

To ingest a list of samples:

python batch-ingest.py \
--dataset-uri s3://my-bucket/my-dataset \
--samples samples.txt \
--metadata-s3 my-metadata-bucket \
--job-queue tiledbvcf-job-queue \
--job-definition tiledbvcf-job-defn \
--num-jobs 10 \
--wait

This will split the lines in samples.txt into 10 batches, upload each batch as a separate file to the given "metadata" bucket (my-metadata-bucket), and enqueue 10 Batch jobs in the given queue, each responsible for ingesting one batch of samples.
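
A minimal sketch of that upload step, assuming boto3 is available; the key layout under the metadata bucket is an assumption, not the script's actual naming scheme:

# Hypothetical sketch of uploading per-batch sample lists to the metadata
# bucket with boto3. Bucket name matches --metadata-s3 above; the key prefix
# and slicing scheme are illustrative only.
import boto3

s3 = boto3.client("s3")
metadata_bucket = "my-metadata-bucket"
num_jobs = 10

with open("samples.txt") as f:
    samples = [line.strip() for line in f if line.strip()]

for i in range(num_jobs):
    batch = samples[i::num_jobs]                      # disjoint subset for job i
    key = f"tiledbvcf-ingest/samples-{i}.txt"         # assumed key layout
    s3.put_object(Bucket=metadata_bucket, Key=key,
                  Body=("\n".join(batch) + "\n").encode("utf-8"))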

The --wait argument will cause the batch-ingest.py script to wait until all jobs have finished (success or failure) before exiting.
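
For reference, the waiting behavior amounts to a polling loop along these lines; this is a sketch using boto3's describe_jobs, and the job IDs below are placeholders:

# Sketch of polling AWS Batch until every submitted job reaches a terminal
# state (SUCCEEDED or FAILED). Job IDs are placeholders for the IDs returned
# by submit_job.
import time
import boto3

batch = boto3.client("batch")
job_ids = ["<job-id-1>", "<job-id-2>"]

while True:
    resp = batch.describe_jobs(jobs=job_ids)
    statuses = [j["status"] for j in resp["jobs"]]
    if all(s in ("SUCCEEDED", "FAILED") for s in statuses):
        break
    time.sleep(30)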

If you recall from the description of the create/register/store phases of ingestion, the registration phase must be executed serially. Therefore the above command will actually enqueue 12 jobs total: 1 job performing the create step, 1 job performing the register step for all given samples, and 10 jobs performing the store steps. The register job depends on the create job, and the store jobs depend on the register job. Only the store jobs execute in parallel.
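
A sketch of how such a dependency chain can be expressed with the AWS Batch API, using boto3's submit_job and its dependsOn argument; the job names are illustrative and the container commands each job would actually run are omitted:

# Sketch of the create -> register -> store job dependency chain.
# Queue and definition names match the command above; everything else
# (job names, omitted containerOverrides) is illustrative.
import boto3

batch = boto3.client("batch")
queue, defn = "tiledbvcf-job-queue", "tiledbvcf-job-defn"

create = batch.submit_job(jobName="tiledbvcf-create",
                          jobQueue=queue, jobDefinition=defn)

register = batch.submit_job(jobName="tiledbvcf-register",
                            jobQueue=queue, jobDefinition=defn,
                            dependsOn=[{"jobId": create["jobId"]}])

store_jobs = [
    batch.submit_job(jobName=f"tiledbvcf-store-{i}",
                     jobQueue=queue, jobDefinition=defn,
                     dependsOn=[{"jobId": register["jobId"]}])
    for i in range(10)    # only these store jobs run in parallel
]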

With AWS Batch, it is perfectly acceptable to enqueue more jobs than you have compute resources for; the extra jobs are simply dequeued as nodes become available to execute them.

The files uploaded to the metadata bucket are automatically cleaned up on successful termination of each job. However, if a job fails, the metadata files may not be cleaned up, depending on the phase of execution in which the job failed.
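
If leftover files do accumulate after a failed run, they can be removed manually. A minimal sketch with boto3, assuming a hypothetical tiledbvcf-ingest/ prefix (adjust to whatever keys the scripts actually wrote):

# Sketch of cleaning up leftover batch files after a failed run.
# The prefix is an assumption; verify the actual keys before deleting.
import boto3

s3 = boto3.client("s3")
bucket, prefix = "my-metadata-bucket", "tiledbvcf-ingest/"

paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
    for obj in page.get("Contents", []):
        s3.delete_object(Bucket=bucket, Key=obj["Key"])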

Export

Exporting with AWS Batch uses the batch-export.py script, which takes many of the same arguments as batch-ingest.py.

To export a list of samples (whose names are listed in the file s3://my-bucket/sample-names.txt) using a BED file:

python batch-export.py \
--dataset-uri s3://my-bucket/my-dataset \
--dest-s3 s3://my-bucket/exported-bcfs/ \
--samples s3://my-bucket/sample-names.txt \
--bed s3://my-bucket/to-export.bed \
--job-queue tiledbvcf-job-queue \
--job-definition tiledbvcf-job-defn \
--num-jobs 10 \
--wait

The exported BCFs (one per sample) are uploaded to the prefix s3://my-bucket/exported-bcfs/ as each job completes.

Because one BCF is produced per exported sample, the --num-jobs parameter determines how the samples are partitioned across jobs. If sample-names.txt lists 100 sample names, then specifying 10 jobs results in each job exporting 10 samples, using the specified BED file.
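
Purely to illustrate the arithmetic (batch-export.py performs this partitioning itself); the sample names below are stand-ins for the contents of sample-names.txt:

# Illustrative only: how 100 sample names divide across 10 export jobs,
# yielding one BCF per assigned sample.
sample_names = [f"sample-{i:03d}" for i in range(100)]
num_jobs = 10

for job in range(num_jobs):
    assigned = sample_names[job::num_jobs]            # disjoint subset for this job
    print(f"job {job}: exports {len(assigned)} samples -> {len(assigned)} BCFs")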