AWS Batch

The tiledbvcf CLI can be used with AWS Batch to easily scale both ingestion of VCF/BCF files into TileDB and export from TileDB back to VCF/BCF. If you are not familiar with AWS Batch, the following high-level steps should be enough of a guide to get a basic environment set up. For details on AWS Batch terminology and configuration, see the AWS Batch documentation.

1. Create a Custom AMI

The ingestion and export process with tiledbvcf can require a large amount of local scratch space to store the VCF/BCF files. The AMI you create here provides a pre-mounted scratch volume at a known path that is accessible inside the container.

  1. Create a new EC2 instance using the AWS-provided "Amazon ECS-Optimized Amazon Linux 2 AMI" (the instance type does not matter).

  2. In the "Add Storage" step, add a large (e.g. 1TB) EBS volume. Make sure to enable "Delete on Termination".

  3. Launch the instance and SSH into it. Then run the following commands:

    # Update packages, then format and mount the EBS scratch volume at /data
    sudo yum -y update
    sudo mkfs -t ext4 /dev/nvme1n1
    sudo mkdir /data
    echo -e '/dev/nvme1n1\t/data\text4\tdefaults\t0\t0' | sudo tee -a /etc/fstab
    sudo mount -a
    sudo chmod 0777 /data
    # Stop the ECS agent and clear its state so instances launched from the AMI register cleanly
    sudo systemctl stop ecs
    sudo rm -rf /var/lib/ecs/data/ecs_agent_data.json
  4. Log out of the instance and create an AMI from it while it is still running (a CLI example follows this list).

  5. Terminate the instance, as it is no longer needed.
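If you prefer the CLI for step 4, an image can be created from the running instance with something like the following; the instance ID is a placeholder for the instance you launched above:

$ aws --region us-east-1 ec2 create-image \
    --instance-id <instance-id> \
    --name tiledbvcf-batch-ami

The ImageId returned is the AMI ID you will reference in Step 3.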

The AWS Batch compute environment for TileDB-VCF will use this AMI. The batch-ingest.py and batch-export.py scripts are responsible for passing the correct EBS volume path (/data) and size to the tiledbvcf CLI.

Make sure "Delete on Termination" is enabled for the attached EBS volume; otherwise, the EBS volumes will not be removed when Batch jobs terminate.

2. Deploy the Docker Image

To build the TileDB-VCF Docker image containing the CLI:

$ cd TileDB-VCF/
$ docker build -f docker/Dockerfile-cli -t tiledbvcf-cli libtiledbvcf

To be usable in AWS Batch, you must then push the Docker image to ECR, e.g.:

$ aws --region us-east-1 ecr create-repository --repository-name tiledbvcf
{
    "repository": {
        "registryId": "<reg-id>",
        "repositoryName": "tiledbvcf",
        "repositoryArn": "arn:aws:ecr:us-east-1:<reg-id>:repository/tiledbvcf",
        "createdAt": 1543352113.0,
        "repositoryUri": "<reg-id>.dkr.ecr.us-east-1.amazonaws.com/tiledbvcf"
    }
}
$ $(aws ecr get-login --no-include-email --region us-east-1)
$ docker tag tiledbvcf-cli:latest <reg-id>.dkr.ecr.us-east-1.amazonaws.com/tiledbvcf:latest
$ docker push <reg-id>.dkr.ecr.us-east-1.amazonaws.com/tiledbvcf:latest

Where <reg-id> refers to the value of registryId.
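Note that aws ecr get-login was removed in version 2 of the AWS CLI. If you are on CLI v2, authenticate to ECR with get-login-password instead:

$ aws ecr get-login-password --region us-east-1 | \
    docker login --username AWS --password-stdin <reg-id>.dkr.ecr.us-east-1.amazonaws.com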

3. Create the Compute Environment

On the AWS web console, find the Batch service and click "Create compute environment". Here are some example configuration choices:

  • Allowed instance types: m5.4xlarge

  • Minimum vCPUs: 0

  • Desired vCPUs: 0

  • Maximum vCPUs: 512

Check the "Enable user-specified AMI ID" box and paste the ID of the AMI you created in Step 1.

4. Create the Job Definition

The following definition specifies that jobs receive 16 vCPUs and 64 GB of memory, with a 24-hour timeout.

$ aws --region us-east-1 batch register-job-definition \
--job-definition-name tiledbvcf-24h \
--type container \
--timeout '{"attemptDurationSeconds": 86400}' \
--container-properties '{"image": "<reg-id>.dkr.ecr.us-east-1.amazonaws.com/tiledbvcf:latest", "vcpus": 16, "memory": 65536, "volumes": [{"host": {"sourcePath": "/data"}, "name": "data"}], "mountPoints": [{"containerPath": "/data", "readOnly": false, "sourceVolume": "data"}]}'

Set the image value to the URI of the image you pushed in Step 2. Note that the mount point path (/data) used in this job definition is what the scratch_path value should be set to in batch-ingest.py.
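To confirm the job definition registered correctly, you can inspect it with:

$ aws --region us-east-1 batch describe-job-definitions \
    --job-definition-name tiledbvcf-24h --status ACTIVE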

Next, using the AWS web console, create an IAM role that the tasks will assume to access S3:

  • Service using role: "Elastic Container Service"

  • Use case: "Elastic Container Service Task"

  • Attached policy: "AmazonS3FullAccess"

Give the role a descriptive name such as tiledb-vcf-batch-s3-access. Finally, find the job definition in the web console (in the Batch service). Create a new revision of the job definition, and attach the IAM role.
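If you prefer the CLI for this step, the console actions above correspond roughly to the following calls; the trust policy allows ECS tasks to assume the role:

$ cat > ecs-tasks-trust.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": {"Service": "ecs-tasks.amazonaws.com"},
    "Action": "sts:AssumeRole"
  }]
}
EOF
$ aws iam create-role \
    --role-name tiledb-vcf-batch-s3-access \
    --assume-role-policy-document file://ecs-tasks-trust.json
$ aws iam attach-role-policy \
    --role-name tiledb-vcf-batch-s3-access \
    --policy-arn arn:aws:iam::aws:policy/AmazonS3FullAccess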

5. Create the Job Queue

Create the job queue, replacing <compute-env-arn> with the ARN of the compute environment you created in Step 3.

$ aws --region us-east-1 batch create-job-queue \
--job-queue-name tiledbvcf-job-queue \
--state ENABLED \
--priority 1 \
--compute-environment-order '[{"computeEnvironment": "<compute-env-arn>", "order": 1}]'

You should now be ready to launch batch jobs with batch-ingest.py and batch-export.py. Both of these scripts take the job queue and definition names as required arguments.
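As a quick end-to-end check before running the scripts, you can also submit a trivial job directly against the queue and job definition, assuming the tiledbvcf binary is on the image's PATH:

$ aws --region us-east-1 batch submit-job \
    --job-name tiledbvcf-smoke-test \
    --job-queue tiledbvcf-job-queue \
    --job-definition tiledbvcf-24h \
    --container-overrides '{"command": ["tiledbvcf", "--help"]}'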