TileDB Open Source provides native support for reading from and writing to cloud object stores such as AWS S3, Google Cloud Storage, and Microsoft Azure Blob Storage. This guide covers some considerations for using TileDB-VCF with these services. The examples focus exclusively on S3, which is the most widely used, but any of these services can be substituted, as can on-premises services such as MinIO that provide S3-compatible APIs.
The process of creating a TileDB-VCF dataset on S3 is nearly identical to creating a local dataset. The only difference is that an s3:// address is passed to the --uri argument rather than a local file path.
This also works when querying a TileDB-VCF dataset located on S3.
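As a minimal sketch of the point above, the commands below create a dataset at an S3 address and then list its samples through the same --uri argument. The bucket and dataset names are hypothetical placeholders, and each command is prefixed with echo so that it is only printed; remove the echo to run it against a bucket you can access.

```shell
# Hypothetical S3 location for the dataset; replace with your own bucket.
DATASET_URI="s3://my-bucket/my_dataset"

# Dry run: print the create command. Remove "echo" to execute it.
echo tiledbvcf create --uri "$DATASET_URI"

# Reading works the same way: pass the same S3 address to --uri.
echo tiledbvcf list --uri "$DATASET_URI"
```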
VCF files located on S3 can be ingested directly into a TileDB-VCF dataset using one of two approaches.
The first approach is the easiest: pass the tiledbvcf store command a list of S3 URIs and TileDB-VCF takes care of the rest.
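A sketch of this first approach might look like the following. The bucket, dataset, and sample file names are hypothetical, and the command is printed with echo rather than executed, so it can be checked before running it for real.

```shell
# Hypothetical dataset and VCF locations; replace with your own.
DATASET_URI="s3://my-bucket/my_dataset"
SAMPLE_A="s3://my-bucket/vcfs/sampleA.vcf.gz"
SAMPLE_B="s3://my-bucket/vcfs/sampleB.vcf.gz"

# Dry run: print the store command. Remove "echo" to start the ingestion.
echo tiledbvcf store --uri "$DATASET_URI" "$SAMPLE_A" "$SAMPLE_B"
```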
In this approach, remote VCF index files (which are relatively small) are downloaded locally, allowing TileDB-VCF to retrieve chunks of variant data from the remote VCF files without downloading them in full. By default, index files are downloaded to your current working directory; however, you can store them in a different location (e.g., a temporary directory) using the --scratch-dir argument.
The second approach is to download batches of VCF files in their entirety before ingestion, which may slightly improve ingestion performance. This approach requires allocating scratch disk space to TileDB-VCF using the --scratch-mb and --scratch-dir arguments.
The number of VCF files downloaded at a time is determined by the --sample-batch-size parameter, which defaults to 10. Downloading and ingestion happen asynchronously, so, for example, batch 3 will be downloaded while batch 2 is being ingested. As a result, you must configure enough scratch space to store at least two batches at once: 20 samples, assuming a batch size of 10.
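The scratch space requirement can be estimated with a little arithmetic. The sketch below assumes a hypothetical largest VCF file size of 200 MB and hypothetical bucket and file names; measure your own files before choosing a value. The final command is printed with echo rather than executed.

```shell
# Assumed values for illustration only.
BATCH_SIZE=10      # same as the --sample-batch-size default
MAX_VCF_MB=200     # hypothetical size of the largest VCF file, in MB

# Two batches are on disk at once: one being ingested, one being downloaded.
SCRATCH_MB=$((2 * BATCH_SIZE * MAX_VCF_MB))

# Temporary directory to hold the downloaded files.
SCRATCH_DIR="$(mktemp -d)"

# Dry run: print the store command. Remove "echo" to start the ingestion.
echo tiledbvcf store --uri s3://my-bucket/my_dataset \
  --sample-batch-size "$BATCH_SIZE" \
  --scratch-mb "$SCRATCH_MB" \
  --scratch-dir "$SCRATCH_DIR" \
  s3://my-bucket/vcfs/sampleA.vcf.gz
```

With these assumed numbers the estimate is 2 × 10 × 200 = 4000 MB; treat it as a lower bound and leave some headroom.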
For TileDB to access a remote storage bucket, you must be properly authenticated on the machine running TileDB. For S3, this means having access to the appropriate AWS access key ID and secret access key. This typically happens in one of three ways:
If the AWS Command Line Interface (CLI) is installed on your machine, running aws configure will store your credentials in a local profile that TileDB can access. You can verify that the CLI has already been configured by running aws s3 ls. If properly configured, this will output a list of the S3 buckets you (and thus TileDB) can access.
You can pass your AWS access key ID and secret access key to TileDB-VCF directly via the --tiledb-config argument, which expects a comma-separated string of parameter=value pairs.
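As a sketch, the comma-separated string can be assembled from the TileDB S3 configuration parameters vfs.s3.aws_access_key_id and vfs.s3.aws_secret_access_key. The credential values and bucket name below are placeholders, not real keys, and the command is printed with echo rather than executed.

```shell
# Placeholder credentials; substitute your own values.
KEY_ID="AKIAEXAMPLEKEYID"
SECRET_KEY="example-secret-access-key"

# Comma-separated parameter=value pairs, with no spaces around the comma.
TILEDB_CONFIG="vfs.s3.aws_access_key_id=${KEY_ID},vfs.s3.aws_secret_access_key=${SECRET_KEY}"

# Dry run: print the command. Remove "echo" to execute it.
echo tiledbvcf create --uri s3://my-bucket/my_dataset --tiledb-config "$TILEDB_CONFIG"
```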
Your AWS credentials can also be passed to TileDB by defining the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables.
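For example, the variables can be exported in the shell session that will run TileDB-VCF. The values below are placeholders, not real keys. The last line is only a sanity check that a child process can see the variable; it prints "visible" without revealing the value.

```shell
# Placeholder credentials; substitute your own values.
export AWS_ACCESS_KEY_ID="AKIAEXAMPLEKEYID"
export AWS_SECRET_ACCESS_KEY="example-secret-access-key"

# Confirm the variable is exported to child processes.
sh -c 'echo "AWS_ACCESS_KEY_ID is ${AWS_ACCESS_KEY_ID:+visible}"'
```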