AWS S3

Last updated 9 months ago

Quickstart

If you have already:

saved your AWS credentials using the AWS CLI aws login command
installed TileDB-Py with pip install tiledb or mamba install -c conda-forge tiledb-py

Then the following should work for any bucket in the us-east-1 region:

import tiledb
# Open the array directly from S3 (bucket in us-east-1)
a = tiledb.open("s3://<bucket name>/array_name")

If the bucket is in another region, then use:

import tiledb
tiledb.default_ctx({
    "vfs.s3.region": "us-west-2" # replace with bucket region
})
a = tiledb.open("s3://<bucket name>/array_name")

Configuration Overview

This guide demonstrates how to use TileDB on AWS S3. After configuring TileDB to work with AWS S3, your TileDB programs will function properly without any API change. Instead of using local file system paths for referencing files (e.g., arrays, groups, VFS files), you must format your URIs to start with s3://. For instance, to create (and subsequently write/read) an array on AWS S3, use the URI s3://<your-bucket>/<your-array-name> as the array name.

AWS Setup

First, we need to set up an AWS account and generate access keys.

On the AWS console, click on the Services drop-down menu and select Storage->S3. You can create S3 buckets there.
On the AWS console, click on the Services drop-down menu and select Security, Identity & Compliance->IAM.
Click on Users from the left-hand side menu, and then click on the Add User button. Provide the email or username of the user you wish to add, select the Programmatic Access checkbox and click on Next: Permissions.
Click on the Attach existing policies directly button, search for the S3-related policies and add the policy of your choice (e.g., full-access, read-only, etc.). Click on Next and then Create User.
Note: using TileDB with an existing bucket requires at least the following S3 permissions:
  s3:ListBucket
  s3:GetObject
  s3:PutObject
  s3:ListBucketMultipartUploads
  s3:AbortMultipartUpload
  s3:ListMultipartUploadParts
  s3:DeleteObject
Upon successful creation, the next page will show the user along with two keys: Access key ID and Secret access key. Write down both these keys.
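
As an alternative to attaching a broad managed policy, the minimum permissions listed above could be expressed as a custom IAM policy. The following is a minimal sketch, not an official TileDB policy: the function name build_tiledb_s3_policy and the bucket name my-example-bucket are hypothetical, and the split between bucket-level and object-level resources follows the usual S3 convention for these actions.

```python
import json

# Actions from the list above, grouped by the resource type each applies to.
BUCKET_ACTIONS = ["s3:ListBucket", "s3:ListBucketMultipartUploads"]
OBJECT_ACTIONS = [
    "s3:GetObject",
    "s3:PutObject",
    "s3:DeleteObject",
    "s3:AbortMultipartUpload",
    "s3:ListMultipartUploadParts",
]


def build_tiledb_s3_policy(bucket: str) -> dict:
    """Return an IAM policy document (as a dict) scoped to a single bucket."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": BUCKET_ACTIONS,
                "Resource": f"arn:aws:s3:::{bucket}",
            },
            {
                "Effect": "Allow",
                "Action": OBJECT_ACTIONS,
                "Resource": f"arn:aws:s3:::{bucket}/*",
            },
        ],
    }


# "my-example-bucket" is a placeholder; substitute your own bucket name.
policy_json = json.dumps(build_tiledb_s3_policy("my-example-bucket"), indent=2)
print(policy_json)
```

The printed JSON can be pasted into the IAM console's policy editor in place of one of the predefined S3 policies.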

AWS Security Credentials

Access Keys

Access keys are long-term credentials for an IAM user or the AWS account root user. To access AWS resources this way, follow these steps:

1. Export these keys to your environment from a console:

Linux/macOS shell:

export AWS_ACCESS_KEY_ID=<your-access-key-id>
export AWS_SECRET_ACCESS_KEY=<your-secret-access-key>

Windows PowerShell:

$env:AWS_ACCESS_KEY_ID = "<your-access-key-id>"
$env:AWS_SECRET_ACCESS_KEY = "<your-secret-access-key>"

Windows Command Prompt:

set AWS_ACCESS_KEY_ID=<your-access-key-id>
set AWS_SECRET_ACCESS_KEY=<your-secret-access-key>

Parameter                         Default Value
"vfs.s3.aws_access_key_id"        ""
"vfs.s3.aws_secret_access_key"    ""
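
The two parameters above can also be set on a TileDB configuration instead of relying on environment variables. The sketch below shows one way to do that; the helper names access_key_config and open_array are hypothetical, and tiledb is imported lazily so that the dictionary-building part runs without TileDB installed.

```python
import os


def access_key_config(env=os.environ) -> dict:
    """Build TileDB S3 settings from the standard AWS environment variables.

    Returns an empty dict when either variable is missing, so the caller can
    fall back to whatever credentials are otherwise available.
    """
    key_id = env.get("AWS_ACCESS_KEY_ID", "")
    secret = env.get("AWS_SECRET_ACCESS_KEY", "")
    if not key_id or not secret:
        return {}
    return {
        "vfs.s3.aws_access_key_id": key_id,
        "vfs.s3.aws_secret_access_key": secret,
    }


def open_array(uri: str, settings: dict):
    """Open an array with a context built from `settings` (requires tiledb)."""
    import tiledb  # lazy import: access_key_config() works without TileDB

    return tiledb.open(uri, ctx=tiledb.Ctx(settings))
```

A call such as open_array("s3://<bucket name>/array_name", access_key_config()) would then use the keys explicitly. Avoid hard-coding real keys in source files.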

Session Tokens

Parameter                     Values
"vfs.s3.aws_session_token"    (session token corresponding to the configured key/secret pair)
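
Temporary credentials issued by AWS STS come as a key ID, a secret key, and a session token, and all three need to be configured together. The sketch below assumes a credentials mapping shaped like the Credentials object in an STS response (AccessKeyId, SecretAccessKey, SessionToken); the function name session_config and the sample values are hypothetical.

```python
def session_config(credentials: dict) -> dict:
    """Map an STS-style credentials mapping onto TileDB S3 parameters."""
    return {
        "vfs.s3.aws_access_key_id": credentials["AccessKeyId"],
        "vfs.s3.aws_secret_access_key": credentials["SecretAccessKey"],
        "vfs.s3.aws_session_token": credentials["SessionToken"],
    }


# Placeholder values only; real ones are issued by AWS STS and expire.
settings = session_config(
    {
        "AccessKeyId": "<temporary-access-key-id>",
        "SecretAccessKey": "<temporary-secret-access-key>",
        "SessionToken": "<session-token>",
    }
)
print(sorted(settings))
```

The resulting dictionary can be passed to tiledb.Ctx(settings). Because the token expires, a long-running program would need to rebuild the context with fresh credentials.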

Assume Role

Parameter                    Default Value
"vfs.s3.aws_role_arn"        Required - (The Amazon Resource Name (ARN) of the role to assume)
"vfs.s3.aws_session_name"    Optional - (An identifier for the assumed role session)
"vfs.s3.aws_external_id"     Optional - (A unique identifier that might be required when you assume a role in another account)
"vfs.s3.aws_load_freq"       Optional - (The duration, in minutes, of the role session)

Using this method, the IAM user credentials used by your proxy server only require the ability to call sts:AssumeRole. There is one extra step of creating a new role and attaching a trust policy to it, which the Session Tokens approach does not need.
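
The assume-role parameters can be collected in a small helper that enforces the required ARN and leaves unset optional values out. This is a sketch under assumptions: the function name assume_role_config and the sample ARN are hypothetical, and only the parameter names come from the table above.

```python
def assume_role_config(role_arn, session_name=None, external_id=None,
                       load_freq_minutes=None) -> dict:
    """Build TileDB S3 settings for assuming an IAM role."""
    if not role_arn:
        raise ValueError("role_arn is required")
    settings = {"vfs.s3.aws_role_arn": role_arn}
    optional = {
        "vfs.s3.aws_session_name": session_name,
        "vfs.s3.aws_external_id": external_id,
        "vfs.s3.aws_load_freq": load_freq_minutes,
    }
    for key, value in optional.items():
        if value is not None:
            settings[key] = str(value)  # config values are passed as strings
    return settings


# The ARN below is a made-up example, not a real role.
role_settings = assume_role_config(
    "arn:aws:iam::123456789012:role/example-role",
    load_freq_minutes=15,
)
print(role_settings)
```

Passing role_settings to tiledb.Ctx would then let TileDB request the role on your behalf, provided the base credentials are allowed to call sts:AssumeRole.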

Now you are ready to start writing TileDB programs! When creating a TileDB context or a VFS object, you need to set up a configuration object with the following parameters for AWS S3 (supposing that your S3 buckets are in region us-east-1; you can set an arbitrary region).

Parameter                          Value
"vfs.s3.scheme"                    "https"
"vfs.s3.region"                    "us-east-1"
"vfs.s3.endpoint_override"         ""
"vfs.s3.use_virtual_addressing"    "true"

The above configuration parameters are currently set as shown in TileDB by default. However, we suggest always checking whether the default values are the desired ones for your application.
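
One way to check these values is to compare a configuration against the settings you expect and report any differences. The sketch below is illustrative: the helper names unexpected_settings and default_tiledb_config are hypothetical, and the comparison works on any mapping that supports key lookup, such as a plain dict or a tiledb.Config.

```python
EXPECTED = {
    "vfs.s3.scheme": "https",
    "vfs.s3.region": "us-east-1",
    "vfs.s3.endpoint_override": "",
    "vfs.s3.use_virtual_addressing": "true",
}


def unexpected_settings(config, expected=EXPECTED) -> dict:
    """Return {parameter: (actual, expected)} for every value that differs."""
    return {
        key: (config[key], want)
        for key, want in expected.items()
        if config[key] != want
    }


def default_tiledb_config():
    """Return TileDB's default configuration (requires tiledb)."""
    import tiledb  # lazy import: the comparison above works without TileDB

    return tiledb.Config()


# Demonstration with a plain dict standing in for a real configuration.
sample = dict(EXPECTED, **{"vfs.s3.region": "eu-west-1"})
print(unexpected_settings(sample))
```

With TileDB installed, unexpected_settings(default_tiledb_config()) reports which defaults in your installed version differ from the table.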

Physical Organization

So far we explained that TileDB arrays and groups are stored as directories. There is no directory concept on S3 and other similar object stores. However, S3 uses character / in the object URIs which allows the same conceptual organization as a directory hierarchy in local storage. At a physical level, TileDB stores on S3 all the files it would create locally as objects. For instance, for array s3://bucket/path/to/array, TileDB creates array schema object s3://bucket/path/to/array/__array_schema.tdb, fragment metadata object s3://bucket/path/to/array/<fragment>/__fragment_metadata.tdb, and similarly all the other files/objects. Since there is no notion of a “directory” on S3, nothing special is persisted on S3 for directories, e.g., s3://bucket/path/to/array/<fragment>/ does not exist as an object.

$ aws s3 sync s3://my_bucket/my_array my_local_array

After downloading an array locally, your TileDB program will function properly by changing the array name from s3://my_bucket/my_array to my_local_array, without any other modification.

Performance

TileDB reads utilize the range GET request API of the AWS SDK, which retrieves only the requested (contiguous) bytes from a file/object, rather than downloading the entire file from the cloud. This results in extremely fast subarray reads, especially because of the array tiling. Recall that a tile (which groups cell values that are stored contiguously in the file) is the atomic unit of IO. The range GET API enables reading each tile from S3 in a single request. Finally, TileDB performs all reads in parallel using multiple threads, which is a tunable configuration parameter.
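
To make the idea of a range GET concrete, the sketch below computes the HTTP Range header that would fetch a single tile, given its byte offset and size within an object. It only illustrates the request shape and is not TileDB's internal code; the function name tile_range_header and the sample offset and size are hypothetical.

```python
def tile_range_header(offset: int, size: int) -> str:
    """Return the HTTP Range header value covering `size` bytes at `offset`.

    HTTP byte ranges are inclusive at both ends, hence the "- 1".
    """
    if offset < 0 or size <= 0:
        raise ValueError("offset must be >= 0 and size must be > 0")
    return f"bytes={offset}-{offset + size - 1}"


# A hypothetical 512-byte tile stored 1024 bytes into the object.
print(tile_range_header(1024, 512))  # bytes=1024-1535
```

Only those 512 bytes cross the network, regardless of how large the object is.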

Advanced

Proxy Server Settings

The AWS backend supports several settings for proxy servers:

It is necessary to override "vfs.s3.proxy_scheme" to `http` for most proxy setups. As of TileDB 2.0.8, the default is "https"; this will be updated in the next release.

Parameter                  Default    Description
"vfs.s3.proxy_host"        ""         The S3 proxy host.
"vfs.s3.proxy_port"        "0"        The S3 proxy port.
"vfs.s3.proxy_scheme"      "https"    The S3 proxy scheme.
"vfs.s3.proxy_username"    ""         The S3 proxy username.
"vfs.s3.proxy_password"    ""         The S3 proxy password.
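
The proxy parameters can be gathered in one helper that defaults the scheme to http, as recommended above. This is a sketch under assumptions: the function name proxy_config and the host proxy.example.com are hypothetical.

```python
def proxy_config(host: str, port: int, scheme: str = "http",
                 username: str = "", password: str = "") -> dict:
    """Build TileDB S3 proxy settings; `scheme` defaults to "http"."""
    if scheme not in ("http", "https"):
        raise ValueError("scheme must be 'http' or 'https'")
    return {
        "vfs.s3.proxy_host": host,
        "vfs.s3.proxy_port": str(port),  # config values are passed as strings
        "vfs.s3.proxy_scheme": scheme,
        "vfs.s3.proxy_username": username,
        "vfs.s3.proxy_password": password,
    }


# "proxy.example.com" is a placeholder host.
proxy_settings = proxy_config("proxy.example.com", 8080)
print(proxy_settings)
```

The dictionary can be merged with credential and region settings before it is passed to tiledb.Ctx.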

Logging

TileDB uses the AWS C++ SDK and cURL for access to S3. The AWS SDK logging level is a process-global setting. The configuration of the most recently constructed context will set the process state. Log files are written to the process working directory.

Parameter                 Values
"vfs.s3.logging_level"    [OFF], TRACE, DEBUG

Certificate Paths

Parameter           Value
"vfs.s3.ca_file"    [file path]
"vfs.s3.ca_path"    [directory path]

For debugging purposes only it is possible to disable SSL/TLS certificate verification:

Parameter              Default
"vfs.s3.verify_ssl"    [false], true
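
The certificate and verification parameters can be combined as in the sketch below, which only emits the settings that were explicitly requested. The function name tls_config and the sample certificate path are hypothetical.

```python
def tls_config(ca_file=None, ca_path=None, verify_ssl=None) -> dict:
    """Build TileDB S3 TLS settings, omitting anything left unset."""
    settings = {}
    if ca_file:
        settings["vfs.s3.ca_file"] = ca_file
    if ca_path:
        settings["vfs.s3.ca_path"] = ca_path
    if verify_ssl is not None:
        # Config values are strings; disable verification for debugging only.
        settings["vfs.s3.verify_ssl"] = "true" if verify_ssl else "false"
    return settings


# The path below is a placeholder for your own CA bundle.
print(tls_config(ca_file="/path/to/ca-bundle.crt"))
print(tls_config(verify_ssl=False))
```

Pointing to a CA bundle is generally preferable to turning verification off, since the latter removes protection against impersonated endpoints.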
