1 of 7

Backends

TileDB Open Source abstracts all the supported backends. It takes appropriate care to optimize IO for each backend using the corresponding SDK, and nicely abstracts the backends behind a , so it is fairly easy for us to add new backend support in the future (e.g., NVMe) properly optimizing for it. Currently the following backends are supported (in addition to any POSIX local filesystem):

AWS S3

Quickstart

If you have already:

saved your AWS credentials using the AWS CLI aws login command
installed TileDB-Py with pip install tiledb or mamba install -c conda-forge tiledb-py

Then the following should work for any bucket in the us-east-1 region:

import tiledb
# 
tiledb.open("s3://<bucket name>/array_name

If the bucket is in another region, then use:

import tiledb
tiledb.default_ctx({
    "vfs.s3.region": "us-west-2" # replace with bucket region
})
a = tiledb.open("s3://<bucket name>/array_name")

Configuration Overview

This is a simple guide that demonstrates how to use TileDB on AWS S3. After configuring TileDB to work with AWS S3, your TileDB programs will function properly without any API change! Instead of using local file system paths for referencing files (e.g. arrays, groups, VFS files) use must format your URIs to start with s3://. For instance, if you wish to create (and subsequently write/read) an array on AWS S3, you use URI s3://<your-bucket>/<your-array-name> for the array name.

AWS Setup

First, we need to set up an AWS account and generate access keys.

Create a new AWS account.
Visit the AWS console and sign in.
On the AWS console, click on the Services drop-down menu and select Storage->S3. You can create S3 buckets there.
On the AWS console, click on the Services drop-down menu and select Security, Identity & Compliance->IAM.
Click on Users from the left-hand side menu, and then click on the Add User button. Provide the email or username of the user you wish to add, select the Programmatic Access checkbox and click on Next: Permissions.
Click on the Attach existing policies directly button, search for the S3-related policies and add the policy of your choice (e.g., full-access, read-only, etc.). Click on Next and then Create User.
1. Using TileDB with an existing bucket requires at least the following S3 permissions:
  s3:ListBucket s3:GetObject s3:PutObject s3:ListBucketMultipartUploads s3:AbortMultipartUpload s3:ListMultipartUploadParts s3:DeleteObject
Upon successful creation, the next page will show the user along with two keys: Access key ID and Secret access key. Write down both these keys.

AWS Security Credentials

There are multiple supported ways to access resources on AWS. You can either request access with long-term access credentials (e.g. Access Keys) or temporary ones by utilizing AWS Security Token Service.

Access Keys

Access keys are long-term credentials for an IAM user or the AWS account root user. To be able to access AWS resource this way you need to follow the next steps:

1. Export these keys to your environment from a console:

export AWS_ACCESS_KEY_ID=<your-access-key-id>
export AWS_SECRET_ACCESS_KEY=<your-secret-access-key>

$env:AWS_ACCESS_KEY_ID = "<your-access-key-id>"
$env:AWS_SECRET_ACCESS_KEY = "<your-secret-access-key>"

set AWS_ACCESS_KEY_ID=<your-access-key-id>
set AWS_SECRET_ACCESS_KEY=<your-secret-access-key>

2. Or, set the following keys in a configuration object (see Configuration):

Parameter

Default Value

"vfs.s3.aws_access_key_id"

""

"vfs.s3.aws_secret_access_key"

""

Session Tokens

TileDB (version 1.8+) supports authentication with temporary credentials from the AWS Session Token. This method of acquiring temporary credentials is preferred in case you want to maintain permissions solely within your organization

Parameter

Values

"vfs.s3.aws_session_token"

(session token corresponding to the configured key/secret pair)

Assume Role

TileDB (version 2.1+) supports authentication with temporary credentials from the AWS Assume Role. If you prefer to maintain permissions within AWS, this method's base permissions for the temporary credentials will be derived from the policy on a role. You can use them to access AWS resources, that you might not normally have access to. In this case you will need to configure the following:

Parameter

Default Value

"vfs.s3.aws_role_arn"

Required - (The Amazon Resource Name (ARN) of the role to assume)

"vfs.s3.aws_session_name"

Optional - (An identifier for the assumed role session)

"vfs.s3.aws_external_id"

Optional - (A unique identifier that might be required when you assume a role in another account)

"vfs.s3.aws_load_freq"

Optional - (The duration, in minutes, of the role session)

Using this method, the IAM user credentials used by your proxy server only requires the ability to call sts:AssumeRole. There is one extra step of creating a new role and attaching trust policy to it, which is not the case with Session Tokens approach.

Now you are ready to start writing TileDB programs! When creating a TileDB context or a VFS object, you need to set up a configuration object with the following parameters for AWS S3 (supposing that your S3 buckets are on region us-east- - you can set an arbitrary region).

Parameter

Value

"vfs.s3.scheme"

"https"

"vfs.s3.region"

"us-east-1"

"vfs.s3.endpoint_override"

""

"vfs.s3.use_virtual_addressing"

"true"

The above configuration parameters are currently set as shown in TileDB by default. However, we suggest to always check whether the default values are the desired ones for your application.

Physical Organization

So far we explained that TileDB arrays and groups are stored as directories. There is no directory concept on S3 and other similar object stores. However, S3 uses character / in the object URIs which allows the same conceptual organization as a directory hierarchy in local storage. At a physical level, TileDB stores on S3 all the files it would create locally as objects. For instance, for array s3://bucket/path/to/array, TileDB creates array schema object s3://bucket/path/to/array/__array_schema.tdb, fragment metadata object s3://bucket/path/to/array/<fragment>/__fragment_metadata.tdb, and similarly all the other files/objects. Since there is no notion of a “directory” on S3, nothing special is persisted on S3 for directories, e.g., s3://bucket/path/to/array/<fragment>/ does not exist as an object.

The AWS S3 CLI allows you to sync (i.e., download) the S3 objects having a common URI prefix to local storage, organizing them into a directory hierarchy based on the use of / in the object URIs. This makes it very easy to clone TileDB arrays or entire groups locally from S3. For instance, given an array my_array you created and wrote on an S3 bucket my_bucket, you can clone it locally to an array my_local_array with the following command from your console:

$ aws s3 sync s3://my_bucket/my_array my_local_array

After downloading an array locally, your TileDB program will function properly by changing the array name from s3://my_bucket/my_array to my_local_array, without any other modification.

Performance

TileDB writes the various fragment files as append-only objects using the multi-part upload API of the AWS C++ SDK. In addition to enabling appends, this API renders the TileDB writes to S3 particularly amenable to optimizations via parallelization. Since TileDB updates arrays only by writing (appending to) new files (i.e., it never updates a file in-place), TileDB does not need to download entire objects, update them, and re-upload them to S3. This leads to excellent write performance.

TileDB reads utilize the range GET request API of the AWS SDK, which retrieves only the requested (contiguous) bytes from a file/object, rather than downloading the entire file from the cloud. This results in extremely fast subarray reads, especially because of the array tiling. Recall that a tile (which groups cell values that are stored contiguously in the file) is the atomic unit of IO. The range GET API enables reading each tile from S3 in a single request. Finally, TileDB performs all reads in parallel using multiple threads, which is a tunable configuration parameter.

Advanced

Proxy Server Settings

The AWS backend supports several settings for proxy servers:

It is necessary to override "vfs.s3.proxy_scheme" to `http` for most proxy setups. TileDB currently 2.0.8 uses the default setting, which will be updated in the next release.

Parameter

Default

Description

"vfs.s3.proxy_host"

""

The S3 proxy host.

"vfs.s3.proxy_port"

"0"

The S3 proxy port.

"vfs.s3.proxy_scheme"

"https"

The S3 proxy scheme.

"vfs.s3.proxy_username"

""

The S3 proxy username.

"vfs.s3.proxy_password"

""

The S3 proxy password.

Logging

TileDB uses the AWS C++ SDK and cURL for access to S3. The AWS SDK logging level is a process-global setting. The configuration of the most recently constructed context will set the process state. Log files are written to the process working directory.

Parameter

Values

"vfs.s3.logging_level"

[OFF], TRACE, DEBUG

Certificate Paths

There is no universal location for SSL/TLS certificates on Linux. While TileDB searches for the default CA store on several major distributions, other systems or custom certificates may require path specification. to SSL/TLS certificate file to be used by cURL for for S3 HTTPS encryption. These parameters follow cURL conventions: https://curl.haxx.se/docs/manpage.html

Parameter

Value

"vfs.s3.ca_file"

[file path]

"vfs.s3.ca_path"

[directory path]

For debugging purposes only it is possible to disable SSL/TLS certificate verification:

Parameter

Default

"vfs.s3.verify_ssl"

[false], true

Azure Blob Storage

This is a simple guide that demonstrates how to use TileDB on Azure Blob Storage. After configuring TileDB to work with Azure, your TileDB programs will function properly without any API change! Instead of using local file system paths for referencing files (e.g. arrays, groups, VFS files) use must format your URIs to start with azure://. For instance, if you wish to create (and subsequently write and read) an array on Azure, you use a URI of the format azure://<storage-container>/<your-array-name> for the array name.

TileDB does not support .

Setup

Sign into the Azure portal, create a new account if necessary.
On the Azure portal, click on the Storage accounts service.
Select the +Add button to navigate to the Create a storage account form.
Complete the form and create the storage account. You may use a Standard or Premium Block Blob account type.
In your application, set the "vfs.azure.storage_account_name" config option or the AZURE_STORAGE_ACCOUNT environment variable to the name of your storage account name. Alternatively, you can directly .

Authenticating to Azure

TileDB supports authenticating to Azure through Microsoft Entra ID, access keys and shared access signature tokens.

Microsoft Entra ID

- Only system-assigned managed identities are currently supported.

When the Azure backend gets initialized, it attempts to obtain credentials by the sources above. If no credentials can be obtained, TileDB will fall back to anonymous authentication.

Manually selecting which authentication method to use is not currently supported.

Microsoft Entra ID will not be used if any of the following conditions apply:

The vfs.azure.storage_account_key or vfs.azure.storage_sas_token configuration options are specified.
The AZURE_STORAGE_KEY or AZURE_STORAGE_SAS_TOKEN environment variables are specified.

TileDB does not currently support the following features when connecting to Azure with Microsoft Entra ID:

Selecting a specific credentials source without trying to authenticate with the others.
Authenticating with a service principal specified in config options instead of environment variables.

Shared key

Once your storage account has been created, navigate to its landing page. From the left menu, select the Access keys option. Copy the Storage account name and one of the auto-generated Keys.

Shared access signature

Navigate to the new storage account landing page. From the left menu, select the “Shared Access Signature” option.
Use all checked defaults, and select Allowed resource types → Container
Set an appropriate expiration date (note: SAS tokens cannot be revoked)
Click Generate SAS and connection string
Copy the SAS Token (second entry) and use in the TileDB config or environment variable:

Physical Organization

So far we explained that TileDB arrays and groups are stored as directories. There is no directory concept on Azure Blob Storage (similar to other popular object stores). However, Azure uses character / in the object URIs which allows the same conceptual organization as a directory hierarchy in local storage. At a physical level, TileDB stores all files on Azure that it would create locally as objects. For instance, for array azure://container/path/to/array, TileDB creates array schema object azure://container/path/to/array/__array_schema.tdb, fragment metadata object azure://container/path/to/array/<fragment>/__fragment_metadata.tdb, and similarly all the other files/objects. Since there is no notion of a “directory” on Azure, nothing special is persisted on Azure for directories, e.g., azure://container/path/to/array/<fragment>/ does not exist as an object.

Performance

TileDB reads utilize the range GET blob request API of the Azure SDK, which retrieves only the requested (contiguous) bytes from a file/object, rather than downloading the entire file from the cloud. This results in extremely fast subarray reads, especially because of the array tiling. Recall that a tile (which groups cell values that are stored contiguously in the file) is the atomic unit of IO. The range GET API enables reading each tile from Azure in a single request. Finally, TileDB performs all reads in parallel using multiple threads, which is a tunable configuration parameter.

Advanced

By default, the blob endpoint will be set to https://foo.blob.core.windows.net, where foo is the storage account name, as set by the "vfs.azure.storage_account_name" config option, or the AZURE_STORAGE_ACCOUNT environment variable. You can use the "vfs.azure.blob_endpoint" config parameter to override the default blob endpoint.

Google Cloud Storage

This is a simple guide that demonstrates how to use TileDB on Google Cloud Storage (GCS). After setting up TileDB to work with GCS, your TileDB programs will function properly without any API change! Instead of using local file system paths when creating/accessing groups, arrays, and VFS files: use URIs that start with gcs://. For instance, if you wish to create (and subsequently write/read) an array on GCS, you use URI gcs://<your-bucket>/<your-array-name> for the array name.

Configuration

Application Default Credentials

TileDB supports authenticating to Google Cloud using . Authentication happens automatically if your application is running on Google Cloud or in your local environment and you have authenticated with the command. In other cases, you can set the GOOGLE_APPLICATION_CREDENTIALS environment variable to a credentials file, like a .

Manually provided credentials

For more control, you can manually specify strings with the content of credentials files as a config option. TileDB supports the following types of credentials:

Config option

Description

If any of the above options are specified, will not be considered. If multiple options are specified, the one earlier in the table will be used.

Service account impersonation

You can connect to Google Cloud while , by setting the vfs.gcs.impersonate_service_account config option to either the name of a single service account, or a comma-separated sequence of service accounts, for .

The impersonation will be performed using the credentials configured by one of the above methods.

Additional configuration

The following config options are additionally available:

Physical Organization

So far we explained that TileDB arrays and groups are stored as directories. There is no directory concept on GCS and other similar object stores. However, GCS uses character / in the object URIs which allows the same conceptual organization as a directory hierarchy in local storage. At a physical level, TileDB stores on GCS all the files it would create locally as objects. For instance, for array gcs://bucket/path/to/array, TileDB creates array schema object gcs://bucket/path/to/array/__array_schema.tdb, fragment metadata object gcs://bucket/path/to/array/<fragment>/__fragment_metadata.tdb, and similarly all the other files/objects. Since there is no notion of a “directory” on GCS, nothing special is persisted on GCS for directories, e.g., gcs://bucket/path/to/array/<fragment>/ does not exist as an object.

Performance

TileDB reads utilize the range GET request API of the GCS SDK, which retrieves only the requested (contiguous) bytes from a file/object, rather than downloading the entire file from the cloud. This results in extremely fast subarray reads, especially because of the array tiling. Recall that a tile (which groups cell values that are stored contiguously in the file) is the atomic unit of IO. The range GET API enables reading each tile from GCS in a single request. Finally, TileDB performs all reads in parallel using multiple threads, which is a tunable configuration parameter.

Using GCS via the S3 compatibility API

While TileDB provides a native GCS backend implementation using the Google Cloud C++ SDK, it is also possible to use GCS via the GCS-S3 compatibility API using our S3 backend. Doing so requires setting several configuration parameters:

Note: vfs.s3.use_multipart_upload=true may work with recent GCS updates, but has not yet been tested/evaluated by TileDB.

Full example for GCS via S3 compatibility in Python:

import tiledb
import numpy as np
import sys

# update this
#uri = "s3://your-bucket/array-path"
uri = "s3://isaiah-test-bucket/test-11apr2023"

# read credentials from 'creds.nogit' file in current
# directory, newline separated:
#   "key\nsecret"
creds_path = "creds.nogit"
key,secret = [x.strip() for x in open(creds_path).readlines()]

# gcs config
config = tiledb.Config()
config["vfs.s3.endpoint_override"] = "storage.googleapis.com"
config["vfs.s3.aws_access_key_id"] = key
config["vfs.s3.aws_secret_access_key"] = secret
config["vfs.s3.region"] = "auto"
config["vfs.s3.use_multipart_upload"] = "false"

# context
ctx = tiledb.Ctx(config=config)

# create sample array if it does not exist
vfs = tiledb.VFS(ctx=ctx)
if not vfs.is_dir(uri):
  print("trying to write: ", uri)
  a = np.arange(5)
  schema = tiledb.schema_like(a, ctx=ctx)
  tiledb.Array.create(uri, schema)
  with tiledb.DenseArray(uri, 'w', ctx=ctx) as T:
    T[:] = a 

print("reading back from: ", uri)
with tiledb.DenseArray(uri, ctx=ctx) as t:
  print(t[:])

MinIO

MinIO is a lightweight S3-compliant object-store. Although it has many nice features, here we focus only on local deployment, which is very useful if you wish to quickly test your TileDB-S3 programs locally. See the MinIO quickstart guide for installation instructions. Here is what we do to run MinIO on port 9999:

$ mkdir -p /tmp/minio-data
$ docker run -e MINIO_ACCESS_KEY=minio -e MINIO_SECRET_KEY=miniosecretkey \
      -d -p 9999:9000 minio/minio server /tmp/minio-data
$ export AWS_ACCESS_KEY_ID=minio
$ export AWS_SECRET_ACCESS_KEY=miniosecretkey

Once you get MinIO server running, you need to set the S3 configurations as follows (below, <port> stands for the port on which you are running the MinIO server, equal to 9999 if you run the MinIO docker as shown above):

Parameter

Value

"vfs.s3.scheme"

"http"

"vfs.s3.region"

""

"vfs.s3.endpoint_override"

"localhost:<port>"

"vfs.s3.use_virtual_addressing"

"false"

The Configuration page explains in detail how to set configuration parameters in TileDB. Below is a quick example for the minio specific parameters.

// Create a configuration object
tiledb_config_t *config;
tiledb_config_alloc(&config, NULL);

// Set configuration parameters
tiledb_config_set(config, "vfs.s3.scheme", "http", &error);
tiledb_config_set(config, "vfs.s3.region", "", &error);
tiledb_config_set(config, "vfs.s3.endpoint_override", "<minio address>", &error);
tiledb_config_set(config, "vfs.s3.use_virtual_addressing", "false", &error);

// Create contex
tiledb_ctx_t* ctx;
tiledb_ctx_alloc(&config, &ctx);

// Pass CTX to read/write of array

// Clean up
tiledb_config_free(&config);
tiledb_ctx_free(&ctx);

// Create a configuration object
Config config;

// Set a configuration parameter
config["vfs.s3.scheme"] = "http";
config["vfs.s3.region"] = "";
config["vfs.s3.endpoint_override"] = "<minio address>";
config["vfs.s3.use_virtual_addressing"] = "false";

// Create contex
Ctx ctx(config);

// Pass CTX to read/write of array

# Create a configuration object
config = tiledb.Config()

# Set configuration parameters
config["vfs.s3.scheme"] = "http"
config["vfs.s3.region"] = ""
config["vfs.s3.endpoint_override"] = "<minio address>"
config["vfs.s3.use_virtual_addressing"] = "false"

# Create contex
ctx = tiledb.Ctx(config)

# Pass CTX to read/write of array
with tiledb.open(array_uri, ctx=ctx) as A:
  # Read or write and context with minio is applied

# Create a configuration object
config <- tiledb_config()

# Set a configuration parameter
config["vfs.s3.scheme"] = "http"
config["vfs.s3.region"] = ""
config["vfs.s3.endpoint_override"] = "<minio address>"
config["vfs.s3.use_virtual_addressing"] = "false"

ctx <- tiledb_ctx(config)

# Pass CTX to read/write of array

try(Config config = new Config()) {
    // Set configuration parameters
    config.set("vfs.s3.scheme", "http");
    config.set("vfs.s3.region", "");
    config.set("vfs.s3.endpoint_override", "<minio address>");
    config.set("vfs.s3.use_virtual_addressing", "false");
    
    // set config on context
    try(Context ctx = new Context(config)) {
      // Pass CTX to read/write of array
    }
}

// Create a configuration object
config, _ := tiledb.NewConfig()

// Set configuration parameters
config.Set("vfs.s3.scheme", "http")
config.Set("vfs.s3.region", "")
config.Set("vfs.s3.endpoint_override", "<minio address>")
config.Set("vfs.s3.use_virtual_addressing", "false")

// set config on context
ctx, _ := tiledb.NewCtx(&config)

// Pass CTX to read/write of array

// Clean up
config.Free()
ctx.Free()

HDFS

This is a simple guide that demonstrates how to use TileDB on HDFS. HDFS is a distributed Java-based filesystem for storing large amounts of data. It is the underlying distributed storage layer for the Hadoop stack.

The HDFS backend currently only works on POSIX (Linux, macOS) platforms. Windows is currently not supported.

TileDB integrates with HDFS through the libhdfs library (HDFS C-API). The HDFS backend is enabled by default and libhdfs loading happens at runtime based on environment variables:

Env variable

Description

HADOOP_HOME

The root of your installed Hadoop distribution. TileDB will search the path $HADOOP_HOME/lib/native to find the libhdfs shared library.

JAVA_HOME

The location of the Java SDK installation. The JAVA_HOME variable may be set to the root path for the JAVA SDK, the JRE path, or to the directory containing the libjvm library.

CLASSPATH

The Java classpath including the Hadoop jar files. The correct CLASSPATH variable can be set directly using the hadoop utility: CLASSPATH=$HADOOP_HOME/bin/hadoop classpath –glob

If the library cannot be found, or if the Hadoop library cannot locate the correct library dependencies a runtime, an error will be returned.

To use HDFS with TileDB, change the URI you use to an HDFS path:

hdfs://<authority>:<port>/path/to/array

For instance, if you are running a local HDFS namenode on port 9000:

hdfs://localhost:9000/path/to/array

If you want to use the namenode specified in your HDFS configuration files, then change the prefix to:

hdfs:///path/to/array or hdfs://default/path/to/array

Most HDFS configuration variables are defined in Hadoop specific XML files. TileDB allows the following configuration variables to be set at run time through configuration parameters:

Parameter

Description

vfs.hdfs.username

Optional runtime username to use when connecting to the HDFS cluster.

vfs.hdfs.name_node_uri

Optional namenode URI to use (TileDB will use "default" if not specified). URI must be specified in the format <protocol>://<hostname>:<port>, ex: hdfs://localhost:9000. If the string starts with a protocol type such as file:// or s3://, this protocol will be used instead of the default hdfs://.

vfs.hdfs.kerb_ticket_cache_path

Path to the Kerberos ticket cache when connecting to an HDFS cluster.

The Configuration page explains how to set configuration parameters in TileDB.

Lustre

Lustre is a POSIX-compliant distributed filesystem and, therefore, TileDB "just works" on Lustre. Care must be taken to enable file locking for process-safety during consolidation (see Concurrency for more details).

Lustre supports POSIX file locking semantics and exposes local- (mount with -o localflock) and cluster- (mount with -o flock) level locking.

Azure Blob Storage

TileDB does not support .

Setup

Sign into the Azure portal, create a new account if necessary.
On the Azure portal, click on the Storage accounts service.
Select the +Add button to navigate to the Create a storage account form.
Complete the form and create the storage account. You may use a Standard or Premium Block Blob account type.
In your application, set the "vfs.azure.storage_account_name" config option or the AZURE_STORAGE_ACCOUNT environment variable to the name of your storage account name. Alternatively, you can directly .

Authenticating to Azure

TileDB supports authenticating to Azure through Microsoft Entra ID, access keys and shared access signature tokens.

Microsoft Entra ID

Microsoft Entra ID is and provides superior security and fine-grained access compared to shared keys. It is enabled by default and you do not need to specifically configure TileDB to use it. Credentials are obtained automatically from the following sources in order:

for Azure compute resources
- Only system-assigned managed identities are currently supported.
for Kubernetes

When the Azure backend gets initialized, it attempts to obtain credentials by the sources above. If no credentials can be obtained, TileDB will fall back to anonymous authentication.

Manually selecting which authentication method to use is not currently supported.

Microsoft Entra ID will not be used if any of the following conditions apply:

The vfs.azure.storage_account_key or vfs.azure.storage_sas_token configuration options are specified.
The AZURE_STORAGE_KEY or AZURE_STORAGE_SAS_TOKEN environment variables are specified.
A is specified that is not using HTTPS.

TileDB does not currently support the following features when connecting to Azure with Microsoft Entra ID:

Selecting a specific credentials source without trying to authenticate with the others.
Authenticating with a service principal specified in config options instead of environment variables.
Authenticating with a .

Make sure to assign the right roles to the identity to use with TileDB. The general and roles do not provide access to data inside the storage accounts. You need to assign the or the roles in order to read or write data respectively.

Shared key

Authentication with shared keys is considered insecure. You are recommended to use .

Once your storage account has been created, navigate to its landing page. From the left menu, select the Access keys option. Copy the Storage account name and one of the auto-generated Keys.
Set the following keys in a configuration object (see ) or environment variable. Use the storage account name and key from the last step.
Parameter
Environment variable
Default Value

Shared access signature

Navigate to the new storage account landing page. From the left menu, select the “Shared Access Signature” option.
Use all checked defaults, and select Allowed resource types → Container
Set an appropriate expiration date (note: SAS tokens cannot be revoked)
Click Generate SAS and connection string
Copy the SAS Token (second entry) and use in the TileDB config or environment variable:

Parameter

Environment variable

Default Value

Physical Organization

Performance

TileDB writes the various fragment files as append-only objects using the block-list upload API of the . In addition to enabling appends, this API renders the TileDB writes to Azure particularly amenable to optimizations via parallelization. Since TileDB updates arrays only by writing (appending to) new files (i.e., it never updates a file in-place), TileDB does not need to download entire objects, update them, and re-upload them to Azure. This leads to excellent write performance.

Advanced

Parameter

Default Value

If the custom endpoint contains a , the "vfs.azure.storage_sas_token" option must not be specified.

AWS S3

Quickstart

If you have already:

saved your AWS credentials using the AWS CLI aws login command
installed TileDB-Py with pip install tiledb or mamba install -c conda-forge tiledb-py

Then the following should work for any bucket in the us-east-1 region:

import tiledb
# 
tiledb.open("s3://<bucket name>/array_name

If the bucket is in another region, then use:

import tiledb
tiledb.default_ctx({
    "vfs.s3.region": "us-west-2" # replace with bucket region
})
a = tiledb.open("s3://<bucket name>/array_name")

Configuration Overview

AWS Setup

First, we need to set up an AWS account and generate access keys.

Create a new AWS account.
Visit the AWS console and sign in.
On the AWS console, click on the Services drop-down menu and select Storage->S3. You can create S3 buckets there.
On the AWS console, click on the Services drop-down menu and select Security, Identity & Compliance->IAM.
Click on Users from the left-hand side menu, and then click on the Add User button. Provide the email or username of the user you wish to add, select the Programmatic Access checkbox and click on Next: Permissions.
Click on the Attach existing policies directly button, search for the S3-related policies and add the policy of your choice (e.g., full-access, read-only, etc.). Click on Next and then Create User.
1. Using TileDB with an existing bucket requires at least the following S3 permissions:
  s3:ListBucket s3:GetObject s3:PutObject s3:ListBucketMultipartUploads s3:AbortMultipartUpload s3:ListMultipartUploadParts s3:DeleteObject
Upon successful creation, the next page will show the user along with two keys: Access key ID and Secret access key. Write down both these keys.

(see the TileDB Cloud documentation for an providing these permissions)

AWS Security Credentials

Access Keys

Access keys are long-term credentials for an IAM user or the AWS account root user. To be able to access AWS resource this way you need to follow the next steps:

1. Export these keys to your environment from a console:

export AWS_ACCESS_KEY_ID=<your-access-key-id>
export AWS_SECRET_ACCESS_KEY=<your-secret-access-key>

$env:AWS_ACCESS_KEY_ID = "<your-access-key-id>"
$env:AWS_SECRET_ACCESS_KEY = "<your-secret-access-key>"

set AWS_ACCESS_KEY_ID=<your-access-key-id>
set AWS_SECRET_ACCESS_KEY=<your-secret-access-key>

2. Or, set the following keys in a configuration object (see Configuration):

Parameter

Default Value

"vfs.s3.aws_access_key_id"

""

"vfs.s3.aws_secret_access_key"

""

Session Tokens

Parameter

Values

"vfs.s3.aws_session_token"

(session token corresponding to the configured key/secret pair)

Assume Role

Parameter

Default Value

"vfs.s3.aws_role_arn"

Required - (The Amazon Resource Name (ARN) of the role to assume)

"vfs.s3.aws_session_name"

Optional - (An identifier for the assumed role session)

"vfs.s3.aws_external_id"

Optional - (A unique identifier that might be required when you assume a role in another account)

"vfs.s3.aws_load_freq"

Optional - (The duration, in minutes, of the role session)

Parameter

Value

"vfs.s3.scheme"

"https"

"vfs.s3.region"

"us-east-1"

"vfs.s3.endpoint_override"

""

"vfs.s3.use_virtual_addressing"

"true"

The above configuration parameters are currently set as shown in TileDB by default. However, we suggest to always check whether the default values are the desired ones for your application.

Physical Organization

$ aws s3 sync s3://my_bucket/my_array my_local_array

After downloading an array locally, your TileDB program will function properly by changing the array name from s3://my_bucket/my_array to my_local_array, without any other modification.

Performance

Advanced

Proxy Server Settings

The AWS backend supports several settings for proxy servers:

It is necessary to override "vfs.s3.proxy_scheme" to `http` for most proxy setups. TileDB currently 2.0.8 uses the default setting, which will be updated in the next release.

Parameter

Default

Description

"vfs.s3.proxy_host"

""

The S3 proxy host.

"vfs.s3.proxy_port"

"0"

The S3 proxy port.

"vfs.s3.proxy_scheme"

"https"

The S3 proxy scheme.

"vfs.s3.proxy_username"

""

The S3 proxy username.

"vfs.s3.proxy_password"

""

The S3 proxy password.

Logging

Parameter

Values

"vfs.s3.logging_level"

[OFF], TRACE, DEBUG

Certificate Paths

Parameter

Value

"vfs.s3.ca_file"

[file path]

"vfs.s3.ca_path"

[directory path]

For debugging purposes only it is possible to disable SSL/TLS certificate verification:

Parameter

Default

"vfs.s3.verify_ssl"

[false], true