Azure Blob Storage

Last updated 11 months ago


This guide demonstrates how to use TileDB on Azure Blob Storage. After configuring TileDB to work with Azure, your TileDB programs will function properly without any API change. Instead of using local file system paths for referencing files (e.g., arrays, groups, VFS files), you must format your URIs to start with azure://. For instance, if you wish to create (and subsequently write and read) an array on Azure, use a URI of the format azure://<storage-container>/<your-array-name> for the array name.
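
The URI convention above can be sketched in Python as follows. This is a minimal illustration, not part of the TileDB API: the helper names azure_uri and create_and_read are hypothetical, and create_and_read assumes that the tiledb and numpy packages are installed and that Azure credentials have already been configured.

```python
def azure_uri(container: str, array_name: str) -> str:
    """Build a URI of the form azure://<storage-container>/<your-array-name>."""
    return f"azure://{container.strip('/')}/{array_name.strip('/')}"


def create_and_read(uri: str):
    """Create, write, and read a tiny dense array at `uri`.

    The calls are the same ones you would use with a local path; only the
    URI changes. Requires tiledb and numpy (imported lazily here).
    """
    import numpy as np
    import tiledb

    dom = tiledb.Domain(tiledb.Dim(name="d", domain=(0, 3), tile=4, dtype=np.int32))
    schema = tiledb.ArraySchema(domain=dom, attrs=[tiledb.Attr(name="a", dtype=np.int32)])
    tiledb.Array.create(uri, schema)
    with tiledb.open(uri, "w") as arr:
        arr[:] = np.arange(4, dtype=np.int32)
    with tiledb.open(uri) as arr:
        return arr[:]


# Hypothetical container and array names.
uri = azure_uri("my-container", "my_array")
```

Passing the same uri to create_and_read would exercise the create, write, and read path against Azure instead of the local file system.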

TileDB does not support .

Setup

Sign in to the Azure portal, creating a new account if necessary.
On the Azure portal, click on the Storage accounts service.
Select the +Add button to navigate to the Create a storage account form.
Complete the form and create the storage account. You may use a Standard or Premium Block Blob account type.
In your application, set the "vfs.azure.storage_account_name" config option or the AZURE_STORAGE_ACCOUNT environment variable to the name of your storage account. Alternatively, you can directly .

Authenticating to Azure

TileDB supports authenticating to Azure through Microsoft Entra ID, access keys and shared access signature tokens.

Microsoft Entra ID

- Only system-assigned managed identities are currently supported.

When the Azure backend is initialized, it attempts to obtain credentials from the sources above. If no credentials can be obtained, TileDB will fall back to anonymous authentication.

Manually selecting which authentication method to use is not currently supported.

Microsoft Entra ID will not be used if any of the following conditions apply:

The vfs.azure.storage_account_key or vfs.azure.storage_sas_token configuration options are specified.
The AZURE_STORAGE_KEY or AZURE_STORAGE_SAS_TOKEN environment variables are specified.
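
The two conditions above can be expressed as a small check. This is only a sketch of the rule as stated here, not TileDB's internal logic: the function name entra_id_is_skipped is hypothetical, and it assumes that "specified" means "set to a non-empty value".

```python
import os

# Config options and environment variables named in the conditions above.
KEY_OPTIONS = ("vfs.azure.storage_account_key", "vfs.azure.storage_sas_token")
KEY_ENV_VARS = ("AZURE_STORAGE_KEY", "AZURE_STORAGE_SAS_TOKEN")


def entra_id_is_skipped(config: dict, env=None) -> bool:
    """Return True if a shared key or SAS token is present in config or env.

    Assumption: an empty string counts as "not specified".
    """
    env = os.environ if env is None else env
    in_config = any(config.get(name) for name in KEY_OPTIONS)
    in_env = any(env.get(name) for name in KEY_ENV_VARS)
    return in_config or in_env
```

A check like this can help when debugging why a managed identity is not being used: a leftover AZURE_STORAGE_KEY in the environment is enough to bypass Microsoft Entra ID.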

TileDB does not currently support the following features when connecting to Azure with Microsoft Entra ID:

Selecting a specific credentials source without trying to authenticate with the others.
Authenticating with a service principal specified in config options instead of environment variables.

Shared key

Once your storage account has been created, navigate to its landing page. From the left menu, select the Access keys option. Copy the Storage account name and one of the auto-generated Keys.

| Parameter | Environment variable | Default Value |
| --- | --- | --- |
| "vfs.azure.storage_account_name" | AZURE_STORAGE_ACCOUNT | "" |
| "vfs.azure.storage_account_key" | AZURE_STORAGE_KEY | "" |

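As an illustration of the config route, the two parameters in the table can be collected in a plain dictionary and handed to a TileDB context. This is a sketch under assumptions: shared_key_options and make_context are hypothetical helper names, the account name and key shown are placeholders, and make_context requires the tiledb Python package.

```python
def shared_key_options(account_name: str, account_key: str) -> dict:
    """Return the shared-key config options listed in the table above."""
    if not account_name or not account_key:
        raise ValueError("both the storage account name and key are required")
    return {
        "vfs.azure.storage_account_name": account_name,
        "vfs.azure.storage_account_key": account_key,
    }


def make_context(options: dict):
    """Build a TileDB context from config options (tiledb is imported lazily)."""
    import tiledb

    return tiledb.Ctx(tiledb.Config(options))


# Placeholder values; use the name and key copied from the Access keys page.
options = shared_key_options("mystorageaccount", "<account-key>")
```

The resulting context can then be passed to calls that accept a ctx argument, such as tiledb.open(uri, ctx=make_context(options)).
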
Shared access signature

Navigate to the new storage account landing page. From the left menu, select the “Shared Access Signature” option.
Use all checked defaults, and select Allowed resource types → Container
Set an appropriate expiration date (note: SAS tokens cannot be revoked)
Click Generate SAS and connection string
Copy the SAS Token (second entry) and use it in the TileDB config or environment variable:

| Parameter | Environment variable | Default Value |
| --- | --- | --- |
| "vfs.azure.storage_sas_token" | AZURE_STORAGE_SAS_TOKEN | "" |

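Because a SAS token carries its own expiration date, it can be useful to check that date before passing the token to TileDB. The sketch below assumes the token includes the standard se (signed expiry) field in the YYYY-MM-DDThh:mm:ssZ form; other ISO 8601 variants are not handled. The function names and the token value are hypothetical.

```python
from datetime import datetime, timezone
from urllib.parse import parse_qs


def sas_expiry(sas_token: str) -> datetime:
    """Extract the expiry time from the `se` field of a SAS token."""
    fields = parse_qs(sas_token.lstrip("?"))
    expiry = fields["se"][0]
    # Assumption: expiry is written as e.g. 2030-01-01T00:00:00Z (UTC).
    return datetime.strptime(expiry, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def sas_options(sas_token: str, now=None) -> dict:
    """Return the SAS config option, refusing tokens that have already expired."""
    now = now or datetime.now(timezone.utc)
    if sas_expiry(sas_token) <= now:
        raise ValueError("SAS token has expired")
    return {"vfs.azure.storage_sas_token": sas_token}


# A made-up token with a fake signature, for illustration only.
token = "sv=2022-11-02&ss=b&srt=c&sp=rl&se=2030-01-01T00:00:00Z&sig=fake"
```

The dictionary returned by sas_options can be passed to tiledb.Config in the same way as any other set of config options.
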
Physical Organization

So far we explained that TileDB arrays and groups are stored as directories. As in other popular object stores, there is no directory concept on Azure Blob Storage. However, Azure uses the / character in object URIs, which allows the same conceptual organization as a directory hierarchy in local storage. At a physical level, every file that TileDB would create locally is stored on Azure as an object. For instance, for array azure://container/path/to/array, TileDB creates array schema object azure://container/path/to/array/__array_schema.tdb, fragment metadata object azure://container/path/to/array/<fragment>/__fragment_metadata.tdb, and similarly all the other files/objects. Since there is no notion of a “directory” on Azure, nothing special is persisted on Azure for directories, e.g., azure://container/path/to/array/<fragment>/ does not exist as an object.
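
The mapping from local files to objects can be sketched as plain string handling. This is an illustration of the naming scheme described above, not TileDB code: object_uris is a hypothetical helper, and <fragment> is kept as the same placeholder used in the text.

```python
def object_uris(array_uri: str, relative_files: list[str]) -> list[str]:
    """Map files relative to the array root to the objects stored on Azure."""
    base = array_uri.rstrip("/")
    return [f"{base}/{rel}" for rel in relative_files]


uris = object_uris(
    "azure://container/path/to/array",
    ["__array_schema.tdb", "<fragment>/__fragment_metadata.tdb"],
)

# Only files become objects; no object name ends with "/", so nothing
# is persisted for the "directories" themselves.
directory_objects = [u for u in uris if u.endswith("/")]
```

Here directory_objects is empty, which mirrors the point that azure://container/path/to/array/<fragment>/ does not exist as an object.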

Performance

TileDB reads use the range GET blob request API of the Azure SDK, which retrieves only the requested (contiguous) bytes from a file/object rather than downloading the entire file from the cloud. This makes subarray reads fast, particularly in combination with array tiling. Recall that a tile (which groups cell values that are stored contiguously in the file) is the atomic unit of IO. The range GET API enables reading each tile from Azure in a single request. Finally, TileDB performs all reads in parallel using multiple threads; the number of threads is a tunable configuration parameter.
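
The idea of one range request per tile, issued in parallel, can be simulated without any network access. The sketch below is not how TileDB is implemented: it stands in an in-memory bytes object for the blob, and the function names, tile offsets, and thread count are all hypothetical. The only external fact it relies on is the standard HTTP Range header form bytes=<first>-<last>, where both ends are inclusive.

```python
from concurrent.futures import ThreadPoolExecutor


def range_header(offset: int, size: int) -> str:
    """HTTP Range header value for `size` bytes starting at `offset` (inclusive end)."""
    return f"bytes={offset}-{offset + size - 1}"


def range_get(blob: bytes, offset: int, size: int) -> bytes:
    """Stand-in for a range GET: return only the requested contiguous bytes."""
    return blob[offset:offset + size]


def read_tiles(blob: bytes, tiles: list[tuple[int, int]], max_threads: int = 4) -> list[bytes]:
    """Fetch each (offset, size) tile with its own request, using a thread pool."""
    with ThreadPoolExecutor(max_workers=max_threads) as pool:
        return list(pool.map(lambda tile: range_get(blob, *tile), tiles))


blob = bytes(range(16))              # a 16-byte "object"
tiles = [(0, 4), (8, 4)]             # two tiles needed by a subarray query
data = read_tiles(blob, tiles)       # bytes 4-7 and 12-15 are never fetched
```

Only 8 of the 16 bytes are transferred in this toy example, which is the same effect that tiling combined with range requests has on a real subarray read.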

Advanced

By default, the blob endpoint will be set to https://foo.blob.core.windows.net, where foo is the storage account name, as set by the "vfs.azure.storage_account_name" config option, or the AZURE_STORAGE_ACCOUNT environment variable. You can use the "vfs.azure.blob_endpoint" config parameter to override the default blob endpoint.

| Parameter | Default Value |
| --- | --- |
| "vfs.azure.blob_endpoint" | "" |

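The endpoint rule described above can be written out as a small function. This is a sketch of the documented behavior rather than TileDB's own code: resolve_blob_endpoint is a hypothetical name, and the override shown is just an example value (the default address of a local Azurite emulator).

```python
def resolve_blob_endpoint(options: dict) -> str:
    """Return the blob endpoint implied by a set of config options.

    An explicit "vfs.azure.blob_endpoint" wins; otherwise the endpoint is
    derived from the storage account name.
    """
    override = options.get("vfs.azure.blob_endpoint", "")
    if override:
        return override
    account = options.get("vfs.azure.storage_account_name", "")
    return f"https://{account}.blob.core.windows.net"


default_endpoint = resolve_blob_endpoint({"vfs.azure.storage_account_name": "foo"})

custom_endpoint = resolve_blob_endpoint({
    "vfs.azure.storage_account_name": "devstoreaccount1",
    # Example override only: a local emulator endpoint.
    "vfs.azure.blob_endpoint": "http://127.0.0.1:10000/devstoreaccount1",
})
```
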