Azure Blob Storage
In version 2.16.0 the Azure backend has been overhauled. Some configuration options have changed from previous versions.
This is a simple guide that demonstrates how to use TileDB on Azure Blob Storage. After configuring TileDB to work with Azure, your TileDB programs will work without any API change. Instead of using local file system paths to reference files (e.g., arrays, groups, VFS files), you must format your URIs to start with `azure://`. For instance, if you wish to create (and subsequently write and read) an array on Azure, use a URI of the format `azure://<storage-container>/<your-array-name>` for the array name.

1. Sign in to the Azure portal, creating a new account if necessary.
2. On the Azure portal, click on the `Storage accounts` service.
3. Select the `+Add` button to navigate to the `Create a storage account` form.
4. Complete the form and create the storage account. You may use a Standard or Premium Block Blob account type.
TileDB supports authenticating to Azure through access keys and shared access signature tokens.
1. Once your storage account has been created, navigate to its landing page. From the left menu, select the `Access keys` option. Copy the `Storage account name` and one of the auto-generated `Key`s.
2. Set the following keys in a configuration object (see Configuration). Use the storage account name and key from the previous step.

Parameter | Default Value |
---|---|
"vfs.azure.storage_account_name" | "" |
"vfs.azure.storage_account_key" | "" |
1. Navigate to the new storage account landing page. From the left menu, select the `Shared access signature` option.
2. Keep all checked defaults, and under `Allowed resource types` select `Container`.
3. Set an appropriate expiration date (note: SAS tokens cannot be revoked).
4. Click `Generate SAS and connection string`.
5. Copy the `SAS Token` (second entry) and use it in the TileDB config:
Parameter | Default Value |
---|---|
"vfs.azure.storage_sas_token" | "" |
So far we explained that TileDB arrays and groups are stored as directories. There is no directory concept on Azure Blob Storage (similar to other popular object stores). However, Azure uses the character `/` in object URIs, which allows the same conceptual organization as a directory hierarchy in local storage. At a physical level, TileDB stores on Azure, as objects, all the files that it would create locally. For instance, for array `azure://container/path/to/array`, TileDB creates the array schema object `azure://container/path/to/array/__array_schema.tdb`, the fragment metadata object `azure://container/path/to/array/<fragment>/__fragment_metadata.tdb`, and similarly all the other files/objects. Since there is no notion of a "directory" on Azure, nothing special is persisted on Azure for directories, e.g., `azure://container/path/to/array/<fragment>/` does not exist as an object.

TileDB writes the various fragment files as append-only objects using the block-list upload API of the Azure SDK for C++. In addition to enabling appends, this API makes TileDB writes to Azure particularly amenable to optimization via parallelization. Since TileDB updates arrays only by writing (appending to) new files (i.e., it never updates a file in place), it does not need to download entire objects, update them, and re-upload them to Azure. This leads to excellent write performance.
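The block-list idea can be pictured with a small in-memory model: blocks are staged independently (possibly in parallel) and only become part of the object once an ordered list of block IDs is committed. The Python sketch below is a toy stand-in, not the Azure SDK and not TileDB's C++ implementation; the class `ToyBlockBlob`, the function `append`, and the block ID format are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


class ToyBlockBlob:
    """In-memory stand-in for a block blob (not the Azure SDK)."""

    def __init__(self):
        self.staged = {}     # block ID -> bytes, uploaded but not yet visible
        self.committed = []  # ordered block IDs that make up the object

    def stage_block(self, block_id: str, data: bytes) -> None:
        self.staged[block_id] = data

    def commit_block_list(self, block_ids: list) -> None:
        # The committed list alone defines the object's content and order.
        self.committed = list(block_ids)

    def content(self) -> bytes:
        return b"".join(self.staged[b] for b in self.committed)


def append(blob: ToyBlockBlob, chunks: list) -> None:
    """Stage chunks in parallel, then commit them after the existing blocks."""
    start = len(blob.committed)
    ids = [f"block-{start + i:06d}" for i in range(len(chunks))]
    with ThreadPoolExecutor(max_workers=4) as pool:
        # Completion order does not matter; the commit fixes the order.
        list(pool.map(blob.stage_block, ids, chunks))
    blob.commit_block_list(blob.committed + ids)


blob = ToyBlockBlob()
append(blob, [b"tile-0|", b"tile-1|"])
append(blob, [b"tile-2|"])
result = blob.content()
```

In this model an append never rewrites earlier blocks: it stages the new ones and commits the old list extended with the new IDs.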
TileDB reads use the range GET blob request API of the Azure SDK, which retrieves only the requested (contiguous) bytes from a file/object, rather than downloading the entire file from the cloud. This results in fast subarray reads, especially because of the array tiling. Recall that a tile (which groups cell values that are stored contiguously in the file) is the atomic unit of IO. The range GET API enables reading each tile from Azure in a single request. Finally, TileDB performs all reads in parallel using multiple threads; the degree of parallelism is a tunable configuration parameter.
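To make the per-tile range reads concrete, the sketch below computes the byte range for each tile and fetches the tiles in parallel. It is a Python illustration that uses an in-memory byte string in place of a blob, so `read_range` only imitates a ranged GET; the tile offsets, the helper names, and the thread count are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def range_header(offset: int, length: int) -> str:
    """HTTP Range header value for `length` bytes starting at `offset`."""
    if offset < 0 or length <= 0:
        raise ValueError("offset must be >= 0 and length must be > 0")
    return f"bytes={offset}-{offset + length - 1}"  # end is inclusive


def read_range(blob: bytes, offset: int, length: int) -> bytes:
    """Stand-in for a ranged GET: return only the requested bytes."""
    return blob[offset:offset + length]


# Hypothetical object holding three 4-byte tiles back to back.
blob = b"AAAABBBBCCCC"
tiles = [(0, 4), (4, 4), (8, 4)]  # (offset, length) per tile

with ThreadPoolExecutor(max_workers=3) as pool:
    # One request per tile, issued in parallel; results keep tile order.
    results = list(pool.map(lambda t: read_range(blob, *t), tiles))

headers = [range_header(o, n) for o, n in tiles]
```

Each tile maps to exactly one contiguous byte range, which is why a single request per tile suffices.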
By default, the blob endpoint will be set to `https://foo.blob.core.windows.net`, where `foo` is the storage account name. You can use the "vfs.azure.blob_endpoint" config parameter to override the default blob endpoint.

Parameter | Default Value |
---|---|
"vfs.azure.blob_endpoint" | "" |
Since version 2.16.0, the custom endpoint must include the scheme (`http://` or `https://`). The "vfs.azure.use_https" option is deprecated.

If the custom endpoint contains a SAS token, the "vfs.azure.storage_sas_token" option must not be specified.
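A configuration can be checked against these endpoint rules before it reaches TileDB. The Python sketch below is one possible pre-flight check, not part of TileDB; the function name `endpoint_problems` is hypothetical, and detecting a SAS token by the presence of a `sig` query field is a heuristic assumption.

```python
from urllib.parse import parse_qs, urlsplit


def endpoint_problems(params: dict) -> list:
    """Return a list of rule violations for a custom blob endpoint."""
    problems = []
    endpoint = params.get("vfs.azure.blob_endpoint", "")
    if not endpoint:
        return problems  # no custom endpoint, nothing to check

    parts = urlsplit(endpoint)
    if parts.scheme not in ("http", "https"):
        problems.append("endpoint must start with http:// or https://")

    # Heuristic: a SAS query string carries its signature in a `sig` field.
    has_sas = "sig" in parse_qs(parts.query)
    if has_sas and params.get("vfs.azure.storage_sas_token", ""):
        problems.append(
            "endpoint already contains a SAS token; "
            "unset vfs.azure.storage_sas_token"
        )
    return problems


ok = endpoint_problems(
    {"vfs.azure.blob_endpoint": "https://foo.blob.core.windows.net"}
)
missing_scheme = endpoint_problems(
    {"vfs.azure.blob_endpoint": "foo.blob.core.windows.net"}
)
double_sas = endpoint_problems({
    "vfs.azure.blob_endpoint": "https://foo.blob.core.windows.net?sv=x&sig=y",
    "vfs.azure.storage_sas_token": "sv=x&sig=y",
})
```

Returning a list rather than raising lets a caller report every problem at once.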