1 of 81

TileDB Cloud

Welcome to TileDB Cloud!

Data management made universal

What is TileDB Cloud?

TileDB Cloud is the commercial platform built by the TileDB team that allows you and your organization to unify all types of data, automate distributed analysis and pipelines, and securely explore and share data and code, while enjoying extreme interoperability with programming languages and data science tools.

TileDB Cloud takes a radically different approach than the current data management landscape. Instead of dealing with a large number of different data formats and special-purpose databases, TileDB Cloud builds a unified data management stack, storing all types of data in a single unified format, pushing access control, logging and a growing set of computational primitives to storage, and emphasizing on integrations with every popular programming language and computational tool. And it does all that providing a 100% serverless experience to the user.

TileDB Cloud is based on the TileDB Open Source universal storage engine, which models and efficiently stores all data as (dense or sparse) multi-dimensional arrays, providing a common API and a large number of APIs and tool integrations.

Is TileDB Cloud for you?

TileDB Cloud is ideal for you if you struggle with:

Data storage and access
- inefficient files and domain-specific formats
- wrangling data across tools and languages
- controlling and monitoring access to data and code
- sharing data and code at extreme scale
Mixed data and workloads
- different data types (e.g., dataframes, images, genomics, etc)
- combination of SQL, data science and Machine Learning
Scalability and deployment
- scaling analysis easily and inexpensively
- setting up and monitoring machine clusters
- managing numerous disparate data solutions and silos
Finding and contributing public data and code
- sharing files in cloud buckets and code in repos
- easily and reproducibly running code on different data
- monitoring usage stats
Monetizing data and code
- having to build an entire infrastructure to sell data and code

TileDB Cloud (SaaS) currently works only on AWS. We are currently working on adding multi-cloud support, namely for Azure and Google Cloud. See TileDB Cloud Enterprise if you are interested in hosting TileDB Cloud in your own environment.

Use Cases

There is a variety of application domains our team has expertise on and you can use TileDB Cloud for:

Geospatial
- point cloud (LiDAR, SONAR, AIS)
- SAR
- optical imaging
Genomics
- population genomics
- single-cell multi-omics
Dataframes
- any tabular data, accessible with various APIs as well as SQL
Time series
- any data that could benefit from indexing on date/time fields
Biomedical imaging
- any imaging data requiring pyramid structures
Many others
- video, automotive, telecommunications, etc.

Capabilities

Data and code management Manage all you data (modeled as multi-dimensional arrays) and code (UDFs and notebooks) in a single platform.
Access control and sharing Securely share your data and code with access policies.
Logging and auditing See all the activity on your data and code from detailed audit logs.
Organizations Create organizations and define different access policies for data and code.
Jupyter notebooks Create, share and spin up Jupyter notebooks directly in the platform with a few clicks.
Serverless SQL and UDFs You can run any SQL query on any array, without having to provision any clusters. You can also define and register any user-defined function, as well as share it with other, which can be run in a serverless manner (similar to lambdas).
Serverless task graphs Create any pipeline or any sophisticated distributed algorithm with TileDB's task graphs. TileDB executes the various tasks in the graph in parallel respecting the dependencies, without forcing the user to create or define any clusters.
Data and code marketplace Take advantage of TileDB's full-fledged marketplace (integrating with Stripe) and monetize your data and code based on egress or CPU time.

Getting Started

The best way to get started is to sign up and run the Start Here! tutorial. You can find a constantly growing number of tutorials in the TUTORIALS page group found in the left navigation menu of these docs.

If you'd like to take a deep dive into the TileDB Cloud internals, you can navigate to CONCEPTS in the left navigation menu. You can also always consult the HOW TO guides and API REFERENCE.

How to Use the Docs

To make it easy to understand where to find what you are looking for, the documentation is structured in the following sections:

Tutorials A series of steps to address key problems and use cases
Concepts Background information and explanation of key topics and concepts
How To Short how-to guides based on FAQ
API Reference
Technical reference to the client APIs

TileDB Cloud Enterprise

Looking for an on-prem solution?

TileDB Cloud is available as a customer-hosted instance to address enterprise security policies and governance mandates. Learn more about TileDB Cloud Enterprise.

More Help

In case you do not find the information you need in these docs, there is a variety of channels you can get more help from:

Visit our forum and post a question
Join our Slack community and post comments
Request a feature if TileDB Cloud is missing important functionality
You can always contact us through our various communication channels

Tutorials

Start Here!

This page is currently under development and will be updated soon.

In this tutorial you will learn how to:

Sign up / sign in to TileDB Cloud
Access a public array using the TileDB-Py library
Access basic array information using the TileDB Cloud client
View the public array on the TileDB Cloud console
View a task on the TileDB Cloud console
View and edit your profile on the TileDB Cloud console

First, sign up and you will get $10 in free credit. Feel free to contact us if you run out of credits and you are still evaluating TileDB Cloud, we'll be happy to help out with more credits. Then simply sign in with the username and password you created.

Access a Public Array

It is extremely easy to access arrays registered with TileDB Cloud via the core TileDB Open Source APIs, with literally a single change: adding your username/password as configuration parameters. In this tutorial we will use TileDB-Py, TileDB-R and TileDB-Java. Let's first install it with:

# Pip:
$ pip install tiledb

# Or Conda:
$ conda install -c conda-forge tiledb-py

# Via CRAN
install.packages("tiledb")

// Maven
//Include this in your Maven project:
<dependency>
  <groupId>io.tiledb</groupId>
  <artifactId>tiledb-java</artifactId>
  <version>X.X.X</version>
</dependency>

// Or build from source
$ git clone https://github.com/TileDB-Inc/TileDB-Java.git
$ cd TileDB-Java
$ ./gradlew assemble

To validate the installation, run:

$ python
>>> import tiledb
>>> tiledb.__version__
'x.y.z' # The version will appear here

$ R
> library(tiledb)
> packageVersion("tiledb")
'x.y.z' # The version will appear here

// Compile and run:
Version version = new Version();
System.out.println(version);

For Python, also make sure you have already installed pandas and pyarrow, or alternatively run:

$ conda install pandas
$ conda install -c conda-forge pyarrow

We have added a couple of public arrays that every user on TileDB Cloud can access, under our TileDB-Inc organization. Below we show an example for getting the schema of array tiledb://TileDB-Inc/quickstart_sparse, and slicing its contents:

import tiledb

# Set your username and password
config = tiledb.Config()
config["rest.username"] = "xxx"
config["rest.password"] = "yyy"

# This is the array URI format in TileDB Cloud
array_name = "tiledb://TileDB-Inc/quickstart_sparse"

# Open the array
A = tiledb.open(array_name, 'r', ctx=tiledb.Ctx(config))

# Print the array schema
print(A.schema)
    
# This will print:
#
# ArraySchema(
#  domain=Domain(*[
#    Dim(name='rows', domain=(1, 4), tile=4, dtype='int32'),
#    Dim(name='cols', domain=(1, 4), tile=4, dtype='int32'),
#  ]),
#  attrs=[
#    Attr(name='a', dtype='uint32', var=False, nullable=False),
#  ],
#  cell_order='row-major',
#  tile_order='row-major',
#  capacity=10000,
#  sparse=True,
#  allows_duplicates=False,
#  coords_filters=FilterList([ZstdFilter(level=-1)]),
#)

# Print all the contents of the array
print(A.df[:])

# This will print
#
#    rows  cols  a
# 0     1     1  1
# 1     2     3  3
# 2     2     4  2

# Close the array
A.close()

library(tiledb)

# Set your username and password
config <- tiledb_config()
config["rest.username"] <- 'xxx'
config["rest.password"] <- 'yyy'
ctx <- tiledb_ctx(config)

# This is the array URI format in TileDB Cloud
array_name = "tiledb://TileDB-Inc/quickstart_sparse"

# Open the array
arr <- tiledb_array(array_name, return_as='data.frame')

# Print the array schema
schema(arr)

# This will print:
# tiledb_array_schema(
#     domain=tiledb_domain(c(tiledb_dim(name="rows", domain=c(1L,4L), tile=4L, type="INT32"), tiledb_dim(name="cols", domain=c(1L,4L), tile=4L, type="INT32"))),
#     attrs=c(tiledb_attr(name="a", type="UINT32", ncells=1, nullable=FALSE)),
#     cell_order="ROW_MAJOR", tile_order="ROW_MAJOR", capacity=10000, sparse=TRUE, allows_dups=FALSE,
#     coords_filter_list=tiledb_filter_list(c(tiledb_filter_set_option(tiledb_filter("ZSTD"),"COMPRESSION_LEVEL",-1))),
#     offsets_filter_list=tiledb_filter_list(c(tiledb_filter_set_option(tiledb_filter("ZSTD"),"COMPRESSION_LEVEL",-1))))
    
# Print all the contents of the array
df <- arr[]
df

# This will print
#   rows cols a
# 1    1    1 1
# 2    2    3 3
# 3    2    4 2

# Close the array
tiledb_array_close(arr)

//set the config
Config config = new Config();
config.set("rest.username", "xxx");
config.set("rest.password", "yyy");
Context ctx = new Context(config);

//open the array
Array array = new Array(ctx, "tiledb://TileDB-Inc/quickstart_sparse");

//print the array schema
System.out.println(array.getSchema());

//this will print:
//ArraySchema<TILEDB_SPARSE io.tiledb.java.api.Domain@51462713 Attr<a,TILEDB_UINT32,1>>

Congratulations! You just performed your first array query to a public array in TileDB Cloud!

Access via the TileDB Cloud Client

There are several TileDB Clients that allow you to perform pretty much any kind of task you would otherwise do via the TileDB Cloud online console (also described later in this tutorial). Let's first install it with:

# At your shell prompt
pip install tiledb-cloud

# At your shell prompt
# See https://github.com/TileDB-Inc/TileDB-Cloud-R/releases for latest:
remotes::install_github('TileDB-Inc/TileDB-Cloud-R@v0.0.8')

// Maven
//Include this in your project:
<dependency>
  <groupId>io.tiledb</groupId>
  <artifactId>tiledb-cloud-java</artifactId>
  <version>X.X.X</version>
</dependency>

// Or build from source
$ git clone https://github.com/TileDB-Inc/TileDB-Cloud-Java.git
$ cd TileDB-Cloud-Java
$ ./gradlew assemble

Check that it installed properly as follows:

$ python
>>> import tiledb.cloud

$ R
> library(tiledbcloud)

Let's get the description of the array we used above:

import tiledb.cloud

# Set your username and password
config = tiledb.Config()
config["rest.username"] = "xxx"
config["rest.password"] = "yyy"

info = tiledb.cloud.info("tiledb://TileDB-Inc/quickstart_sparse")
print(info)

# This prints:
#
# {'access_credentials_name': None,
# 'allowed_actions': ['read', 'read_array_info', 'read_array_schema'],
# 'description': '# Quickstart Sparse Array\n'
#                '\n'
#                'This is a very simple 2D sparse array, used to quickly '
#                'demonstrate basic TileDB Cloud functionality.',
# 'file_properties': None,
# 'file_type': None,
# 'id': '69ca9424-2578-44d0-8662-87897d9e2941',
# 'last_accessed': datetime.datetime(2021, 5, 3, 19, 1, 15, tzinfo=tzutc()),
# 'license_id': 'MIT',
# 'license_text': 'MIT License Copyright (c) <year> <copyright holders>\n'
#                 '\n'
#                 'Permission is hereby granted, free of charge, to any person '
#                 'obtaining a copy of this software and associated '
#                 'documentation files (the "Software"), to deal in the '
#                 'Software without restriction, including without limitation '
#                 'the rights to use, copy, modify, merge, publish, distribute, '
#                 'sublicense, and/or sell copies of the Software, and to '
#                 'permit persons to whom the Software is furnished to do so, '
#                 'subject to the following conditions:\n'
#                 '\n'
#                 'The above copyright notice and this permission notice '
#                 '(including the next paragraph) shall be included in all '
#                 'copies or substantial portions of the Software.\n'
#                 '\n'
#                 'THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY '
#                 'KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE '
#                 'WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR '
#                 'PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS '
#                 'OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR '
#                 'OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR '
#                 'OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE '
#                 'SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.',
# 'logo': None,
# 'name': 'quickstart_sparse',
# 'namespace': 'TileDB-Inc',
# 'namespace_subscribed': False,
# 'pricing': None,
# 'public_share': True,
# 'share_count': 1.0,
# 'size': 903.0,
# 'subscriptions': None,
# 'tags': None,
# 'tiledb_uri': 'tiledb://TileDB-Inc/quickstart_sparse',
# 'type': 'sparse',
# 'uri': None

library(tiledbcloud)

# Set your username and password
config <- tiledb_config()
config["rest.username"] <- 'xxx'
config["rest.password"] <- 'yyy'
ctx <- tiledb_ctx(config)

info <- tiledbcloud::array_info(
  namespace="TileDB-Inc",
  arrayname="quickstart_dense"))

str(info)

# This prints:
# List of 14
#  $ id                  : chr "2d6e7def-851e-4832-ae92-202af1f6940d"
#  $ namespace           : chr "TileDB-Inc"
#  $ size                : int 835
#  $ last_accessed       : chr "2022-02-28T17:39:48Z"
#  $ description         : chr "# Quickstart Dense Array\n\nThis array is the results of running the quickstart dense example program. This arr"| __truncated__
#  $ name                : chr "quickstart_dense"
#  $ type                : chr "dense"
#  $ share_count         : int 1
#  $ public_share        : logi TRUE
#  $ namespace_subscribed: logi FALSE
#  $ tiledb_uri          : chr "tiledb://TileDB-Inc/quickstart_dense"
#  $ tags                : list()
#  $ license_id          : chr "MIT"
#  $ license_text        : chr "MIT License Copyright (c) <year> <copyright holders>\n\nPermission is hereby granted, free of charge, to any pe"| __truncated__

ApiClient defaultClient = Configuration.getDefaultApiClient();
defaultClient.setBasePath("https://api.tiledb.com/v1");

// Configure HTTP basic authorization: BasicAuth
HttpBasicAuth BasicAuth = (HttpBasicAuth) defaultClient.getAuthentication("BasicAuth");
BasicAuth.setUsername("xxx");
BasicAuth.setPassword("yyy");

ArrayApi apiInstance = new ArrayApi(defaultClient);

String namespace = "TileDB-Inc"; // String | namespace array is in (an organization name or user's username)
String array = "quickstart_sparse"; // String | name/uri of array that is url-encoded
try {
    ArrayInfo result = apiInstance.getArrayMetadata(namespace, array);
    System.out.println(result);
} catch (ApiException e) {
    System.err.println("Exception when calling ArrayApi#getArrayMetadata");
    System.err.println("Status code: " + e.getCode());
    System.err.println("Reason: " + e.getResponseBody());
    System.err.println("Response headers: " + e.getResponseHeaders());
    e.printStackTrace();
}

//this prints:
//class ArrayInfo {
//  id: 69ca9424-2578-44d0-8662-87897d9e2941
//  fileType: null
//  fileProperties: null
//  uri: null
//  namespace: TileDB-Inc
//  size: 903
//  lastAccessed: 2022-06-01T15:01:16Z
//  description: # Quickstart Sparse Array  
//  This is a very simple 2D sparse array, used to quickly demonstrate basic TileDB Cloud functionality.
//  name: quickstart_sparse
//  allowedActions: [read, read_array_info, read_array_schema]
//  pricing: null
//  subscriptions: null
//  logo: null
//  accessCredentialsName: null
//  type: sparse
//  shareCount: 1
//  publicShare: true
//  namespaceSubscribed: false
//  tiledbUri: tiledb://TileDB-Inc/quickstart_sparse
//  tags: [one, three, two]
//  licenseId: MIT
//  licenseText: MIT License Copyright (c) <year> <copyright holders>  
//  Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
//  The above copyright notice and this permission notice (including the next paragraph) shall be included in all copies or substantial portions of the Software.  
//  THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
//  readOnly: false
//  isFavorite: null
//}

Great work! In the following we will see how to view useful information directly through the TileDB Cloud console.

Using the TileDB Cloud Console

You can view the information of public arrays by directly navigating to the array's URL on TileDB Cloud, even if you are not signed in (TileDB Cloud provides "static view" of arrays). For example, you can click directly on https://cloud.tiledb.com/arrays/details/TileDB-Inc/quickstart_sparse/overview, which is the URL of the array we used in the previous sections. Once you do so, you will see:

Click on the Schema tab to see the array schema:

Next, let's sign in and see some activity logs. Click on Activity on the left-hand side menu, under Assets:

Here you can see the various tasks that you have performed (like slicing an array), along with other information, such as the time, cost, duration, etc. Clicking on a task provides further information (to be covered in other tutorials).

Finally, you can view your profile information by clicking Profileon the left-hand side menu

Here you can edit your personal information, change your password, etc.

What's Next?

Are interested in diving deeper into TileDB Cloud? Here is what we recommend you to do next.

Quick recipes:

Create API tokens for faster programmatic login.
Set up AWS credentials so that you can access and share your arrays through TileDB Cloud.
Add billing info, so that you can use the service once you run out of credits.

Next tutorials:

Get a console walkthrough
Start learning about serverless compute
Familiarize yourself with the power of task graphs
Learn to use TileDB Cloud in specific use cases

Product Tour

This page is currently under development and will be updated soon.

Namespace selector

TileDB Cloud has a versatile namespace selector designed to enhance your experience in managing data and collaborations.

Upon signing up, each user is allocated a dedicated, private primary namespace. This namespace serves as your personal workspace, ensuring your data remains isolated and organized (until explicitly shared with other users or orgs).

In addition to your private namespace, TileDB offers the capability to create or join multiple organization namespaces. These spaces fosters seamless teamwork by allowing users to collaborate on projects, share resources, and collectively manage data.

The namespace selector enables effortless movement between these private and organizational spaces, facilitating a smooth transition as you navigate between different projects and contexts.

Notifications

You can receive notifications for various actions happening in TileDB Cloud. For instance, you can be notified when you're invited to join an organization or when someone shares an asset with you.

Overview

When you log in, the first page you see is Overview. Here you can see a summary of your assets, your current bill and your recent activity.

Launch server

You can easily launch Jupyter notebook server instances within TileDB Cloud.

Launching a Notebook server instance from TileDB Cloud, while boasting a wide range of advantages, might exhibit slightly extended launch times around 30-40 seconds. This is due to the careful allocation of resources that underpin its performance and dependability.

Assets sidebar

You can catalogue and access a wide-range of asset types in TileDB Cloud. From generic, fundamental assets like arrays, files, notebooks and UDFs to more sophisticated assets used across broader applications verticals, including geospatial analysis, genomics research and machine learning.

This holistic approach ensures that whether you're working with traditional data types or delving into specialized domains, TileDB Cloud is your one-stop solution for streamlined asset management across diverse fields.

Asset browsers

Each asset category has it's dedicated browser where you can also filter and search for specific assets. In the asset browser you can navigate between:

My tab : Your registered assets
Shared tab : Assets that are shared with you
Public tab : Publicly available assets
Favorites tab : Assets that are marked as favorite

You can use keywords in the search field to search by name, tag or phrases included in the description of the public data and code.

Assets

Assets constitutes data, code and data products that belong to you or an organization you are a member of, as well as data and code shared with you by other users and organizations.

Asset types

The asset categories currently supported by TileDB Cloud are listed in the table below. These assets can be registered and accessed in TileDB Cloud with various methods, described later in an another section.

Overview

You can preview various information from the overview tab of an asset. Rich descriptions, tags, URIs, permissions, versioning information along with some asset specific information.

Preview

The preview tab displays important information relative to the asset contents.

Previews are not supported for every asset type yet, but we continue to expand the feature gradually.

Schema

Explicitly on array assets you can view detailed information regarding the schema of the array.

Metadata

Most of the asset types come with metadata, either inherited by the asset type itself, or defined by the user.

Any asset can be shared with explicit permissions via username or email. If the email invited is not a TileDB account already, it will prompt the user for a signup first.

Settings

From the settings tab you can update your asset description and license, assign tags, rename or remove your asset, change the cloud credentials and make your asset publicly accessible.

Actions

Some assets have specific actions associated with them (highlighted by the blue buttons) such as the ability to download the asset, copy it to another namespace, launch a notebook or quickly add a description to the asset.

Adding assets

Adding assets to TileDB Cloud usually consists of two actions:

The creation or transformation of an existing or new asset to a multi-dimensional array. This can be done programmatically, via ingestion or straight from TileDB Cloud.
The registration of the asset from it's original storage location (usually S3, Azure or other cloud storage provider) to TileDB Cloud. Again this can happen either programmatically, via ingestion or from TileDB Cloud.

It's pretty common for the creation and registration of an asset to happen simultaneously.

For example when uploading a file from your computer at TileDB Cloud, it gets automatically transformed into an array, registered at your preferred namespace and saved to your selected cloud storage provider. Voila! 🔮

TileDB doesn't host any of your assets in it's own servers. Instead it utilises cloud-native practices to connect with all popular cloud storage providers such as S3, Azure and more.

Activity

You can view all the logged activity of assets you have access to.

Profile

You can view and edit your primary and organization profile, add cloud credentials, default storage paths, API tokens, and manage your billing.

On organizations you can also manage your team members.

Serverless Compute 101

In this tutorial, you will learn:

How to access (slice) a public array
How to perform a SQL query on a public array.
How to perform serverless UDFs on a public array.

Running on TileDB Cloud

Running on your own client

You can run all the commands of this notebook in your own client. The only changes required are:

Task Graphs 101

In this tutorial, you will learn:

How to use task graphs and specifically the Delayed API
How to scale your computation, significantly boosting performance, all serverless
How to eliminate egress costs

We will use public TileDB Cloud array TileDB-Inc/nyctlcyellowtripdata_2019, which stores the data from the NYC yellow taxi dataset for the year of 2019. The original data is in CSV format with collective size of about 7GB, which is converted into a TileDB 1D sparse array with the size being compressed down to ~1GB. The selected sparse dimension is tpep_pickup_datetime, which means that the array supports very fast range slicing (and, therefore, also partitioning) on that column of the dataset.

Running on TileDB Cloud

You can preview this tutorial as TileDB Cloud notebook (no login is needed). You can also easily launch it within the TileDB Cloud UI console, but you will need to sign up / login to do so.

Running on your own client

You can run all the commands of this notebook in your own client. The only changes required are:

Install the TileDB Cloud client
Log in using the TileDB Cloud client before running any notebook command

Use Cases

This part of the tutorials is work in progress. We will soon add tutorials for each of the use cases we are working on, such as dataframes, LiDAR, genomics, biomedical imaging, satellite imaging, weather, time series, and many more. Stay tuned!

Last updated May 12th

LiDAR

LiDAR Quickstart

In this tutorial you will learn:

How to slice LiDAR data natively from a TileDB array
How to visualize the sliced data.
How to run SQL queries on LiDAR data directly from TileDB

Running on TileDB Cloud

Running on your own client

You can run all the commands of this notebook in your own client. The only changes required are:

Concepts

Serverless Compute

When it comes to scalable analytics, we observed the following challenges:

Spinning up and monitoring clusters on the cloud is cumbersome and can get expensive.
Users frequently do not know how many machines to provision in a cluster for a given workload. This results in either under provisioning that impacts performance, or over provisioning that leads to wasted cost due to idle compute.
When users slice array data from TileDB Cloud only to further process it in their own compute environment, (1) they get charged for egress, and (2) the performance is impacted by the extra network transmission cost that occurs between the TileDB Cloud machines and their own machines.

TileDB Cloud allows users to access and compute on arrays in a serverless manner from the user's standpoint, i.e., without thinking about provisioning for machines, and paying for idle compute or unnecessary egress. TileDB Cloud automatically parallelizes all tasks across thousands of machines and monitors their progress.

TileDB Cloud supports the following tasks:

Users can submit numerous such tasks concurrently and TileDB Cloud will process all of them in parallel, elastically expanding and shrinking its computational resources on demand without supervision by the user. That is, TileDB Cloud provides extreme multi-tenancy by default.

Serverless SQL and array UDFs (i.e., UDFs that are specifically applied to one or more TileDB arrays) have the additional benefit that they can minimize egress by reducing the returned results size, which is true especially in aggregation queries.

Any distributed algorithm can be modeled as a directed graph, where the nodes represent atomic tasks and the edges represent tasks dependencies (i.e., a task cannot begin its execution before all the tasks from the incoming edges have completed their execution). TileDB Cloud supports such task graphs, which can be programmatically created by the user and submitted to the platform. TileDB Cloud is responsible for parallelizing all tasks while respecting the dependencies, and for monitoring all progress. Task graphs are a powerful tool for creating any sophisticated algorithm and scale it on TileDB Cloud.

TileDB Cloud also provides automation for spinning up JupyterLab instances, so that users can run Jupyter notebooks without having to manually set up servers and deploy JupyterLab. This makes it very easy for user to kickstart their data analysis on TileDB Cloud.

Finally, TileDB Cloud treats UDFs and notebooks as data and thus allows users to share runnable code just as easily as they share data. This makes TileDB Cloud a powerful platform for collaboration and reproducibility of scientific results.

Console and API

TileDB Cloud comes with two user-facing components:

It is generally faster to use API tokens in the TileDB Cloud client.

TileDB Cloud Internals

This page group provides details on the TileDB Cloud internal mechanics. You can navigate from the menu on the left, or through the following links:

Architecture

This page describes the architecture of our TileDB Cloud SaaS offering.

Currently, TileDB Cloud (SaaS) runs on AWS, but in the future it will be deployed on other cloud providers. The principles around multiple cloud regions and cloud storage described in the architecture below are directly extendible to other settings (on the cloud or on premises).

The following figure outlines the TileDB Cloud architecture, which is comprised of the following components:

Automatic Redirection
Orchestration
UI Console
System State
REST Workers
Jupyter Notebooks

We explain each of those components below.

Automatic Redirection

TileDB Cloud maintains compute clusters in multiple cloud regions, geographically distributed across the globe. The reason is that users may store their data in cloud buckets located in different regions, and it is always faster and more economical to send the compute to the data; that eliminates egress costs, reduces latency and increases network speeds. However, users may not know which region the array they are accessing is located.

To facilitate sending the compute to the appropriate region, TileDB Cloud supports automatic redirection using the Cloudflare Workers service. This provides a scalable and serverless way to lookup the region of the array being accessed (maintaining a fast key-value store that is always in sync with the System State) and issue a 302 temporary redirect to the HTTP request. TileDB Open Source and the TileDB Cloud client will honor the redirection and send the the request to the TileDB Cloud service in the proper region (see Orchestration).

If your array lives in a cloud region unsupported by TileDB Cloud, the request is sent to us-east-1. We plan a future improvement to redirect to the nearest region instead.

Currently, automatic redirection is enabled by default, and the behavior can be controlled by using a configuration parameter. The user can also always dispatch any query directly to a specific region.

Orchestration

In every cloud region, TileDB Cloud maintains a Kubernetes cluster that carries out all tasks, properly autoscaling and load balancing to match capacity with demand based upon several factors. We use the Kubernetes built in metrics and monitoring toolchain to ensure pod memory usage is monitored and we have an accurate picture of the real world workloads at all times.

Currently supported regions:

us-east-1
us-west-2
eu-west-2
ap-southeast-1

In each region we use a variety of compute EC2 instance types, predominantly from m5, c5 and r5classes.

UI Console

The TileDB Cloud user interface console (https://cloud.tiledb.com) is a web app written in React that uses the REST Workers API across the same procedures and protocols as the clients. Many of the same routes are also used directly from one of the many clients, such a TileDB-Cloud-Py or TileDB-Cloud-R. The console web app autoscales based on the load, but currently it runs only inside the us-east-1cluster.

System State

TileDB Cloud maintains persistent state about user records, arrays, UDFs, billing, activity and more by using an always encrypted MariaDB instance. This instance is maintained in the us-east-1 region. In addition, this state is replicated and synced at all times with a read-only MariaDB instance maintained in every other supported region, in order to reduce latency for the queries executed in those regions.

REST Workers

TileDB Cloud's architecture is centered around a REST API Service. The service is a Go based application which provides all of the base functionality such as user management, authentication and access control, billing and monetization (via integration with Stripe), UDF execution, and serverless SQL orchestration used in TileDB Cloud. The REST Service is deployed in Kubernetes with a stateless design that allows for distributed orchestration and execution without the need for centralized coordination or locking.

The REST Service monitors resource usage and does its own book keeping in order to determine if it can service a request or if it should inform the client to retry later. By allowing the client to manage retries and with the high availability of the REST service architecture. TileDB Cloud is able to gracefully load balance and distribute the work across multiple instances.

The REST service handles the following types of serverless tasks, building upon the TileDB Open Source library:

Access (read/write)
SQL
UDFs

Jupyter Notebooks

TileDB Cloud offers hosted Jupyter notebooks by using Jupyter Docker Stacks for the base conda environments, and Jupyterhub / Zero to Jupyterhub K8S for the runtime environment. The notebooks are spawned inside Kubernetes using kubespawner to offer an isolated environment for each user with their own dedicated and persisted storage.

Currently, Jupyter notebooks can be spawned in the us-east-1 region, but soon TileDB Cloud will support multiple regions for notebooks.

Connectivity (Firewall) Requirements

TileDB Cloud runs over standard http connectivity, using tcp ports 80 and443. Connection made on port 80 are automatically redirected to https over port 443.

Open ID Connect

TileDB Cloud provides Open ID Connect support that can be used with any Open ID Connect compatible service. TileDB Cloud provide a fixed set of IP address used for the outbound request as part of the Open ID Connect sequence.

eu-west-2

13.41.67.254
18.134.194.194
18.135.61.196

us-west-2

35.81.95.218
54.185.206.57
54.189.31.204

us-east-1

52.21.38.106
54.87.160.2
52.70.6.129

ap-southeast-1

13.213.235.67
54.255.255.186
52.76.199.70

See Corporate SSO with TileDB Cloud SaaS if you are interested in enabling OIDC support for TileDB Cloud SaaS in your own environment.

Array Access

Array access refers to any read or write operation to an array registered with TileDB Cloud and referenced via its tiledb:// URI. Each array access is directed to a particular Kubernetes cluster in a specific cloud region as explained in Automatic Redirection. Then this request is assigned to a REST worker pod in an elastic and load balanced manner. That worker uses 16 CPU cores and sets the total result buffer size for TileDB Open Source to 2GB RAM.

The REST worker performs authentication (looking up the system state), logs all activity, manages billing and monetization, and enforces the access policies. Most importantly, each REST worker is totally stateless, and requires no synchronization or locking, allowing TileDB Cloud to scale very gracefully and quickly recover from failure via retry policies.

Access Control and Logging

One of the most powerful feature of TileDB Cloud is that it allows users to share arrays, UDFs and notebooks at extreme scale, with anyone on the planet, and with diverse polices (e.g., read, write, read/write). There are no restrictions on the number of users data and code can be shared with.

Currently, TileDB Cloud supports access policies at the array level. However, soon it will support finer-grained access policies at the cell level.

TileDB Cloud also enables users to create organizations, in order to better manage access to their assets and manage billing. You can create any number of organizations.

TileDB Cloud maintains a global system state using MariaDB, recording all information required to know which assets belong to which users and who has access to the various assets.

TileDB Cloud logs everything: the task types, the users that initiated them, duration, cost, etc. All this information gets logged by the REST workers into the persistent and encrypted MariaDB instance. The activity can then be browsed on the TileDB Cloud UI console or retrieved programmatically using the TileDB Cloud client. Six months of logs are made available for instant retrieval. Contact us if you need longer retention or ways to perform offline audits of historical logs for your organization.

By default, sessions on TileDB Cloud will timeout after 8 hours. SSO session timeout is controlled by organizational policies.

Serverless SQL

TileDB Cloud allows you to perform any SQL query on your TileDB arrays in a serverless manner. No need to spin up or tear down any computational resources. You just need to have the TileDB Cloud client installed (see Installation). You get charged only for the time it took to run the SQL operation.

TileDB Cloud currently supports serverless SQL only through Python, R, and Java, but support for more languages will be added soon.

TileDB Cloud receives your SQL query and executes it on a stateless REST worker that runs a warm MariaDB instance using the MyTile storage engine. The results of the SQL query can be returned directly to you (when using TileDB-Cloud-Py version 0.4.0 or newer) or they can be written back to an S3 array of your choice (either existing or new). Any array access happens on the same REST instance running the SQL query to optimize performance.

When results are returned directly, they are sent to the client in either JSON or Apache Arrow format, and in Python they are converted into a Pandas dataframe. This is most suitable for small results, such as aggregations or limit queries.

Writing results to an S3 array is necessary to allow processing of SQL queries with large results, without overwhelming the user (who may be on their laptop). The user can always open the created TileDB array and fetch any desired data afterwards.

Each TileDB Cloud REST worker running a SQL query uses 16 CPUs and has a limit of 2GB RAM. Therefore, you must consider "sharding" a SQL query so that each result fits in 2GB of memory (see Task Graphs). In the future, TileDB Cloud will offer flexibility in choosing the types of resources to run SQL on.

All SQL queries will time out after 15 minutes.

Serverless UDFs

There are two types of supported UDFs:

Generic: These can include any code.
Array UDFs: These are UDFs that are applied to slices of one or more arrays.

Running UDFs is particularly useful if you want to perform reductions (such as a sum or an average), since the amount of data returned is very small regardless of how much data the UDF processes.

TileDB Cloud currently supports only Python and R UDFs, but support for more languages will be added soon.

TileDB Cloud runs your UDF in a separate dedicated container for security. Any array access is executed in parallel on the same REST worker but separate containers, and the results are sent to the UDF container using zero-copy techniques for performance.

We offer Python and R UDF images based on the following versions:

Python 3.7 is deprecated in User Defined Functions and is no longer updated as of January 31st, 2024. Registered User Defined Functions under python 3.7 will continue to be available for execution with the packages listed on this page until August, 2024.

In the default environment that the UDF runs, we include the following Python packages:

In the default environment that the UDF runs, we include the following R packages:

Geospacial image (geo) is based on Python images and include the following packages:

Genomics image (genomics) is based on Python images and include the following packages:

Imaging image (imaging-dev) is based on Python images and include the following packages:

Vector search image (vectorsearch) is based on Python images and includes the following packages:

Each UDF allows for the following configurations to be used:

In the future, TileDB Cloud will offer more flexibility in choosing the types of resources to run the UDF on.

All UDFs will time out by default after 15 minutes, the value is configurable when submitting a UDF by using the timeout parameter.

Task Graphs

TileDB Cloud allows you to build arbitrary (directed acyclic) task graphs to combine any number of different tasks into one workflow. You can combine serverless UDFs, SQL and array access along with even local execution of any function.

TileDB Cloud currently supports serverless task graphs only in Python, but support for more languages will be added soon.

The task graph is currently driven by the client. The client can be in a hosted notebook, your local laptop, or even a serverless UDF itself. The client manages the graph, and dispatches the execution of severless jobs or local functions.

Currently, there is no node-to-node communication in a task graph. TileDB does offer server side passing of inputs and outputs without round tripping to a client. This provides the ability to efficiently pass data between stages of the task graph.

The local driver uses the Python ThreadPoolExecutor by default to drive the tasks. The default number of workers is 4 * #cores on the client machine. Python allows multiple serverless tasks to run as they use asynchronous HTTP requests. Serverless tasks will scale elastically. As you request more tasks to be run, TileDB Cloud launches more resources to accommodate the tasks.

Local functions are subject to the Python GIL (interpreter lock) if the task graphs use the ThreadPoolExecutor (default). This limits the concurrency of local functions, however serverless functionality is minimally effected.

Jupyter Notebooks

TileDB Cloud enables the user to launch Jupyter notebooks within the UI console. It spins up Jupyter notebook instances in the Kubernetes cluster in us-east-1. The user can install any extra packages in the notebook. The notebook server environment is destroyed on shutdown. Any extra packages installed will not persist across server instances.

Every user gets a 2GB persistent storage in an EBS volume (also in us-east-1). This is mounted as the home directory in the notebook server. All contents in the home directory will persist across server restarts. The user does not get charged for storage!

Currently, TileDB offers two notebook server sizes:

Size

CPUs

Memory

Small

2GB

Large

60GB

As explained in the Pricing and Billing section, notebooks are charged based on the size of the notebook server and duration it is run for.

Currently notebook usage is charged either to an organization a user belongs to or, if the user is not part of an organization, to the user themselves. We plan a future improvement to allow selecting who to charge for the notebook usage.

TileDB Cloud offers three notebook images, with the following installed packages:

Basic Data Science:tiledb, libtiledb-sql-py, plotly, ipywidgets, graphviz, pandas, pydot, trimesh, numpy, chardet, numba, tiledb-r, voila, opencv, tiledb-cloud, pybabylonjs, envbash, tiledb-ml
Genomics: Everything in the Basic Data Science notebook plus:snakemake, tiledb-vcf, htslib, bcftools, pybedtools
Geospatial: Everything in the Basic Data Science notebook plus:cartopy, datashader, descartes, folium, geos, geotiff, holoviews, imagemagick, laszip, libnetcdf, proj, shapely, scikit-build, proj, gdal, rasterio, mb-system, pdal, fiona, geopandas, scikit-mobility, xarray, tiledb-segy, capella-tools

Marketplace

You need to have a Stripe account in order to take advantage of the monetization feature.

The monetization feature is currently in beta and works for arrays, but not for UDFs and notebooks yet.

You can select any of the arrays you own and specify the following:

How To

Account

This page group contains simple recipes for managing your account. You can find the contents below:

Create API Tokens

This page is currently under development and will be updated soon.

You can create an API token by navigating to API tokensin Settings , as shown below. You can also provide an expiration date for your token upon creation. You can create multiple tokens and revoke them any time.

Set Up Credentials

This page is currently under development and will be updated soon.

In order to be able to create, register, and access arrays through the TileDB Cloud service, you need to set up access credentials. For S3 compatible object stores, TileDB Cloud supports both IAM Roles and Access Credential key pairs. TileDB Cloud securely stores all keys in an encrypted database and never grants your keys to any other user. TileDB Cloud uses your keys in containarized stateless workers, which are under TileDB's full control and inaccessible by any other user's code (e.g., SQL or UDFs).

Note: You can add multiple AWS keys to TileDB Cloud, register different arrays with different keys, select a key to be your default key, and revoke any key at any time.

AWS Access Keys

You can add your AWS keys from the AWS credentials tab of Settings as follows:

AWS Assume Role

With an AWS AssumeRole policy we are solving the very same issue we used keys before: Enable AWS cross-account access, so that a role in one account can access a bucket in a separate account.

When using AWS AssumeRole, temporary keys are created through the Service Token Service (STS), and used from the deployment party (TileDB Cloud Console). This means that for organisation purposes there is no need to create an AWS IAM User for every user logging into TileDB Cloud Console and generate key pairs. Instead, after a User is authenticated, the AssumeRole functionality enables TileDB Cloud Console to access the bucket on behalf of a User and the credentials used in that case can be reused by multiple Users in the same organisation that need to access the same S3 buckets.

As an example, let 's consider the account (Account A) we are signing up with TileDB Cloud to access bucket(s) in User's AWS account (Account B). For that purpose, Account B has a bucket created. The most common setup is to create an IAM role for TileDB Cloud to use and then allow it to access a specific bucket with an AWS S3 bucket policy. Requests for access to the bucket will only be granted coming from our AWS account with our external ID.

Steps:

In TileDB Cloud Console navigate to Settings then select the tab Cloud Credentials

Click Add credentials, then select ARN Role and click Next and Next in the following step which is just a short description

Select tab Existing Role that presents the Account A ID as well as the External ID

Select tab New Role that proposes the JSON Account B can use to create the role. Please note Account A Account ID as well as the External ID

In Account B, User (or Admin) can create the bucket policy

In Account B User (or Admin) can create the role, using Account A ID and External ID

In Account B User (or Admin) has to attach the policy to the role

Obtain ARN for the new role

Then press Next in TileDB Console Add Credentials modal dialog and enter a name for the new AssumeRole Credentials and the ARN obtained in previous step

Test the connection

Example configurations have been detailed below:

AWS IAM Role:

Note: Both the AWS Principal and External ID will be provided when attempting to register the ARN role

AWS S3 Bucket Policy:

Setup KMS for the target bucket

It is possible that encryption is needed for the target bucket

To enable KMS usage for the target bucket, it is needed to edit the policy for the KMS key and add a statement that gives access to the role used previously

Example configuration is provided below

Edit Your Profile

Adjust your account details

This page is currently under development and will be updated soon.

You can change any information in your profile from the Profile tab in your Settings:

Set Up Default Storage Path

This page is currently under development and will be updated soon.

When you create UDFs, notebooks, dashboards, groups, and ML models, TileDB Cloud stores each asset as a TileDB object (array or group). In order to do so, it requires you to provide a default storage path for storing those assets. You can do so from your profile settings as follows:

Manage Billing

This page is currently under development and will be updated soon.

You can add your payment method and see your billing information, including your current balance breakdown and all invoices, by navigating to the Billing tab of your Settings :

Arrays

This page group contains simple recipes for managing your arrays. You can find the contents below:

Create Arrays

Instead of using <array-uri> as you would typically in TileDB Open Source, you must use tiledb://<username>/<array-uri>. For example, if you wish to create an array at s3://my_bucket/my_array, you need to set the array URI to tiledb://my_username/s3://my_bucket/my_array and TileDB Open Source will instruct TileDB Cloud to automatically register the array as tiledb://my_username/my_array.

Register Arrays

There is absolutely no data movement upon registering an array with TileDB Cloud. Your data continues to remain in your bucket and you are the full owner of this data. TileDB Cloud just records the path and the AWS keys that can access it, so that it can govern access when you or the users you share this array with need to access it.

View Array Details

This page is currently under development and will be updated soon.

Deleting (deregistering) an array (in Settings) does not physically delete your array from the physical cloud storage. It simply deregisters the array from TileDB Cloud. Your data will still be accessible by you outside of TileDB Cloud if you own the appropriate AWS access keys.

Renaming an array (in Settings) is under the danger zone, because from that point onwards you (and all the users you shared the array with) will have to change your code to add the new array name. The array will still be shared and accessible by the other users, but they will need to add the new name to their code. In other words, TileDB Cloud does not currently support automatic redirection of array URIs upon renaming.

Share Arrays

This page is currently under development and will be updated soon.

You can share a registered array with any other user on TileDB Cloud. Currently, you can specify array-wide policies, such as read, write and read/write. We plan to add finer-grained access policies soon. To share an array, find it on Assets -> Arrays and either click on the sharing button located on the right end of the array card in the list, or click on the array card and navigate to the Sharing tab. The added member will appear in the array members list, where you will be able to change the access policy or revoke access from the user. Users get notified by email when someone shares an array with them.

When sharing an array with other users, you do not get charged for the accesses that those users make. You only get charged for the accesses that you make on your arrays.

When sharing with a member, TileDB Cloud uses auto-complete to facilitate finding a username you are looking for. Similar to GitHub/GitLab, the usernames are considered public information (in contrast to full names and emails that are protected). Please email us at privacy@tiledb.com if you wish your username to be excluded from auto-complete.

Note that the array URL when you are viewing its Overview is shareable, and another user can view it on their browser if they have access to it. URLs of public arrays can be viewed by users, even if they are not logged in.

API Reference

More Help

Start Here!

This page is currently under development and will be updated soon.

In this tutorial you will learn how to:

Sign up / sign in to TileDB Cloud
Access a public array using the TileDB-Py library
Access basic array information using the TileDB Cloud client
View the public array on the TileDB Cloud console
View a task on the TileDB Cloud console
View and edit your profile on the TileDB Cloud console

Access a Public Array

# Pip:
$ pip install tiledb

# Or Conda:
$ conda install -c conda-forge tiledb-py

# Via CRAN
install.packages("tiledb")

// Maven
//Include this in your Maven project:
<dependency>
  <groupId>io.tiledb</groupId>
  <artifactId>tiledb-java</artifactId>
  <version>X.X.X</version>
</dependency>

// Or build from source
$ git clone https://github.com/TileDB-Inc/TileDB-Java.git
$ cd TileDB-Java
$ ./gradlew assemble

To validate the installation, run:

$ python
>>> import tiledb
>>> tiledb.__version__
'x.y.z' # The version will appear here

$ R
> library(tiledb)
> packageVersion("tiledb")
'x.y.z' # The version will appear here

// Compile and run:
Version version = new Version();
System.out.println(version);

For Python, also make sure you have already installed pandas and pyarrow, or alternatively run:

$ conda install pandas
$ conda install -c conda-forge pyarrow

import tiledb

# Set your username and password
config = tiledb.Config()
config["rest.username"] = "xxx"
config["rest.password"] = "yyy"

# This is the array URI format in TileDB Cloud
array_name = "tiledb://TileDB-Inc/quickstart_sparse"

# Open the array
A = tiledb.open(array_name, 'r', ctx=tiledb.Ctx(config))

# Print the array schema
print(A.schema)
    
# This will print:
#
# ArraySchema(
#  domain=Domain(*[
#    Dim(name='rows', domain=(1, 4), tile=4, dtype='int32'),
#    Dim(name='cols', domain=(1, 4), tile=4, dtype='int32'),
#  ]),
#  attrs=[
#    Attr(name='a', dtype='uint32', var=False, nullable=False),
#  ],
#  cell_order='row-major',
#  tile_order='row-major',
#  capacity=10000,
#  sparse=True,
#  allows_duplicates=False,
#  coords_filters=FilterList([ZstdFilter(level=-1)]),
#)

# Print all the contents of the array
print(A.df[:])

# This will print
#
#    rows  cols  a
# 0     1     1  1
# 1     2     3  3
# 2     2     4  2

# Close the array
A.close()

library(tiledb)

# Set your username and password
config <- tiledb_config()
config["rest.username"] <- 'xxx'
config["rest.password"] <- 'yyy'
ctx <- tiledb_ctx(config)

# This is the array URI format in TileDB Cloud
array_name = "tiledb://TileDB-Inc/quickstart_sparse"

# Open the array
arr <- tiledb_array(array_name, return_as='data.frame')

# Print the array schema
schema(arr)

# This will print:
# tiledb_array_schema(
#     domain=tiledb_domain(c(tiledb_dim(name="rows", domain=c(1L,4L), tile=4L, type="INT32"), tiledb_dim(name="cols", domain=c(1L,4L), tile=4L, type="INT32"))),
#     attrs=c(tiledb_attr(name="a", type="UINT32", ncells=1, nullable=FALSE)),
#     cell_order="ROW_MAJOR", tile_order="ROW_MAJOR", capacity=10000, sparse=TRUE, allows_dups=FALSE,
#     coords_filter_list=tiledb_filter_list(c(tiledb_filter_set_option(tiledb_filter("ZSTD"),"COMPRESSION_LEVEL",-1))),
#     offsets_filter_list=tiledb_filter_list(c(tiledb_filter_set_option(tiledb_filter("ZSTD"),"COMPRESSION_LEVEL",-1))))
    
# Print all the contents of the array
df <- arr[]
df

# This will print
#   rows cols a
# 1    1    1 1
# 2    2    3 3
# 3    2    4 2

# Close the array
tiledb_array_close(arr)

//set the config
Config config = new Config();
config.set("rest.username", "xxx");
config.set("rest.password", "yyy");
Context ctx = new Context(config);

//open the array
Array array = new Array(ctx, "tiledb://TileDB-Inc/quickstart_sparse");

//print the array schema
System.out.println(array.getSchema());

//this will print:
//ArraySchema<TILEDB_SPARSE io.tiledb.java.api.Domain@51462713 Attr<a,TILEDB_UINT32,1>>

Congratulations! You just performed your first array query to a public array in TileDB Cloud!

Access via the TileDB Cloud Client

# At your shell prompt
pip install tiledb-cloud

# At your shell prompt
# See https://github.com/TileDB-Inc/TileDB-Cloud-R/releases for latest:
remotes::install_github('TileDB-Inc/TileDB-Cloud-R@v0.0.8')

// Maven
//Include this in your project:
<dependency>
  <groupId>io.tiledb</groupId>
  <artifactId>tiledb-cloud-java</artifactId>
  <version>X.X.X</version>
</dependency>

// Or build from source
$ git clone https://github.com/TileDB-Inc/TileDB-Cloud-Java.git
$ cd TileDB-Cloud-Java
$ ./gradlew assemble

Check that it installed properly as follows:

$ python
>>> import tiledb.cloud

$ R
> library(tiledbcloud)

Let's get the description of the array we used above:

import tiledb.cloud

# Set your username and password
config = tiledb.Config()
config["rest.username"] = "xxx"
config["rest.password"] = "yyy"

info = tiledb.cloud.info("tiledb://TileDB-Inc/quickstart_sparse")
print(info)

# This prints:
#
# {'access_credentials_name': None,
# 'allowed_actions': ['read', 'read_array_info', 'read_array_schema'],
# 'description': '# Quickstart Sparse Array\n'
#                '\n'
#                'This is a very simple 2D sparse array, used to quickly '
#                'demonstrate basic TileDB Cloud functionality.',
# 'file_properties': None,
# 'file_type': None,
# 'id': '69ca9424-2578-44d0-8662-87897d9e2941',
# 'last_accessed': datetime.datetime(2021, 5, 3, 19, 1, 15, tzinfo=tzutc()),
# 'license_id': 'MIT',
# 'license_text': 'MIT License Copyright (c) <year> <copyright holders>\n'
#                 '\n'
#                 'Permission is hereby granted, free of charge, to any person '
#                 'obtaining a copy of this software and associated '
#                 'documentation files (the "Software"), to deal in the '
#                 'Software without restriction, including without limitation '
#                 'the rights to use, copy, modify, merge, publish, distribute, '
#                 'sublicense, and/or sell copies of the Software, and to '
#                 'permit persons to whom the Software is furnished to do so, '
#                 'subject to the following conditions:\n'
#                 '\n'
#                 'The above copyright notice and this permission notice '
#                 '(including the next paragraph) shall be included in all '
#                 'copies or substantial portions of the Software.\n'
#                 '\n'
#                 'THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY '
#                 'KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE '
#                 'WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR '
#                 'PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS '
#                 'OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR '
#                 'OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR '
#                 'OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE '
#                 'SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.',
# 'logo': None,
# 'name': 'quickstart_sparse',
# 'namespace': 'TileDB-Inc',
# 'namespace_subscribed': False,
# 'pricing': None,
# 'public_share': True,
# 'share_count': 1.0,
# 'size': 903.0,
# 'subscriptions': None,
# 'tags': None,
# 'tiledb_uri': 'tiledb://TileDB-Inc/quickstart_sparse',
# 'type': 'sparse',
# 'uri': None

library(tiledbcloud)

# Set your username and password
config <- tiledb_config()
config["rest.username"] <- 'xxx'
config["rest.password"] <- 'yyy'
ctx <- tiledb_ctx(config)

info <- tiledbcloud::array_info(
  namespace="TileDB-Inc",
  arrayname="quickstart_dense"))

str(info)

# This prints:
# List of 14
#  $ id                  : chr "2d6e7def-851e-4832-ae92-202af1f6940d"
#  $ namespace           : chr "TileDB-Inc"
#  $ size                : int 835
#  $ last_accessed       : chr "2022-02-28T17:39:48Z"
#  $ description         : chr "# Quickstart Dense Array\n\nThis array is the results of running the quickstart dense example program. This arr"| __truncated__
#  $ name                : chr "quickstart_dense"
#  $ type                : chr "dense"
#  $ share_count         : int 1
#  $ public_share        : logi TRUE
#  $ namespace_subscribed: logi FALSE
#  $ tiledb_uri          : chr "tiledb://TileDB-Inc/quickstart_dense"
#  $ tags                : list()
#  $ license_id          : chr "MIT"
#  $ license_text        : chr "MIT License Copyright (c) <year> <copyright holders>\n\nPermission is hereby granted, free of charge, to any pe"| __truncated__

ApiClient defaultClient = Configuration.getDefaultApiClient();
defaultClient.setBasePath("https://api.tiledb.com/v1");

// Configure HTTP basic authorization: BasicAuth
HttpBasicAuth BasicAuth = (HttpBasicAuth) defaultClient.getAuthentication("BasicAuth");
BasicAuth.setUsername("xxx");
BasicAuth.setPassword("yyy");

ArrayApi apiInstance = new ArrayApi(defaultClient);

String namespace = "TileDB-Inc"; // String | namespace array is in (an organization name or user's username)
String array = "quickstart_sparse"; // String | name/uri of array that is url-encoded
try {
    ArrayInfo result = apiInstance.getArrayMetadata(namespace, array);
    System.out.println(result);
} catch (ApiException e) {
    System.err.println("Exception when calling ArrayApi#getArrayMetadata");
    System.err.println("Status code: " + e.getCode());
    System.err.println("Reason: " + e.getResponseBody());
    System.err.println("Response headers: " + e.getResponseHeaders());
    e.printStackTrace();
}

//this prints:
//class ArrayInfo {
//  id: 69ca9424-2578-44d0-8662-87897d9e2941
//  fileType: null
//  fileProperties: null
//  uri: null
//  namespace: TileDB-Inc
//  size: 903
//  lastAccessed: 2022-06-01T15:01:16Z
//  description: # Quickstart Sparse Array  
//  This is a very simple 2D sparse array, used to quickly demonstrate basic TileDB Cloud functionality.
//  name: quickstart_sparse
//  allowedActions: [read, read_array_info, read_array_schema]
//  pricing: null
//  subscriptions: null
//  logo: null
//  accessCredentialsName: null
//  type: sparse
//  shareCount: 1
//  publicShare: true
//  namespaceSubscribed: false
//  tiledbUri: tiledb://TileDB-Inc/quickstart_sparse
//  tags: [one, three, two]
//  licenseId: MIT
//  licenseText: MIT License Copyright (c) <year> <copyright holders>  
//  Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
//  The above copyright notice and this permission notice (including the next paragraph) shall be included in all copies or substantial portions of the Software.  
//  THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
//  readOnly: false
//  isFavorite: null
//}

Great work! In the following we will see how to view useful information directly through the TileDB Cloud console.

Using the TileDB Cloud Console

Click on the Schema tab to see the array schema:

Next, let's sign in and see some activity logs. Click on Activity on the left-hand side menu, under Assets:

Finally, you can view your profile information by clicking Profileon the left-hand side menu

Here you can edit your personal information, change your password, etc.

Congratulations! You have successfully completed your very first TileDB Cloud tutorial!

What's Next?

Are interested in diving deeper into TileDB Cloud? Here is what we recommend you to do next.

Quick recipes:

Create API tokens for faster programmatic login.
Set up AWS credentials so that you can access and share your arrays through TileDB Cloud.
Add billing info, so that you can use the service once you run out of credits.

Next tutorials:

Get a console walkthrough
Start learning about serverless compute
Familiarize yourself with the power of task graphs
Learn to use TileDB Cloud in specific use cases

Task Graphs

Delayed objects can be combined into a task graph, which is typically a directed acyclic graph (DAG). The output from one function or query can be passed into another, and dependencies are automatically determined.

from tiledb.cloud.compute import DelayedArrayUDF, Delayed, DelayedSQL
import numpy as np

# Build several delayed objects to define a graph
# Note that package numpy is aliased as np in the UDFs
local = Delayed(lambda x: x * 2, local=True)(100)
array_apply = DelayedArrayUDF("tiledb://TileDB-Inc/quickstart_sparse", 
        lambda x: np.sum(x["a"]), name="array_apply")([(1, 4), (1, 4)])
sql = DelayedSQL("select SUM(`a`) as a from `tiledb://TileDB-Inc/quickstart_dense`", name="sql")

# Custom function for averaging all the results we are passing in
def mean(local, array_apply, sql):
    return np.mean([local, array_apply, sql.iloc(0)[0]])

# This is essentially a task graph that looks like
#                 mean
#          /       |      \
#         /        |       \    
#      local  array_apply  sql 
#
# The `local`, `array_apply` and `sql` tasks will computed first,
# and once all three are finished, `mean` will computed on their results
res = Delayed(func_exec=mean, name="node_exec")(local, array_apply, sql)
print(res.compute())

library(tiledbcloud)

# Build several delayed objects to define a graph

# Locally executed; simple enough
local <- delayed(function(x) { x*2 }, local=TRUE)
delayed_args(local) <- list(100)

# Array UDF -- we specify selected ranges and attributes, then do some R on the
# dataframe which the UDF receives
array_apply <- delayed_array_udf(
  array="TileDB-Inc/quickstart_sparse",
  udf=function(df) { sum(as.vector(df[["a"]])) },
  selectedRanges=list(cbind(1,4), cbind(1,4)),
  attrs=c("a")
)

# SQL -- note the output is a dataframe, and values are all strings (MariaDB
# "decimal values") so we'll cast them to numeric later
sql <- delayed_sql(
  "select SUM(`a`) as a from `tiledb://TileDB-Inc/quickstart_dense`",
  name="sql"
)

# Custom function for averaging all the results we are passing in
ourmean <- function(local, array_apply, sql) {
    mean(c(local, array_apply, as.numeric(sql[["a"]])))
}

# This is essentially a task graph that looks like
#               ourmean
#          /       |      \
#         /        |       \    
#      local  array_apply  sql 
#
# The `local`, `array_apply` and `sql` tasks will computed first,
# and once all three are finished, `ourmean` will computed on their results.
# Note here we slot out the answer from the SQL dataframe using `[[...]]`,
# and also cast to numeric.
res <- delayed(ourmean, args=list(local, array_apply, sql))
print(compute(res, verbose=TRUE))

Modes of Operation

Realtime

The default mode of operation, realtime, is designed to return results directly to the client emphasis low latency. Realtime task graphs are scheduled and executed immediately and are well suited for fast distributed workloads.

Batch

In contrast to realtime task graphs, batch task graphs are designed for large, resource intensive asynchronous workloads. Batch task graphs are defined, uploaded, and scheduled for execution and are well suited for ingestion-style workloads.

Setting the Mode

The mode can be set for any of the APIs by passing in a mode parameter. Accepted values are BATCH or REALTIME

Delayed API mode:

batch_function = Delayed(numpy.median, mode=tiledb.cloud.dag.Mode.BATCH)

realtime_function = Delayed(numpy.median, mode=tiledb.cloud.dag.Mode.REALTIME)

batch_function <- delayed(median, mode=tiledbcloud::BATCH)

realtime_function <- delayed(median, mode=tiledbcloud::REALTIME)

Task Graph API mode:

batch_dag = tiledb.cloud.dag.DAG(mode=tiledb.cloud.dag.Mode.BATCH)

realtime_dag = dag = tiledb.cloud.dag.DAG(mode=tiledb.cloud.dag.Mode.REALTIME)

batch_dag = tiledbcloud::DAG(mode=tiledbcloud::BATCH)

realtime_dag = tiledbcloud::DAG(mode=tiledbcloud::REALTIME)

Generic Functions

Any Python/R function can be wrapped in a Delayed object, making the function executable as a future.

from tiledb.cloud.compute import Delayed
import numpy

# Wrap numpy median in a delayed object
x = Delayed(numpy.median)

# It can be called like a normal function to set the parameters
# Note at this point the function does not get executed since it
# is of "delayed" type
x([1,2,3,4,5])

# To force execution and get the result call `compute()`
print(x.compute())

library(tiledbcloud)

# Wrap median in a delayed object
x = delayed(median)

# You can set the parameters.  Note at this point the function does not
# get executed since it # is of "delayed" type.
delayed_args(x) <- list(c(1,2,3,4,5))

# To force execution and get the result call `compute()`
print(compute(x))

SQL and Arrays

Besides arbitrary Python/R functions, serverless SQL queries and array UDFs can also be called with the delayed API.

from tiledb.cloud.compute import DelayedSQL, DelayedArrayUDF
import numpy

# SQL
y = DelayedSQL("select AVG(`a`) FROM `tiledb://TileDB-Inc/quickstart_sparse`")

# Run query
print(y.compute())

# Array
z = DelayedArrayUDF("tiledb://TileDB-Inc/quickstart_sparse",
        lambda x: numpy.average(x["a"]))([(1, 4), (1, 4)])

# Run the UDF on the array
z.compute()

library(tiledbcloud)

y = delayed_sql("select AVG(`a`) FROM `tiledb://TileDB-Inc/quickstart_sparse`")

# Run query
print(compute(y))

# Array
z = delayed_array_udf(
  "tiledb://TileDB-Inc/quickstart_sparse",
  function(x) mean(x[["a"]]),
  selectedRanges=list(cbind(1,4), cbind(1,4)),
  attrs=c("a")
)

# Run the UDF on the array
print(compute(z))

Local Functions

It is also possible to include a generic Python function as delayed, but have it run locally instead of serverless on TileDB Cloud. This is useful for testing or for saving finalized results to your local machine, e.g., saving an image.

from tiledb.cloud.compute import Delayed
import numpy

# Set the `local` argument to `True`
local = Delayed(numpy.median, local=True)([1,2,3])

# This will compute locally
local.compute()

library(tiledbcloud)

# Wrap median in a delayed object
x = delayed(median)

# You can set the parameters.  Note at this point the function does not
# get executed since it # is of "delayed" type.
delayed_args(x) <- list(c(1,2,3))

# To force execution and get the result call `compute()`
print(compute(x, force_all_local=TRUE))

Visualization

Any task graph created using the delayed API can be visualized with visualize(). The graph will be auto-updated by default as the computation progresses. If you wish to disable auto-updating, then simply set auto_update=False as a parameter to visualize(). If you are inside a Jupyter notebook, the graph will render as a widget. If you are not on the notebook, you can set notebook=False as a parameter to render in a normal Python window.

res.visualize()

Retrying Functions

If a function fails or you cancel it, you can manually retry the given node with the .retry method, or retry all failed nodes in a DAG with .retry_all(). Each retry call retries a node once.

from tiledb.cloud.compute import Delayed

# Retry one node:

flaky_node = Delayed(flaky_func)(my_data)
final_output = Delayed(process_output)(flaky_node)

data = final_output.compute()
# -> Raises an exception since flaky_node failed.

flaky_node.retry()
data = final_output.result()

# Retry many nodes:

flaky_inputs = [Delayed(flaky_func)(shard) for shard in input_shards]
combined = Delayed(merge_outputs)(flaky_inputs)

combined.compute()
# -> Raises an exception since some of the flaky inputs failed.

combined.dag.retry_all()
combined.dag.wait()

data = combined.result()

Canceling Task Graph

If you have a task graph that is running, you can cancel it with the the .cancel() function on the dag or delayed object.

import tiledb.cloud

def my_func():
  import time
  time.sleep(120)
  return

batch_dag = tiledb.cloud.dag.DAG(mode=tiledb.cloud.dag.Mode.BATCH)
result = batch_dag.submit(my_func)
# Start task graph
batch_dag.compute()


# Cancel Task Graph
batch_dag.cancel()

Advanced Usage

Manually Setting Delayed Task Dependencies

There are cases where you might have one function to depend on another without using its results directly. A common case is when one function manipulates data stored somewhere else (on S3 or a database). To facilitate this, we provide function depends_on.

# A few base functions:
import random
from tiledb.cloud.compute import Delayed

# Set three initial nodes
node_1 = Delayed(numpy.median, local=True, name="node_1")([1, 2, 3])
node_2 = Delayed(lambda x: x * 2, local=True, name="node_2")(node_1)
node_3 = Delayed(lambda x: x * 2, local=True, name="node_3")(node_2)

# Create a dictionary to hold the nodes so we can ranodmly pick dependencies
nodes_by_name= {'node_1': node_1, 'node_2': node_2, 'node_3': node_3}

#Function which sleeps for some time so we can see the graph in different states
def f():
    import time
    import random
    time.sleep(random.randrange(0, 30))
    return x

# Randomly add 96 other nodes to the graph. All of these will use the sleep function
for i in range(4, 100):
    name = "node_{}".format(i)
    node = Delayed(f, local=True, name=name)()
    
    dep = random.randrange(1, i-1)
    # Randomly set dependency on one other node
    node_dep = nodes_by_name["node_{}".format(dep)]
    # Force the dependency to be set
    node.depends_on(node_dep)
    
    nodes_by_name[name] = node

# You can call visualize on any member node and see the whole graph
node_1.visualize()

# Get the last function's results
node_99 = nodes_by_name["node_99"]
node_99.compute()

The above code, after the call to node_1.visualize(), produces a task graph similar to that shown below:

Low Level Task Graph API

A lower level Task Graph API is provided which gives full control of building out arbitrary task graphs.

from tiledb.cloud.dag import dag
import numpy as np

# This is the same implementation which backs `Delayed`, but this interface
# is better suited more advanced use cases where full control is desired.
graph = dag.DAG()

# Define a graph 
# Note that package numpy is aliased as np in the UDFs
local = graph.submit(lambda x: x * 2, local=True, 100)
array_apply = graph.submit_array_udf(lambda x: np.sum(x["a"]),
       "tiledb://TileDB-Inc/quickstart_sparse", 
       name="array_apply", ranges=[(1, 4), (1, 4)])
sql = graph.submit_sql("select SUM(`a`) as a from `tiledb://TileDB-Inc/quickstart_dense`"), name="sql")

# Custom function for averaging all the results we are passing in
def mean(local, array_apply, sql):
    return np.mean([local, array_apply, sql.iloc(0)[0]])

# This is essentially a task graph that looks like
#                 mean
#          /       |      \
#         /        |       \    
#      local  array_apply  sql 
#
# The `local`, `array_apply` and `sql` tasks will computed first,
# and once all three are finished, `mean` will computed on their results
res = graph.submit(func_exec=mean, name="node_exec", local=local, array_apply=array_apply, sql=sql)
graph.compute()
graph.wait()

print(res.result())

Selecting Who To Charge

If you are a member of an organization, then by default the organization is changed for your Delayed tasks. If you would like to charge the task to yourself, you just need to add one extra argument namespace.

import tiledb.cloud

tiledb.cloud.login(username="my_username", password="my_password")
# or tiledb.cloud.login(token="my_token")

res = DelayedSQL("select `rows`, AVG(a) as avg_a from `tiledb://TileDB-Inc/quickstart_dense` GROUP BY `rows`"
      namespace ="my_username", # who to charge the query to
      )

# When using the Task Graph API set the namespace on the DAG object
dag = tiledb.cloud.dag.DAG(namespace="my_username")
dag.submit_sql("select `rows`, AVG(a) as avg_a from `tiledb://TileDB-Inc/quickstart_dense` GROUP BY `rows`")

library (tiledbcloud)

tiledbcloud::login(username="my_username", password="my_password")
# or tiledbcloud::login(token="my_token")
     
res <- delayed_sql(
    query="select `rows`, AVG(a) as avg_a from `tiledb://TileDB-Inc/quickstart_dense` GROUP BY `rows`",
    namespace=namespace)
out <- compute(res, namespace) # You can also put the namespace here
str(out)

You can also set who to charge for the entire task graph instead of individual Delayed objects. This is often useful when building a large task graph, to avoid having to set the extra parameter on every object. Taking the example above, you just pass namespace="my_username" to the compute call.

import tiledb.cloud.compute
import numpy

# Build several delayed objects to define a graph
local = Delayed(lambda x: x * 2, local=True)(100)
array_apply = DelayedArrayUDF("tiledb://TileDB-Inc/quickstart_sparse", 
        lambda x: numpy.sum(x["a"]), name="array_apply")([(1, 4), (1, 4)])
sql = DelayedSQL("select SUM(`a`) as a from `tiledb://TileDB-Inc/quickstart_dense`"), name="sql")

# Custom function to use to average all the results we are passing in
def mean(local, array_apply, sql):
    import numpy
    return numpy.mean([local, array_apply, sql.iloc(0)[0]])

res = Delayed(func_exec=mean, name="node_exec")(local, array_apply, sql)

# Set all tasks to run under your username
print(res.compute(namespace="my_username"))

library(tiledbcloud)

# Build several delayed objects to define a graph

# Locally executed; simple enough
local = delayed(function(x) { x*2 }, local=TRUE)
delayed_args(local) <- list(100)

# Array UDF -- we specify selected ranges and attributes, then do some R on the
# dataframe which the UDF receives
array_apply <- delayed_array_udf(
  array="TileDB-Inc/quickstart_dense",
  udf=function(df) { sum(as.vector(df[["a"]])) },
  selectedRanges=list(cbind(1,4), cbind(1,4)),
  attrs=c("a")
)

# SQL -- note the output is a dataframe, and values are all strings (MariaDB
# "decimal values") so we'll cast them to numeric later
sql = delayed_sql(
  "select SUM(`a`) as a from `tiledb://TileDB-Inc/quickstart_dense`",
  name="sql"
)

# Custom function for averaging all the results we are passing in
ourmean <- function(local, array_apply, sql) {
    mean(c(local, array_apply, sql))
}

# This is essentially a task graph that looks like
#               ourmean
#          /       |      \
#         /        |       \    
#      local  array_apply  sql 
#
# The `local`, `array_apply` and `sql` tasks will computed first,
# and once all three are finished, `ourmean` will computed on their results.
# Note here we slot out the ansswer from the SQL dataframe using `[[...]]`,
# and also cast to numeric.
res <- delayed(ourmean, args=list(local, array_apply, as.numeric(sql[["a"]])))
print(compute(res, namespace=namespace, verbose=TRUE))

Access Object Stores

Batch task graphs support the ability to use an registered access credential inside of task to provide access to an object store. This is commonly used for ingestion and exporting. TileDB Cloud supports allowing the use of AWS IAM roles or Azure SAS tokens for access. Your administrator needs to explicitly enable "allow in batch tasks" on the credential.

# Create batch dag
dag = tiledb.cloud.dag.DAG(mode=tiledb.cloud.dag.Mode.BATCH)

# Submit function with role to be assumed for task
dag.submit(numpy.median, access_credentials_name="my_role")

Controlling the Number of Realtime Workers

Realtime Task Graphs are driven by the client. The client dispatches each task as a separate request and potentially will fetch and return results. These requests are all in parallel and the maximum number of requests is controlled by defining how many threads are allowed to execute. This defaults to min(32, os.cpu_count() + 4) in python. A function is provided to global configure this and allow a larger number of parallel requests and downloading of results to the client.

tiledb.cloud.client.client.set_threads(100)

Resource Specification

Batch task graphs allow you to specify resource requirements for CPU, Memory and GPUs for every individual task. In TileDB Cloud SaaS, GPUs leverage Nvidia V100 GPUs.

Resources can be passed directly to any of the Delayed or Task Graph submission APIs.

Delayed API

Delayed(numpy.median, mode=tiledb.cloud.dag.Mode.BATCH, resources={"cpu": "6", "memory": "12Gi", "gpus": "0"}

delayed(median, mode=tiledbcloud::BATCH, resources={"cpu": "6", "memory": "12Gi", "gpus": "0"})

Task Graph API

# Create batch dag
dag = tiledb.cloud.dag.DAG(mode=tiledb.cloud.dag.Mode.BATCH)

# Submit function specifying resoures
dag.submit(numpy.median, resources={"cpu": "6", "memory": "12Gi", "gpus": "0"})

# Create batch dag
dag = tiledbcloud::DAG(mode=tiledbcloud::BATCH)

# Submit function specifying resoures
submit(dag, median, resources={"cpu": "6", "memory": "12Gi", "gpus": "0"})

Utilities

The TileDB Cloud client offers several useful utilities. To use them, you must have the client installed (see Installation).

TileDB Cloud allows you to login (with your username/password or API token) in a way such that the session token can be cached to avoid logging in again for every program execution. This is done as follows:

# With username/password
tiledb.cloud.login(username='xxx', password='xxx')

# Or, with token
tiledb.cloud.login(token='xxx')

# With username/password
tiledbcloud::login(username='xxx', password='xxx')

# Or, with token
tiledbcloud::login(api_key='xxx')

// Use the api token for the Java client. You can leave username and password as null.
TileDBClient tileDBClient = new TileDBClient(
        new TileDBLogin("username",
                "password",
                "<TILEDB_API_TOKEN>",
                true,
                true,
                true));

After logging in for the first time, the TileDB Cloud client will store a session token in configuration file $HOME/.tiledb/cloud.jsoncreated in your home directory.

Retry Settings

The TileDB Cloud clients have the ability to retry failed HTTP requests automatically. By default this is enabled for retrying when TileDB Cloud indicates there is not enough capacity for the request (HTTP 503 errors). For convenience we also offer the ability to disable retries or to enable more forceful retry settings.

Built in modes

# Set default retry for only retrying "not enough capacity" responses
tiledb.cloud.client.client.retry_mode("default")

# Set do not retry any requests
tiledb.cloud.client.client.retry_mode("disabled")

# Retry for a large number of scenarios
tiledb.cloud.client.client.retry_mode("forceful")

In "forceful" mode it is possible that the client might retry requests which will always fail, such as when there is a syntax error in a SQL query. This mode should be used with care to avoid increased costs from retrying.

All built in modes (besides disabled) will retry a request up to 10 times.

Custom Retry Logic

It is also possible to manually set retry conditions to suite your needs.

from urllib3 import Retry

# Set the retries to a urllib3 retry object
tiledb.cloud.config.config.retries=Retry(
        total=10,
        backoff_factor=0.25,
        status_forcelist=[400, 500, 501, 502, 503],
        allowed_methods=["HEAD","GET","PUT","DELETE","OPTIONS","TRACE","POST","PATCH",],
        raise_on_status=False,
        remove_headers_on_redirect=[],
    )
    
# After updating the config make sure to update the package level clients
tiledb.cloud.client.client.update_clients()

Context and Config

There are two helper functions that allow to easily create a tiledb config or context that has the proper configuration needed for slicing arrays through TileDB Cloud.

# Create a TileDB Config object with `rest.token` set from the login
cfg = tiledb.cloud.Config()

# Create a TileDB Context which has a config with `rest.token` set from the login
ctx = tiledb.cloud.Ctx()

# Create a TileDB Config object with `rest.token` set from the login
config <- tiledb_config()

# Create a TileDB Context which has a config with `rest.token` set from the login
ctx <- tiledb_ctx(config)

Viewing the User Profile

You can see your user profile as follows:

prof = tiledb.cloud.user_profile()
print(prof)

prof <- tiledbcloud::user_profile()
print(prof)

UserApi apiInstance = new UserApi(defaultClient);
try {
    User result = apiInstance.getUser();
    System.out.println(result);
} catch (ApiException e) {
    System.err.println("Exception when calling UserApi#getUser");
    System.err.println("Status code: " + e.getCode());
    System.err.println("Reason: " + e.getResponseBody());
    System.err.println("Response headers: " + e.getResponseHeaders());
    e.printStackTrace();
}

Listing Arrays

You can list arrays from the cloud service, passing a variety of filters:

# List all arrays you own
owned_arrays = tiledb.cloud.list_arrays()

print(owned_arrays)

# List all arrays that are shared with you
shared_arrays = tiledb.cloud.list_shared_arrays()
print(shared_arrays)

# List all public arrays
public_arrays = tiledb.cloud.list_public_arrays()
print(public_arrays)

# List arrays in a specific namespace
tiledb_inc_arrays = tiledb.cloud.list_arrays(namespace="TileDB-Inc")
print(tiledb_inc_arrays)

# Filter arrays to only those you have read and write permissions to
rw_arrays = tiledb.cloud.list_arrays(permissions=["read", "write"])
print(rw_arrays)

# You can combine filters
arrays = tiledb.cloud.list_arrays(namespace="TileDB-Inc", permissions=["read"])
print(arrays)

# Search keywords
arrays = tiledb.cloud.list_public_arrays(namespace="TileDB-Inc", search="nyc")
print(arrays)

# List specific asset types that are based upon Arrays. See also Listing Groups
# Available asset types:
# FileType.FILE
# FileType.USER_DEFINED_FUNCTION
# FileType.REGISTERED_TASK_GRAPH
# FileType.ML_MODEL
# FileType.NOTEBOOK
assets = tiledb.cloud.list_public_arrays(namespace="TileDB-Inc", file_type=[tiledb.cloud.rest_api.FileType.NOTEBOOK, tiledb.cloud.rest_api.FileType.ML_MODEL])
print(assets)

# List all arrays you own
owned_arrays <- tiledbcloud::list_arrays()
str(owned_arrays)

# List all arrays that are shared with you
shared_arrays <- tiledbcloud::list_arrays(shared=TRUE)
str(shared_arrays)

# List all public arrays
public_arrays <- tiledbcloud::list_arrays(public=TRUE)
str(public_arrays)

# List arrays in a specific namespace
tiledb_inc_arrays = tiledbcloud::list_arrays(namespace="TileDB-Inc")
str(tiledb_inc_arrays)

String namespace = "<TILEDB_NAMESPACE>"; // String | namespace array is in (an organization name or user's username)
try {
    //get arrays in namespace
    List<ArrayInfo> result = apiInstance.getArraysInNamespace(namespace);
} catch (ApiException e) {
    System.err.println("Status code: " + e.getCode());
    System.err.println("Reason: " + e.getResponseBody());
    System.err.println("Response headers: " + e.getResponseHeaders());
    e.printStackTrace();
}

Getting Array Information

You can run the following to get basic information about the array, such as its description:

info = tiledb.cloud.info("tiledb://TileDB-Inc/quickstart_sparse")
print(info)

info <- tiledbcloud::array_info(namespace="TileDB-Inc", arrayname="quickstart_sparse")
str(info)

String namespace = "TileDB-Inc"; // String | namespace array is in (an organization name or user's username)
String array = "quickstart_sparse"; // String | name/uri of array that is url-encoded
try {
    ArrayInfo result = apiInstance.getArrayMetadata(namespace, array);
    System.out.println(result);
} catch (ApiException e) {
    System.err.println("Exception when calling ArrayApi#getArrayMetadata");
    System.err.println("Status code: " + e.getCode());
    System.err.println("Reason: " + e.getResponseBody());
    System.err.println("Response headers: " + e.getResponseHeaders());
    e.printStackTrace();
}

Array Activity

Array activity can be fetched programmatically as follows:

activity = tiledb.cloud.array_activity("tiledb://TileDB-Inc/quickstart_sparse")
print(activity)

String namespace = "<NAMESAPCE>"; // String | namespace array is in (an organization name or user's username)
String array = "<ARRAY_NAME>"; // String | name/uri of array that is url-encoded
Integer start = null; // Integer | Start time of window of fetch logs, unix epoch in seconds (default: seven days ago)
Integer end = null; // Integer | End time of window of fetch logs, unix epoch in seconds (default: current utc timestamp)
String eventTypes = null; // String | Event values can be one or more of the following read, write, create, delete, register, deregister, comma separated
String taskId = null; // String | Array task ID To filter activity to
Boolean hasTaskId = false; // Boolean | Excludes activity log results that do not contain an array task UUID
try {
    List<ArrayActivityLog> result = apiInstance.arrayActivityLog(namespace, array, start, end, eventTypes, taskId, hasTaskId);
    System.out.println(result);
} catch (ApiException e) {
    System.err.println("Exception when calling ArrayApi#arrayActivityLog");
    System.err.println("Status code: " + e.getCode());
    System.err.println("Reason: " + e.getResponseBody());
    System.err.println("Response headers: " + e.getResponseHeaders());
    e.printStackTrace();
}

Listing Tasks

You can list tasks from the cloud service, passing a variety of filters:

# List all tasks
all_tasks = tiledb.cloud.fetch_tasks()
print(all_tasks)

# List only tasks on a specific array
array_tasks = tiledb.cloud.fetch_tasks(array="tiledb://TileDB-Inc/quickstart_sparse")
print(array_tasks)

# Lists tasks within a specific time period
import datetime
ninety_days_ago = datetime.datetime.utcnow() - datetime.timedelta(days=90)
datetime_tasks = tiledb.cloud.fetch_tasks(
    array="tiledb://TileDB-Inc/quickstart_sparse",
    start=ninety_days_ago
)
print(datetime_tasks)

# Filter tasks by status, valid statuses are RUNNING, FAILED, COMPLETED
running_tasks = tiledb.cloud.fetch_tasks(status="RUNNING")
print(running_tasks)

TasksApi apiInstance = new TasksApi(defaultClient);
String namespace = "<NAMESPACE"; // String | namespace to filter
String createdBy = null; // String | username to filter
String array = "<ARRAY_URI>"; // String | name/uri of array that is url-encoded to filter
Integer start = null; // Integer | start time for tasks to filter by
Integer end = null; // Integer | end time for tasks to filter by
Integer page = null; // Integer | pagination offset
Integer perPage = null; // Integer | pagination limit
String type = null; // String | task type, \"QUERY\", \"SQL\", \"UDF\", \"GENERIC_UDF\"
List<String> excludeType = Arrays.asList(); // List<String> | task_type to exclude matching array in results, more than one can be included
List<String> fileType = Arrays.asList(); // List<String> | match file_type of task array, more than one can be included
List<String> excludeFileType = Arrays.asList(); // List<String> | exclude file_type of task arrays, more than one can be included
String status = null; // String | Filter to only return these statuses
String search = null; // String | search string that will look at name, namespace or description fields
String orderby = null; // String | sort by which field valid values include start_time, name
try {
    ArrayTaskData result = apiInstance.tasksGet(namespace, createdBy, array, start, end, page, perPage, type, excludeType, fileType, excludeFileType, status, search, orderby);
    System.out.println(result);
} catch (ApiException e) {
    System.err.println("Exception when calling TasksApi#tasksGet");
    System.err.println("Status code: " + e.getCode());
    System.err.println("Reason: " + e.getResponseBody());
    System.err.println("Response headers: " + e.getResponseHeaders());
    e.printStackTrace();
}

For convenience, you can also see the last SQL or UDF task:

# Get last SQL task
tiledb.cloud.last_sql_task()


# Get last UDF task
tiledb.cloud.last_udf_task()

Or you can get a specific task with a given task ID (which can be found on the UI console):

task = tiledb.cloud.task(id='xxx')

Registering an Array

In addition to registering S3-stored TileDB arrays with TileDB cloud via the console, you can also do it programmatically as follows:

tiledb.cloud.register_array(uri="s3://mybucket/myarray",
                namespace="user1", # Optional, you may register it under your username, or one of your organizations
                array_name="myarray",
                description=None,  # Optional 
                access_credentials_name="myCredentials") # You must have already added your AWS credentials on the console

String namespace = "<NAMESPACE>"; // String | namespace array is in (an organization name or user's username)
String array = "s3://<S3_BUCKET>/<ARRAY_NAME>"; // String | name/uri of s3 array that is url-encoded
ArrayInfoUpdate arrayMetadata = new ArrayInfoUpdate(); // ArrayInfoUpdate | metadata associated with array
arrayMetadata.setUri("s3://<S3_BUCKET>/<ARRAY_NAME>");
arrayMetadata.setName("<ARRAY_NAME>");
try {
    ArrayInfo result = arrayApi.registerArray(namespace, array, arrayMetadata);
    System.out.println(result);
} catch (ApiException e) {
    System.err.println("Exception when calling ArrayApi#registerArray");
    System.err.println("Status code: " + e.getCode());
    System.err.println("Reason: " + e.getResponseBody());
    System.err.println("Response headers: " + e.getResponseHeaders());
    e.printStackTrace();
}

Deregistering an Array

You can deregister an array as follows:

tiledb.cloud.deregister_array("tiledb://user1/myarray")

ArrayApi apiInstance = new ArrayApi(defaultClient);
String namespace = "<NAMESPACE>"; // String | namespace array is in (an organization name or user's username)
String array = "<ARRAY_NAME>"; // String | name/uri of array that is url-encoded
try {
  apiInstance.deregisterArray(namespace, array);
} catch (ApiException e) {
  System.err.println("Exception when calling ArrayApi#deregisterArray");
  System.err.println("Status code: " + e.getCode());
  System.err.println("Reason: " + e.getResponseBody());
  System.err.println("Response headers: " + e.getResponseHeaders());
  e.printStackTrace();
}

Deregistering an array will not physically delete it.

You can programmatically share a registered array, "unshare" a registered array (i.e., revoke access) and list array sharing information as follows:

# Share an array with both read and write permissions with a user
tiledb.cloud.share_array(uri="tiledb://user1/myarray",
                         namespace="user1", # The user to share the array with
                         permissions=["read", "write"])

# Revoke access to an array for a particular user                         
tiledb.cloud.unshare_array(uri="tiledb://user1/myarray", namespace="user1")                      

# Get sharing information about an array
shared_with = tiledb.cloud.list_shared_with("tiledb://user1/myarray")
print(shared_with)

ArrayApi apiInstance = new ArrayApi(defaultClient);
String namespace = "namespace_example"; // String | namespace array is in (an organization name or user's username) to share the array with.
String array = "array_example"; // String | name/uri of array that is url-encoded
ArraySharing arraySharing = new ArraySharing(); // ArraySharing | Namespace and list of permissions to share with. An empty list of permissions will remove the namespace; if permissions already exist they will be deleted then new ones added. In the event of a failure, the new policies will be rolled back to prevent partial policies, and it's likely the array will not be shared with the namespace at all.
arraySharing.addActionsItem(ArrayActions.READ); //enable read permissions.
try {
  //share an array with read persmissions.
  apiInstance.shareArray(namespace, array, arraySharing);
  
  //to unshare an array use an empy arraySharing object
  arraySharing = null;
  apiInstance.shareArray(namespace, array, arraySharing);
  
  //Get sharing information about an array
  List<ArraySharing> result = apiInstance.getArraySharingPolicies(namespace, array);
    
} catch (ApiException e) {
  System.err.println("Status code: " + e.getCode());
  System.err.println("Reason: " + e.getResponseBody());
  System.err.println("Response headers: " + e.getResponseHeaders());
  e.printStackTrace();
}

Invite to Array

Similar to Sharing Arrays, you can invite users to an array as follows:

tiledb.cloud.invites.invite_to_array(
    "tiledb://user1/myarray",
    recipients=[<tiledb-username1>, <tiledb-username2>, <email>, ...],
    actions=["READ", "WRITE", ...]
)

recipients can include any combination of TileDB usernames and email addresses.
actions allowed values are: READ, WRITE, EDIT, READ_ARRAY_LOGS, READ_ARRAY_INFO, READ_ARRAY_SCHEMA

You can cancel an invitation to an array as follows:

tiledb.cloud.invites.cancel_invite_to_array(
    "tiledb://user1/myarray", 
    invitation_id=<invitation_id>
)

invitation_id can be retrieved using List invitations

Region Redirection

When accessing an array or group via the API, your request will be automatically routed to the instance closest to the data. If you already know the region, a compute region can be accessed directly with a configured parameter to manually bypass automatic redirection. Manually specifying the region can be helpful if you want to avoid the slight increase in latency that the redirection adds.

To access a region directly the domain is of the scheme: <region>.aws.api.tiledb.com

The five domains we currently support are:

us-east-1.aws.api.tiledb.com
us-west-2.aws.api.tiledb.com
eu-west-1.aws.api.tiledb.com
eu-west-2.aws.api.tiledb.com
ap-southeast-1.aws.api.tiledb.com

You can manually set the domain to send a request directly to a region as follows:

import tiledb, tiledb.sql
import pandas

# Create the configuration parameters
config = tiledb.Config()
config["rest.username"] = "xxx"
config["rest.password"] = "yyy"
# or, more preferably, config["rest.token"] = "my_token"

# Manually set the server address to the redirection URL
config["rest.server_address"] = "https://eu-west-2.aws.api.tiledb.com"

# This is the array URI format in TileDB Cloud
array_name = "tiledb://TileDB-Inc/quickstart_sparse-eu-west-2"

# Write code exactly as in TileDB Developer
with tiledb.open(array_name, 'r', ctx=tiledb.Ctx(config)) as A:
    print (A.df[:])
    
# Using embedded SQL, you need to pass the username/password 
# as config parameters as well as the server address in `init_command`
db = tiledb.sql.connect(db="test",
        init_command="set mytile_tiledb_config='rest.username=xxx,rest.password=xxx,rest.server_address=https://eu-west-2.aws.api.tiledb.com'")
pandas.read_sql(sql="select * from `tiledb://TileDB-Inc/quickstart_sparse-eu-west-2`", con=db)

import tiledb, tiledb.cloud, numpy

def median(numpy_ordered_dictionary):
  return numpy.median(numpy_ordered_dictionary["a"])

tiledb.cloud.login(username="xxx", password="yyy", host="https://eu-west-2.aws.api.tiledb.com")
# or tiledb.cloud.login(token="my_token", host="https://eu-west-2.aws.api.tiledb.com")

with tiledb.open("tiledb://TileDB-Inc/quickstart_sparse-eu-west-2", ctx=tiledb.cloud.Ctx()) as A:
    # apply on subarray [1,2]x[1,2]
    res = A.apply(median, [(1,2), (1,2)], attrs = ["a"])
    print(res)

//set the config
Config config = new Config();
config.set("rest.username", "xxx");
config.set("rest.password", "yyy");
config.set("rest.server_address", "https://eu-west-2.aws.api.tiledb.com")
Context ctx = new Context(config);

//open the array as you would normally do in the Java API.
Array array = new Array(ctx, "tiledb://TileDB-Inc/quickstart_sparse-eu-west-2");

//print the array schema
System.out.println(array.getSchema());

Files

TileDB Cloud has the ability to convert files to and from the TileDB File representation. This allows you to store any arbitrary file as a 1 dimensions dense array. Importing and exporting to and from the original file format is supported directly through TileDB Cloud. The file-arrays can be stored on an object store, such as S3, directly.

# Import from s3 to a TileDB array,
# automatically registering it with TileDB Cloud
tiledb.cloud.files.utils.create_file(
    namespace="my_organization",
    name="my_file", # optional name to set for registered file
    input_uri="s3://my_bucket/files/my_file.pdf",
    output_uri="s3://my_bucket/files/arrays/my_file"
)

# Export back to S3 in the original format
# The export happens completely in TileDB cloud
tiledb.cloud.files.utils.export_file(
    uri="tiledb://my_organization/my_file",
    output_uri="s3://my_bucket/files/arrays/my_file"
)

# Export back to local filesystem in the original format
tiledb.cloud.files.utils.export_file(
    uri="tiledb://my_organization/my_file",
    output_uri="my_file_exported.pdf",
)

//Use the TileDB-Java package

// Set up the config with your TileDB-Cloud credentials
Config config = new Config();
// For s3 access
config.set("vfs.s3.aws_access_key_id", "<ID>");
config.set("vfs.s3.aws_secret_access_key", "<KEY>");

// For TileDB-Cloud access.
// You can either use rest.username and rest.password
config.set("rest.username", "<USERNAME>");
config.set("rest.password", "<PASSWORD>");

// Or rest.token
config.set("rest.token", "<TOKEN>");

Context ctx = new Context(config);

// Create the array schema of an array based on the file to be saved
ArraySchema arraySchema = FileStore.schemaCreate(ctx, "<FILENAME>");

// Create a TileDB array with the schema
Array.create("tiledb://<NAMESPACE_NAME>/s3://<BUCKET_NAME>/<ARRAY_NAME>", arraySchema);

// Import the file to be saved to the TileDB array
FileStore.uriImport(
    ctx,
    "tiledb://<NAMESPACE_NAME>/<ARRAY_NAME>",
    "<FILENAME>",
    MimeType.TILEDB_MIME_AUTODETECT);

// Export/download the file from TileDB and save it with a given name.
FileStore.uriExport(ctx, "tiledb://<NAMESPACE_NAME>/<ARRAY_NAME>", "<OUTPUT_FILENAME>");

Registering a Group

In addition to registering S3-stored TileDB groups with TileDB cloud via the console, you can also do it programmatically as follows:

tiledb.cloud.groups.register("s3://mybucket/mygroup",
                namespace="user1", # Optional, you may register it under your username, or one of your organizations
                name="mygroup",
                description=None,  # Optional 
                credentials_name="myCredentials") # You must have already added your AWS credentials on the console

String namespace = "<NAMESPACE>"; // String | namespace array is in (an organization name or user's username)
String array = "s3://<S3_BUCKET>/<ARRAY_NAME>"; // String | name/uri of s3 array that is url-encoded
GroupInfoUpdate groupInfo = new GroupInfoUpdate(); // ArrayInfoUpdate | metadata associated with array
groupInfo.setUri("s3://<S3_BUCKET>/<ARRAY_NAME>");
groupInfo.setName("<ARRAY_NAME>");
try {
    GroupInfo result = arrayApi.registerGroup(namespace, array, groupInfo);
    System.out.println(result);
} catch (ApiException e) {
    System.err.println("Exception when calling ArrayApi#registerArray");
    System.err.println("Status code: " + e.getCode());
    System.err.println("Reason: " + e.getResponseBody());
    System.err.println("Response headers: " + e.getResponseHeaders());
    e.printStackTrace();
}

Deregistering a Group

You can deregister an group as follows:

tiledb.cloud.groups.deregister("tiledb://user1/mygroup", recursive=False)

GroupApi apiInstance = new GroupApi(defaultClient);
String namespace = "<NAMESPACE>"; // String | namespace array is in (an organization name or user's username)
String array = "<ARRAY_NAME>"; // String | name/uri of array that is url-encoded
try {
  apiInstance.deregisterGroup(namespace, array);
} catch (ApiException e) {
  System.err.println("Exception when calling GroupApi#deregisterGroup");
  System.err.println("Status code: " + e.getCode());
  System.err.println("Reason: " + e.getResponseBody());
  System.err.println("Response headers: " + e.getResponseHeaders());
  e.printStackTrace();
}

Deregistering a group will not physically delete it.

Listing Group

You can list arrays from the cloud service, passing a variety of filters:

# List all groups you own
owned_groups = tiledb.cloud.list_groups()

print(owned_groups)

# List all arrays that are shared with you
shared_arrays = tiledb.cloud.list_shared_groups()
print(shared_groups)

# List all public groups
public_groups = tiledb.cloud.list_public_groups()
print(public_groups)

# List arrays in a specific namespace
tiledb_inc_groups = tiledb.cloud.list_groups(namespace="TileDB-Inc")
print(tiledb_inc_groups)

# Search keywords
groups = tiledb.cloud.list_public_groups(namespace="TileDB-Inc", search="dragen")
print(groups)

# You can combine filters
groups = tiledb.cloud.list_public_groups(namespace="TileDB-Inc", tag="genomics", search="dragen")
print(groups)

# List specific asset types that are based upon Groups. See also Listing Arrays
# Available asset types:
# GroupType.BIOIMG
# GroupType.SOMA
# GroupType.VCF
# GroupType.POINTCLOUD
# GroupType.RASTER
assets = tiledb.cloud.list_public_groups(namespace="TileDB-Inc", group_type=tiledb.cloud.rest_api.GroupType.SOMA)
print(assets)

Getting Group Information

You can run the following to get basic information about the array, such as its description:

info = tiledb.cloud.groups.info("tiledb://TileDB-Inc/vcf-1kghicov-dragen-v376")
print(info)

Invite to a Group

You can invite users to a group as follows:

tiledb.cloud.invites.invite_to_group(
    "tiledb://user1/mygroup", 
    recipients=[<tiledb-username1>, <tiledb-username2>, <email>, ...],
    array_actions=["READ", ...],
    group_actions=["WRITE", ...]
)

recipients can include any combination of TileDB usernames and email addresses.
array_actions allowed values are: READ, WRITE, EDIT, READ_ARRAY_LOGS, READ_ARRAY_INFO, READ_ARRAY_SCHEMA
group_actions allowed values are: READ, WRITE, EDIT

You can cancel an invitation to a group as follows:

tiledb.cloud.invites.cancel_invite_to_group(
    "tiledb://user1/mygroup", 
    invitation_id=<invitation_id>
)

invitation_id can be retrieved using List invitations

Invite to Organization

You can invite users to an organization as follows:

tiledb.cloud.invites.invite_to_organization(
    "Organization_Name", 
    recipients=[<tiledb-username1>, <tiledb-username2>, <email>, ...],
    role="READ_WRITE"
)

recipients any combination of TileDB usernames and email addresses.
role can be one of the following values: OWNER, ADMIN, READ_WRITE, READ_ONLY

Accept an Invite

You can accept an invite by its ID as follows:

tiledb.cloud.invites.accept_invitation(<invitation_id>)

invitation_id can be retrieved using List invitations

List invitations

You can fetch a paginated list of invitations as follows:

tiledb.cloud.invites.fetch_invitations(**filters)

Available Filters

organization: name or ID of organization to filter
array: name/uri of array that is url-encoded to filter
group: name or ID of group to filter
start: start time for tasks to filter by
end: end time for tasks to filter by
page: pagination offset
per_page: pagination limit
type: invitation type, "ARRAY_SHARE", "JOIN_ORGANIZATION"
status: Filter to only return "PENDING", "ACCEPTED"
orderby: sort by which field valid values include

Serverless Array UDFs

Basic Usage

Below we show how to use Python/R UDFs in TileDB Cloud, with an example that computes the median on the values of attribute a on a slice of a 2D dense array.

For Python, you just need to write your function (median in this example) that takes as input an ordered numpy dictionary, i.e., in the form {"a" : <numpy-array>, "b" : <numpy-array>, ...}, where the keys are attribute or dimension names of the array you are querying. The reason is that this function will be applied on an array slice; recall that the Python API of TileDB returns an ordered dictionary of numpy arrays on each attribute and dimension upon a read. Then you just use the apply function of the TileDB Cloud client, which takes as input your function, a slice, and optionally a list of attributes (default is all attributes). Note that only the selected attributes must appear in the ordered dictionary that you provide as input to your function.

For R, the story is similar: you just need to write your function that takes a data frame as input.

For Java, you can run an existing Python or R UDF. In this case we use a UDF that scales the values of the input argument.

import tiledb, tiledb.cloud, numpy

def median(numpy_ordered_dictionary):
  return numpy.median(numpy_ordered_dictionary["a"])

tiledb.cloud.login(username="my_username", password="my_password")
# or tiledb.cloud.login(token="my_token")

with tiledb.open("tiledb://TileDB-Inc/quickstart_dense", ctx=tiledb.cloud.Ctx()) as A:
    # apply on subarray [1,2]x[1,2]
    res = A.apply(median, [(1,2), (1,2)], attrs = ["a"])
    print(res)

library(tiledbcloud)

myfunc <- function(df) {
  median(as.vector(df[["a"]]))
}

tiledbcloud::login(username="my_username", password="my_password")
# or tiledbcloud::login(token="my_token")

res <- tiledbcloud::execute_array_udf(
  array="TileDB-Inc/quickstart_dense",
  udf=myfunc,
  selectedRanges=list(cbind(1,2), cbind(1,2)),
  attrs=c("a")
)
print(res)

// Login by using a TileDBLogin object
TileDBClient tileDBClient = new TileDBClient(
new TileDBLogin(null,
        null,
        "<TILEDB_API_TOKEN>",
        true,
        true,
        true));
        
        
// Create a TileDBUDF object
TileDBUDF tileDBUDF = new TileDBUDF(tileDBClient, "TileDB-Inc");

// add the query ranges
ArrayList<BigDecimal> range1 = new ArrayList<>();
range1.add(BigDecimal.valueOf(1));
range1.add(BigDecimal.valueOf(4));

ArrayList<BigDecimal> range2 = new ArrayList<>();
range2.add(BigDecimal.valueOf(1));
range2.add(BigDecimal.valueOf(4));

QueryRanges queryRanges = new QueryRanges();
queryRanges.addRangesItem(range1);
queryRanges.addRangesItem(range2);

// add the arguments for the UDF
HashMap<String,Object> argumentsForArrayUDF = new HashMap<>();
argumentsForArrayUDF.put("attr", "rows");
argumentsForArrayUDF.put("scale", 9);

// Create an array udf
MultiArrayUDF multiArrayUDF = new MultiArrayUDF();
multiArrayUDF.setUdfInfoName("TileDB-Inc/array-udf");
multiArrayUDF.setRanges(queryRanges);

// execute. Could also use: executeSingleArrayArrow(), executeSingleArrayJSON(), executeSingleArrayJSONArray()
System.out.println(tileDBUDF.executeSingleArray(multiArrayUDF, argumentsForArrayUDF, "tiledb://TileDB-Inc/quickstart_sparse", "TileDB-Inc"));

All slices provided as input to theapply function are inclusive.

Multi-Index Usage

Multi-index queries are supported when applying an array UDF. You can pass any number of tuples or slices using a list of lists syntax (one per dimension).

import tiledb, tiledb.cloud, numpy

def median(numpy_ordered_dictionary):
  return numpy.median(numpy_ordered_dictionary["a"])

tiledb.cloud.login(username="my_username", password="my_password")
# or tiledb.cloud.login(token="my_token")

with tiledb.open("tiledb://TileDB-Inc/quickstart_dense", ctx=tiledb.cloud.Ctx()) as A:
    # apply on subarrays [1,2]x[1,4] and [4,4]x[1,4]
    res = A.apply(median, [[(1,2), 4], [slice(1,4)]], attrs = ["a"])
    print(res)

All slices provided as input to theapply function are inclusive.

Apply Without Opening The Array

To execute an array UDF, it is not always necessary to have the array opened locally. For Python, an alternative function to apply the UDF on an array URI is provided.

import tiledb, tiledb.cloud, numpy

def median(numpy_ordered_dictionary):
  return numpy.median(numpy_ordered_dictionary["a"])

tiledb.cloud.login(username="my_username", password="my_password")
# or tiledb.cloud.login(token="my_token")

# apply on subarrays [1,2]x[1,4] and [4,4]x[1,4]
res = tiledb.cloud.array.apply("tiledb://TileDB-Inc/quickstart_dense", median, [[(1,2), 4], [slice(1,4)]], attrs = ["a"])
print(res)

Asynchronous Execution

An asynchronous version of the array UDFs is available.

import tiledb, tiledb.cloud, numpy, random

def median(numpy_ordered_dictionary):
  return numpy.median(numpy_ordered_dictionary["a"])

tiledb.cloud.login(username="my_username", password="my_password")
# or tiledb.cloud.login(token="my_token")

with tiledb.open("tiledb://TileDB-Inc/quickstart_dense", ctx=tiledb.cloud.Ctx()) as A:
    # apply on subarrays [1,2]x[1,4] and [4,4]x[1,4]
    # res will be a future
    res = A.apply_async(median, [[(1,2), 4], [slice(1,4)]], attrs = ["a"])

    # call res.get() to block on the results
    print(res.get())

library(tiledbcloud)

myfunc <- function(df) {
  median(as.vector(df[["a"]]))
}

tiledbcloud::login(username="my_username", password="my_password")
# or tiledbcloud::login(token="my_token")

res <- delayed_array_udf(
  array="TileDB-Inc/quickstart_dense",
  udf=myfunc,
  selectedRanges=list(cbind(1,2), cbind(1,2)),
  attrs=c("a")
)

# call compute(res) to block on the results
o <- compute(res)
print(o)

Selecting Who to Charge

If you you are a member of an organization, then by default the organization is charged for your array UDF. If you would like to charge the array UDF to yourself, you just need to add an extra argument namespace.

import tiledb, tiledb.cloud, numpy

def median(numpy_ordered_dictionary):
  return numpy.median(numpy_ordered_dictionary["a"])

tiledb.cloud.login(username="my_username", password="my_password")
# or tiledb.cloud.login(token="my_token")

with tiledb.open("tiledb://TileDB-Inc/quickstart_dense", ctx=tiledb.cloud.Ctx()) as A:
    # apply on subarray [1,2]x[1,2]
    res = A.apply(median, [(1,2), (1,2)], attrs = ["a"], namespace="my_username")
    print(res)

library(tiledbcloud)

myfunc <- function(df) {
  median(as.vector(df[["a"]]))
}

tiledbcloud::login(username="my_username", password="my_password")
# or tiledbcloud::login(token="my_token")

result <- tiledbcloud::execute_array_udf(
  array="TileDB-Inc/quickstart_dense",
  udf=myfunc,
  selectedRanges=list(cbind(1,2), cbind(1,2)),
  attrs=c("a"),
  namespace="my_username"
)
print(resul

// Login by using a TileDBLogin object
TileDBClient tileDBClient = new TileDBClient(
new TileDBLogin(null,
        null,
        "<TILEDB_API_TOKEN>",
        true,
        true,
        true));
        
        
// Create a TileDBUDF object. The second param is the namespace to be charged
TileDBUDF tileDBUDF = new TileDBUDF(tileDBClient, "TileDB-Inc");

// add the query ranges
ArrayList<BigDecimal> range1 = new ArrayList<>();
range1.add(BigDecimal.valueOf(1));
range1.add(BigDecimal.valueOf(4));

ArrayList<BigDecimal> range2 = new ArrayList<>();
range2.add(BigDecimal.valueOf(1));
range2.add(BigDecimal.valueOf(4));

QueryRanges queryRanges = new QueryRanges();
queryRanges.addRangesItem(range1);
queryRanges.addRangesItem(range2);

// add the arguments for the UDF
HashMap<String,Object> argumentsForArrayUDF = new HashMap<>();
argumentsForArrayUDF.put("attr", "rows");
argumentsForArrayUDF.put("scale", 9);

// Create an array udf
MultiArrayUDF multiArrayUDF = new MultiArrayUDF();
multiArrayUDF.setUdfInfoName("TileDB-Inc/array-udf");
multiArrayUDF.setRanges(queryRanges);

// execute. Could also use: executeSingleArrayArrow(), executeSingleArrayJSON(), executeSingleArrayJSONArray()
System.out.println(tileDBUDF.executeSingleArray(multiArrayUDF, argu

Resource Classes

Each Array UDF runs by default in an isolated environment with 2 CPUs and 2 GB of memory. You can choose an alternative runtime environment from the following list:

Name

Description

standard

2 CPUs, 2 GB RAM

large

8 CPUs, 8 GB RAM

Charges are based on the total number of CPUs selected, not on actual use.

To run a array udf in a specific environment, set the resource_class parameter to the name of the environment.

import tiledb, tiledb.cloud, numpy

def median(numpy_ordered_dictionary):
  return numpy.median(numpy_ordered_dictionary["a"])

tiledb.cloud.login(username="my_username", password="my_password")
# or tiledb.cloud.login(token="my_token")

with tiledb.open("tiledb://TileDB-Inc/quickstart_dense", ctx=tiledb.cloud.Ctx()) as A:
    # apply on subarray [1,2]x[1,2]
    res = A.apply(median, [(1,2), (1,2)], attrs = ["a"], resource_class="large")
    print(res)

library(tiledbcloud)

myfunc <- function(df) {
  median(as.vector(df[["a"]]))
}

tiledbcloud::login(username="my_username", password="my_password")
# or tiledbcloud::login(token="my_token")

result <- tiledbcloud::execute_array_udf(
  array="TileDB-Inc/quickstart_dense",
  udf=myfunc,
  selectedRanges=list(cbind(1,2), cbind(1,2)),
  attrs=c("a"),
  resource_class="large"
)
print(result)

// Login by using a TileDBLogin object
TileDBClient tileDBClient = new TileDBClient(
new TileDBLogin(null,
        null,
        "<TILEDB_API_TOKEN>",
        true,
        true,
        true));
        
        
// Create a TileDBUDF object
TileDBUDF tileDBUDF = new TileDBUDF(tileDBClient, "TileDB-Inc");

// add the query ranges
ArrayList<BigDecimal> range1 = new ArrayList<>();
range1.add(BigDecimal.valueOf(1));
range1.add(BigDecimal.valueOf(4));

ArrayList<BigDecimal> range2 = new ArrayList<>();
range2.add(BigDecimal.valueOf(1));
range2.add(BigDecimal.valueOf(4));

QueryRanges queryRanges = new QueryRanges();
queryRanges.addRangesItem(range1);
queryRanges.addRangesItem(range2);

// add the arguments for the UDF
HashMap<String,Object> argumentsForArrayUDF = new HashMap<>();
argumentsForArrayUDF.put("attr", "rows");
argumentsForArrayUDF.put("scale", 9);

// Create an array udf
MultiArrayUDF multiArrayUDF = new MultiArrayUDF();
multiArrayUDF.setUdfInfoName("TileDB-Inc/array-udf");
multiArrayUDF.setRanges(queryRanges);

// Set resource class
multiArrayUDF.setResourceClass("large");

// execute. Could also use: executeSingleArrayArrow(), executeSingleArrayJSON(), executeSingleArrayJSONArray()
System.out.println(tileDBUDF.executeSingleArray(multiArrayUDF, argumentsForArrayUDF, "tiledb://TileDB-Inc/quickstart_sparse", "TileDB-Inc"));

Registering Array UDFs

You can register an array UDF (similar to arrays) as follows:

import tiledb, tiledb.cloud, numpy

def median(numpy_ordered_dictionary):
  return numpy.median(numpy_ordered_dictionary["a"])

tiledb.cloud.login(username="my_username", password="my_password")
# or tiledb.cloud.login(token="my_token")

tiledb.cloud.udf.register_single_array_udf(median, name="median_test", namespace="my_username")

library(tiledbcloud)

myfunc <- function(df) {
  median(as.vector(df[["a"]]))
}

tiledbcloud::login(username="my_username", password="my_password")
# or tiledbcloud::login(token="my_token")

tiledbcloud::register_udf(namespace="my_username", name="median_test", type="single_array", func=myfunc)

Currently, registering a UDF is only possible bia the Python or R client.

In order to be able to register a UDF you need to set up the default storage path for you and/or your organization.

Multi-Array UDFs

TileDB Cloud provides also multi-array UDFs, i.e., UDFs that are applied to more than one array.

import numpy as np
import tiledb
import tiledb.cloud

array_1 = "tiledb://TileDB-Inc/array_1"
array_2 = "tiledb://TileDB-Inc/array_2"

def median(numpy_ordered_dictionary_list):
  # When you have multiple arrays, the parameter 
  # we pass in is actually a list of ordered dictionaries. 
  # The list is in the order of the arrays you asked for.
    return (
        np.median(numpy_ordered_dictionary[0]["a"] + numpy_ordered_dictionary[1]["a"])
    )

# The following will create the list of array to take part
# in the multi-array UDF. Each has as input the array name,
# a multi-index for slicing and a list of attributes to subselect on.
array_list = tiledb.cloud.array.ArrayList()    
array_list.add(array_1, [(1, 4), (1, 4)], ["a"])
array_list.add(array_2, [(1, 2), (1, 4)], ["a"])

tiledb.cloud.login(username="my_username", password="my_password")
# or tiledb.cloud.login(token="my_token")

# This will execute `median` using as input the result of the
# slicing and subselection for each of the arrays in `array_list` 
res = tiledb.cloud.array.exec_multi_array_udf(median, array_list)

print("Median Multi-array UDF:\n{}\n".format(res))libr

library(tiledbcloud)

myfunc <- function(df1, df2) {
  median(as.vector(df1[["a"]])) + median(as.vector(df2[["a"]]))
}

# The following will create the list of array to take part in
# the multi-array UDF. Each has as input the array name, a
# multi-index for slicing and a list of attributes to subselect on.
details1 <- tiledbcloud::UDFArrayDetails$new(
  uri="tiledb://TileDB-Inc/quickstart_dense",
  ranges=QueryRanges$new(
    layout=Layout$new('row-major'),
    ranges=list(cbind(1,4),cbind(1,4))
  ),
  buffers=list("a")
)

details2 <- tiledbcloud::UDFArrayDetails$new(
  uri="tiledb://TileDB-Inc/quickstart_sparse",
  ranges=QueryRanges$new(
    layout=Layout$new('row-major'),
    ranges=list(cbind(1,2),cbind(1,4))
  ),
  buffers=list("a")
)

tiledbcloud::login(username="my_username", password="my_password")
# or tiledbcloud::login(token="my_token")

# This will execute `median` using as input the result of the
# slicing and subselection for each of the arrays in `array_list`
res <- tiledbcloud::execute_multi_array_udf(
  array_list=list(details1, details2),
  udf=myfunc,
)
print(res)

// Login by using a TileDBLogin object
TileDBClient tileDBClient = new TileDBClient(
new TileDBLogin(null,
        null,
        "<TILEDB_API_TOKEN>",
        true,
        true,
        true));
        
// Create a TileDBUDF object
TileDBUDF tileDBUDF = new TileDBUDF(tileDBClient, "TileDB-Inc");

// add query ranges
ArrayList<BigDecimal> range1 = new ArrayList<>();
range1.add(BigDecimal.valueOf(1));
range1.add(BigDecimal.valueOf(4));

ArrayList<BigDecimal> range2 = new ArrayList<>();
range2.add(BigDecimal.valueOf(1));
range2.add(BigDecimal.valueOf(4));

QueryRanges queryRanges = new QueryRanges();
queryRanges.addRangesItem(range1);
queryRanges.addRangesItem(range2);

// set name of the udf to use
MultiArrayUDF multiArrayUDF = new MultiArrayUDF();
multiArrayUDF.setUdfInfoName("TileDB-Inc/multi-array-udf");

// create a list of the arrays to participate in the udf
List<UDFArrayDetails> arrays = new ArrayList<>();

//array1
UDFArrayDetails array1 = new UDFArrayDetails();
array1.setUri("tiledb://TileDB-Inc/dense-array");
array1.setRanges(queryRanges);
array1.setBuffers(Arrays.asList("rows", "cols", "a1"));
arrays.add(array1);

//array2
UDFArrayDetails array2 = new UDFArrayDetails();
array2.setUri("tiledb://TileDB-Inc/quickstart_dense");
array2.setRanges(queryRanges);
array2.setBuffers(Arrays.asList("rows", "cols", "a"));
arrays.add(array2);

multiArrayUDF.setArrays(arrays);

// add arguments
HashMap<String,Object> arguments = new HashMap<>();
arguments.put("attr1", "a1");
arguments.put("attr2", "a");

// print result. Could also use: executeMultiArrayJSON(), executeMultiArrayJSONArray(), executeMultiArrayArrow()
System.out.println(tileDBUDF.executeMultiArray(multiArrayUDF, arguments));

You can register a multi-array UDF simply as follows:

import tiledb, tiledb.cloud, numpy

def median(numpy_ordered_dictionary_list):
  # When you have multiple arrays, the parameter 
  # we pass in is actually a list of ordered dictionaries. 
  # The list is in the order of the arrays you asked for.
    return (
        np.median(numpy_ordered_dictionary[0]["a"] + numpy_ordered_dictionary[1]["a"])
    )

tiledb.cloud.login(username="my_username", password="my_password")
# or tiledb.cloud.login(token="my_token")

tiledb.cloud.udf.register_multi_array_udf(median, name="median_multi_array", namespace="my_username")

library(tiledbcloud)

myfunc <- function(df1, df2) {
  median(as.vector(df1[["a"]])) + median(as.vector(df2[["a"]]))
}

tiledb.cloud.login(username="my_username", password="my_password")
# or tiledb.cloud.login(token="my_token")

tiledbcloud::register_udf(namespace=namespace, name="median_multi_array", type='multi_array', func=myfunc)

Currently, registering a UDF is only possible via the Python or R client.

Retry Settings

See Retry Settings.

Universal Data Management

The need to manage data has existed for decades and there are hundreds of different data management solutions available today. What is TileDB and how does it innovate in such a crowded space?

TileDB is the first and only universal data management system.

In this section we explain what we define as "universal data management" and the practical problem it solves. We start with the main motivation behind TileDB.

Motivation

TileDB started as a research project at Intel Labs and MIT in 2014, where we made some observations.

Observations

Organizations work with a lot of disparate data (beyond tables) using a large variety of tools (beyond SQL) and find it challenging to manage and analyze their data at scale.

Data
- Most database systems deal with one type of data, mostly tables.
- Most data out there is not really tabular. Look at satellite imaging, biomedical imaging, video, weather, point cloud, and many others.
- A lot of organizations maintain a lot of diverse data. For example, hospitals have clinical records, but also genomics, MRI scans, etc. Insurance companies may have weather data coupled with satellite imaging. Telecommunications companies may have customer records, in addition to LiDAR and location data.
Tools
- Teams within an organization may be using a variety of programming languages and tools. Some may use SQL, but others may prefer Python or R. Some may wish to perform data analytics, but others may want to run Machine Learning tasks or simply visualize the data.
Collaboration
- Collaboration and governance within and across organizations is very challenging when the data (and code) does not live within a centralized database that would ordinarily manage access policies and maintain logs for auditing.

Since there is no database that can tackle the above challenges, organizations often resort to building specialized data solutions in-house. This takes a lot of time, costs a lot of money, and often involves combining a lot of disparate data software which make data management even more challenging.

Having those observations in mind, we asked ourselves a few questions.

Questions

Can we build a data management system that can store, manage and analyze any data with any tool, and enable collaboration and scalable compute, while embracing future technological advancements?

Data storage
- Is there a single data structure that can capture all data? In our mind, tables are constrained, whereas key-values, documents and graphs do not seem to be efficient models for data like images, video, weather, point clouds, genomics, and more.
- Is there a way to abstract all storage in a way that the system can work on any backend (memory, cloud object store, or other)?
Common layers
- What are the common components of a "data management" solution, regardless of the application domain (e.g., storage layer, an authentication layer, APIs, access control, logging, etc.)? Could these be common in Genomics, Earth Observation, Time Series, etc?
Data access
- Can we abstract all access in a way that the system can efficiently work with any API or tool?
Collaboration
- Can we scale access control to any number of users, anywhere in the world and beyond the limits of a data center?
- Can we share code in a similar way to sharing data? In other words, can we treat code as data?
Monetization
- Can we enable users to sell data and code they share with others?
Scalability
- Can we take multi-tenancy to the extreme, without being constrained by clusters. Can we scale easily and elastically?
- Does a scalable "data infrastructure" also imply a scalable "compute infrastructure"? In other words, if we build the former, can we also gain the foundation for the latter?
Future proofness
- Is there a way to make the system future proof? That is, can we build it in a way that in the future we can easily extend it with any storage backend, any language API, any data analytics and visualization tool, any new hardware, and practically any technological advancement?

And while contemplating about the answers to these questions, we developed TileDB and introduced the concept of universal data management.

Going Universal

There is a need for rearchitecting data management from the ground up, in a universal way that is flexible and extendible to future user needs and technological advancements. Below we describe the most important aspects of TileDB Cloud that make it a universal data management system.

Universal Storage

Universal data management system starts with universal storage.

We needed a single data format and storage engine that could handle all types of data. We observed that multi-dimensional arrays constitute a great candidate for that. We can prove that any data, from biomedical imaging to genomics to SAR to tables to anything, can be modeled very efficiently with (dense or sparse) multi-dimensional arrays. Also multi-dimensional arrays are the currency of data analytics and machine learning, because a lot of advanced mathematical computations (e.g., using Linear Algebra) are applied to vectors, matrices and tensors -- in other words, multi-dimensional arrays!

We developed an efficient multi-dimensional array data format, coupled with a powerful open-source storage engine called TileDB Open Source. In a nutshell, this engine supports:

Dense and sparse multi-dimensional array support
"Columnar" format and compression
Multi-threading and parallel IO
Cloud-optimized implementation
Rapid updates and array slicing
Data versioning and time traveling
Arbitrary metadata stored along with the array data

See the TileDB Open Source docs for more details.

TileDB Cloud relies on TileDB Open Source for data (and code) storage. All code written with TileDB Open Source can be used with TileDB Cloud by changing only a few configuration parameters. This allows the users to test their code locally, and transition to the cloud offering by changing 1-2 lines of code.

Governance

Once we can represent all data on a single format, we are well positioned to define a single layer of authentication and access control, regardless for all data types and application domains.

TileDB authentication and access control works as follows:

The user stores their data in the open-source TileDB Open Source array format on some scalable shared storage backend (e.g., AWS S3).
The user owns the data, TileDB Cloud does not do any hosting. The user only registers the array with TileDB Cloud, granting authentication keys to TileDB Cloud for accessing the data.
The user can create organizations, and share data and code with other organizations and users with various access policies. There is no bound on the number of users and organizations one can share data and code with. Users can collaborate with anyone within or beyond their organization.
When data and code is accessed, TileDB Cloud is responsible for securely checking and enforcing all the appropriate access policies.

There is no need to manage IAM roles, or Apache Sentry/Range setups anymore. TileDB Cloud handles everything transparently.

In addition, TileDB Cloud allows users to make data and code public, attaching descriptions, metadata and arbitrary tags. The data and code can then be discovered and used by any other TileDB Cloud user on the planet.

Logging

All access to arrays and code is logged and can be viewed for auditing purposes. TileDB Cloud allows users to keep track of how their shared or public arrays are being used and gain valuable insights.

Interoperability

Once authentication, access control and logging is pushed all the way down to storage, all APIs and integrations can inherit it.

TileDB Cloud enjoys the extreme interoperability offered by TileDB Emebedded (i.e., numerous language APIs and tool integrations). In addition, TileDB Cloud is constantly being extended to support more languages and tools for the added cloud features it provides (e.g., see serverless compute and Jupyter notebooks).

All the APIs and integrations (existing and future ones) inherit the authentication, access control and logging functionality we built directly on top of array storage. In other words, modeling all data universally as arrays allowed us to build a single layer for authentication, access control and logging, instead of building custom support for all the data types and APIs/tools used across different applications.

Side Benefits of Universality

Building a universal data management system that can provide extreme multi-tenancy and scale requires building an entire distributed system infrastructure from scratch. Our implementation revealed additional capabilities that proved to be very valuable for scalable data analytics, ease of use, extracting monetary value from data and code and remaining relevant in the rapidly paced data technology space. Therefore, we gradually exposed these capabilities within TileDB Cloud, described below.

Serverless Compute

We outlined the following requirements around accessing arrays registered with TileDB Cloud:

Any user on the planet with appropriate access policies should be able to access data at any time.
There should be no limit on how many users can simultaneously access an array.
The user who accesses the array should not be responsible for spinning up dedicated machines.
The user who shares the array should not be responsible for spinning up dedicated machines.

The architecture we built to meet these requirements resulted in the following:

Totally "serverless" compute from the user's stand point. Any request "just works" without reserving resources in advance.
TileDB Cloud uses an elastic compute infrastructure, which automatically expands and shrinks based on user demand.
The user is charged in a pay-as-you-go fashion, and only for compute and data egress they consume.
The compute is sent to the data, respecting geographical cloud storage regions to eliminate egress cloud provider costs and maximize performance.

Once a scalable, elastic and serverless compute infrastructure is built, the possibilities around offering computational capabilities are limitless.

Users wanted to do more than just slicing arrays. For example, they wished to run advanced SQL queries and user-defined functions (UDFs), i.e., arbitrary code in Python, R or other language, potentially using external libraries and integrations, and manipulating the data efficiently, securely and inexpensively. In the most general scenario, users wanted to create task graphs, i.e., task workflows that can implement sophisticated distributed algorithms to take advantage of the computational power and ease of use of TileDB Cloud. But this functionality could readily be provided by the infrastructure we built. Therefore, we optimized it and exposed it.

We also took this one step further. SQL queries and UDFs are runnable code. But it is also shareable data. We stored UDFs as TileDB arrays (recall that arrays can model any data) and we unlocked all the TileDB Cloud capabilities even for UDFs, such as sharing, logging and exploration of public code.

TileDB Cloud unifies diverse data management with diverse analytics and collaboration in a single powerful platform.

Jupyter Notebooks

A lot of data scientists find it convenient to use Jupyter notebooks for writing code and performing exploratory analysis. TileDB Cloud allows launching JupyterLab instances within the online console. The instances come with prepackaged images that include useful libraries, but the user can also install any library inside the Jupyter environment. TileDB Cloud handles this within its distributed infrastructure, without requiring the user to manually deploy servers.

Jupyter notebooks have become a standard tool for scientific analysis and reproducibility. Therefore, TileDB Cloud allows users to share notebooks in the same manner as arrays. Users can also make notebooks public, or explore notebooks that others have shared with the world.

TileDB Cloud provides an easy, efficient and inexpensive platform for scientific analysis, collaboration and reproducibility.

Marketplace

TileDB Cloud logs everything for auditing purposes. The tasks, duration, cost, resources used, user information, etc. As such, it had all the functionality needed to develop a full-fledged marketplace to allow users monetize their code and data based on the usage from other users. TileDB Cloud integrates with Stripe and handles all billing and accounting for users that wish to sell or buy data and code on a pay-as-you-go-fashion.

This is convenient for several reasons:

Sellers do not need to ship potentially huge quantities of data to buyers.
Sellers do not need to build their own infrastructure to serve data and code, as well as perform all billing and accounting.
Buyers can perform exploration and analysis on data from multiple vendors in a single platform.
Buyers do not need to download and host the potentially massive quantities they are purchasing.
A pay-as-you-go model is an alternative, more flexible model to the standard annual license model, which may be more economical for both buyers (due to scale) and sellers (due to paying only for what they use).

Future Proofness

A universal data management system can adapt to technological change.

So far we have described that TileDB Cloud is universal in the following respects:

Data: TileDB can manage any data type, and hence can support any future data type.
Storage backends: TileDB abstracts the backend layer and thus can easily add support for new backends (on the cloud, in memory, or other).
APIs and tools: TileDB is all about extreme interoperability and it is designed to easily add support for any new popular language and tool.
Deployment: TileDB is cloud and data center agnostic and therefore can be deployed anywhere.
Hardware: TileDB is being implemented in a way that can benefit from hardware accelerators, and boost performance in clusters with heterogeneous instances.
Algorithms: TileDB allows the development of any arbitrary distributed algorithm (from SQL to Linear Algebra to genomics pipelines), which can easily be shared and improved through collaboration.

To sum up, TileDB Cloud is flexible and can adapt to change throughout its lifetime in an organization's software stack. User requirements and creativity around data processing continually increase. TileDB Cloud remains valuable and relevant by evolving based on user feedback, rather than becoming obsolete.

TileDB Cloud

Welcome to TileDB Cloud!

What is TileDB Cloud?

Is TileDB Cloud for you?

Use Cases

Capabilities

Getting Started

How to Use the Docs

TileDB Cloud Enterprise

Looking for an on-prem solution?

More Help

Tutorials

Start Here!

Sign Up & Sign In

Access a Public Array

Access via the TileDB Cloud Client

Using the TileDB Cloud Console

What's Next?

Product Tour

Namespace selector

Notifications

Overview

Launch server

Assets sidebar

Asset browsers

Assets

Asset types

Overview

Preview

Schema

Metadata

Sharing

Settings

Actions

Adding assets

Activity

Profile

Serverless Compute 101

Running on TileDB Cloud

Running on your own client

Task Graphs 101

Running on TileDB Cloud

Running on your own client

Use Cases

LiDAR

LiDAR Quickstart

Running on TileDB Cloud

Running on your own client

Concepts

Serverless Compute

Console and API

TileDB Cloud Internals

Architecture

Automatic Redirection

Orchestration

UI Console

System State

REST Workers

Jupyter Notebooks

Connectivity (Firewall) Requirements

Open ID Connect

Array Access

Access Control and Logging

Serverless SQL

Serverless UDFs

Task Graphs

Jupyter Notebooks

Marketplace

How To

Account

Create API Tokens

Set Up Credentials

AWS Access Keys

AWS Assume Role

AWS S3 Bucket Policy:

Setup KMS for the target bucket

Edit Your Profile

Set Up Default Storage Path

Manage Billing

Arrays