This part of the tutorials is a work in progress. We will soon add tutorials for each of the use cases we are working on, such as dataframes, LiDAR, genomics, biomedical imaging, satellite imaging, weather, time series, and many more. Stay tuned!
Last updated May 12th
Data management made universal
TileDB Cloud is the commercial platform built by the TileDB team that allows you and your organization to unify all types of data, automate distributed analysis and pipelines, and securely explore and share data and code, while enjoying extreme interoperability with programming languages and data science tools.
TileDB Cloud takes a radically different approach from the rest of the data management landscape. Instead of dealing with a large number of different data formats and special-purpose databases, TileDB Cloud builds a unified data management stack: it stores all types of data in a single unified format, pushes access control, logging and a growing set of computational primitives down to storage, and emphasizes integrations with every popular programming language and computational tool. And it does all that while providing a 100% serverless experience to the user.
TileDB Cloud is based on the TileDB Open Source universal storage engine, which models and efficiently stores all data as (dense or sparse) multi-dimensional arrays, and provides a common API together with a large number of language bindings and tool integrations.
TileDB Cloud is ideal for you if you struggle with:
Data storage and access
inefficient files and domain-specific formats
wrangling data across tools and languages
controlling and monitoring access to data and code
sharing data and code at extreme scale
Mixed data and workloads
different data types (e.g., dataframes, images, genomics, etc.)
combination of SQL, data science and Machine Learning
Scalability and deployment
scaling analysis easily and inexpensively
setting up and monitoring machine clusters
managing numerous disparate data solutions and silos
Finding and contributing public data and code
sharing files in cloud buckets and code in repos
easily and reproducibly running code on different data
monitoring usage stats
Monetizing data and code
having to build an entire infrastructure to sell data and code
TileDB Cloud (SaaS) currently works only on AWS. We are working on adding multi-cloud support, namely for Azure and Google Cloud. See TileDB Cloud Enterprise if you are interested in hosting TileDB Cloud in your own environment.
Our team has expertise in a variety of application domains for which you can use TileDB Cloud:
Geospatial
point cloud (LiDAR, SONAR, AIS)
SAR
optical imaging
Genomics
population genomics
single-cell multi-omics
Dataframes
any tabular data, accessible with various APIs as well as SQL
Time series
any data that could benefit from indexing on date/time fields
Biomedical imaging
any imaging data requiring pyramid structures
Many others
video, automotive, telecommunications, etc.
Data and code management: Manage all your data (modeled as multi-dimensional arrays) and code (UDFs and notebooks) in a single platform.
Access control and sharing: Securely share your data and code with access policies.
Logging and auditing: See all the activity on your data and code from detailed audit logs.
Organizations: Create organizations and define different access policies for data and code.
Jupyter notebooks: Create, share and spin up Jupyter notebooks directly in the platform with a few clicks.
Serverless SQL and UDFs: You can run any SQL query on any array, without having to provision any clusters. You can also define and register any user-defined function and share it with others. Registered functions run in a serverless manner (similar to lambdas).
Serverless task graphs: Create any pipeline or any sophisticated distributed algorithm with TileDB's task graphs. TileDB executes the various tasks in the graph in parallel while respecting the dependencies, without forcing the user to create or define any clusters.
Data and code marketplace: Take advantage of TileDB's full-fledged marketplace (integrating with Stripe) and monetize your data and code based on egress or CPU time.
The best way to get started is to sign up and run the Start Here! tutorial. You can find a constantly growing number of tutorials in the TUTORIALS page group found in the left navigation menu of these docs.
If you'd like to take a deep dive into the TileDB Cloud internals, you can navigate to CONCEPTS in the left navigation menu. You can also always consult the HOW TO guides and API REFERENCE.
To make it easy to understand where to find what you are looking for, the documentation is structured in the following sections:
Tutorials: A series of steps to address key problems and use cases
Concepts: Background information and explanation of key topics and concepts
How To: Short how-to guides based on FAQ
API Reference: Technical reference to the client APIs
TileDB Cloud is available as a customer-hosted instance to address enterprise security policies and governance mandates. Learn more about TileDB Cloud Enterprise.
If you do not find the information you need in these docs, you can get more help through a variety of channels:
Visit our forum and post a question
Join our Slack community and post comments
Request a feature if TileDB Cloud is missing important functionality
You can always contact us through our various communication channels
In this tutorial, you will learn:
How to access (slice) a public array.
How to perform a SQL query on a public array.
How to perform serverless UDFs on a public array.
We will use public TileDB Cloud array TileDB-Inc/MBTA_Average_Monthly_Ridership_by_Mode, which stores the data from the Boston MBTA's Open Data Portal.
You can preview this tutorial as a TileDB Cloud notebook (no login is needed). You can also easily launch it within the TileDB Cloud UI console, but you will need to sign up / log in to do so.
You can run all the commands of this notebook in your own client. The only changes required are:
One of the most powerful features of TileDB Cloud is that it allows users to share arrays, UDFs and notebooks at extreme scale, with anyone on the planet, and with diverse policies (e.g., read, write, read/write). There are no restrictions on the number of users that data and code can be shared with.
Currently, TileDB Cloud supports access policies at the array level. However, soon it will support finer-grained access policies at the cell level.
TileDB Cloud also enables users to create organizations, in order to better manage access to their assets and manage billing. You can create any number of organizations.
TileDB Cloud maintains a global system state using MariaDB, recording all information required to know which assets belong to which users and who has access to the various assets.
TileDB Cloud logs everything: the task types, the users that initiated them, duration, cost, etc. All this information gets logged by the REST workers into the persistent and encrypted MariaDB instance. The activity can then be browsed on the TileDB Cloud UI console or retrieved programmatically using the TileDB Cloud client. Six months of logs are made available for instant retrieval. Contact us if you need longer retention or ways to perform offline audits of historical logs for your organization.
By default, sessions on TileDB Cloud will time out after 8 hours. SSO session timeout is controlled by organizational policies.
TileDB Cloud enables the user to launch Jupyter notebooks within the UI console. It spins up Jupyter notebook instances in the Kubernetes cluster in us-east-1. The user can install any extra packages in the notebook. The notebook server environment is destroyed on shutdown, so any extra packages installed will not persist across server instances.
Every user gets 2GB of persistent storage in an EBS volume (also in us-east-1). This is mounted as the home directory in the notebook server. All contents in the home directory will persist across server restarts. The user does not get charged for storage!
Currently, TileDB offers two notebook server sizes:
| Size | CPUs | Memory |
| --- | --- | --- |
| Small | 2 | 2GB |
| Large | 16 | 60GB |
As explained in the Pricing and Billing section, notebooks are charged based on the size of the notebook server and duration it is run for.
Currently notebook usage is charged either to an organization a user belongs to or, if the user is not part of an organization, to the user themselves. We plan a future improvement to allow selecting who to charge for the notebook usage.
TileDB Cloud offers three notebook images, with the following installed packages:
Basic Data Science: tiledb, libtiledb-sql-py, plotly, ipywidgets, graphviz, pandas, pydot, trimesh, numpy, chardet, numba, tiledb-r, voila, opencv, tiledb-cloud, pybabylonjs, envbash, tiledb-ml
Genomics: Everything in the Basic Data Science notebook plus: snakemake, tiledb-vcf, htslib, bcftools, pybedtools
Geospatial: Everything in the Basic Data Science notebook plus: cartopy, datashader, descartes, folium, geos, geotiff, holoviews, imagemagick, laszip, libnetcdf, proj, shapely, scikit-build, gdal, rasterio, mb-system, pdal, fiona, geopandas, scikit-mobility, xarray, tiledb-segy, capella-tools
The need to manage data has existed for decades and there are hundreds of different data management solutions available today. What is TileDB and how does it innovate in such a crowded space?
TileDB is the first and only universal data management system.
In this section we explain what we define as "universal data management" and the practical problem it solves. We start with the main motivation behind TileDB.
TileDB started as a research project at Intel Labs and MIT in 2014, where we made some observations.
Organizations work with a lot of disparate data (beyond tables) using a large variety of tools (beyond SQL) and find it challenging to manage and analyze their data at scale.
Data
Most database systems deal with one type of data, mostly tables.
Most data out there is not really tabular. Look at satellite imaging, biomedical imaging, video, weather, point cloud, and many others.
A lot of organizations maintain a lot of diverse data. For example, hospitals have clinical records, but also genomics, MRI scans, etc. Insurance companies may have weather data coupled with satellite imaging. Telecommunications companies may have customer records, in addition to LiDAR and location data.
Tools
Teams within an organization may be using a variety of programming languages and tools. Some may use SQL, but others may prefer Python or R. Some may wish to perform data analytics, but others may want to run Machine Learning tasks or simply visualize the data.
Collaboration
Collaboration and governance within and across organizations is very challenging when the data (and code) does not live within a centralized database that would ordinarily manage access policies and maintain logs for auditing.
Since there is no database that can tackle the above challenges, organizations often resort to building specialized data solutions in-house. This takes a lot of time, costs a lot of money, and often involves combining a lot of disparate data software which make data management even more challenging.
Having those observations in mind, we asked ourselves a few questions.
Can we build a data management system that can store, manage and analyze any data with any tool, and enable collaboration and scalable compute, while embracing future technological advancements?
Data storage
Is there a single data structure that can capture all data? In our mind, tables are constrained, whereas key-values, documents and graphs do not seem to be efficient models for data like images, video, weather, point clouds, genomics, and more.
Is there a way to abstract all storage in a way that the system can work on any backend (memory, cloud object store, or other)?
Common layers
What are the common components of a "data management" solution, regardless of the application domain (e.g., storage layer, an authentication layer, APIs, access control, logging, etc.)? Could these be common in Genomics, Earth Observation, Time Series, etc?
Data access
Can we abstract all access in a way that the system can efficiently work with any API or tool?
Collaboration
Can we scale access control to any number of users, anywhere in the world and beyond the limits of a data center?
Can we share code in a similar way to sharing data? In other words, can we treat code as data?
Monetization
Can we enable users to sell data and code they share with others?
Scalability
Can we take multi-tenancy to the extreme, without being constrained by clusters? Can we scale easily and elastically?
Does a scalable "data infrastructure" also imply a scalable "compute infrastructure"? In other words, if we build the former, can we also gain the foundation for the latter?
Future proofness
Is there a way to make the system future proof? That is, can we build it in a way that in the future we can easily extend it with any storage backend, any language API, any data analytics and visualization tool, any new hardware, and practically any technological advancement?
While contemplating the answers to these questions, we developed TileDB and introduced the concept of universal data management.
There is a need for rearchitecting data management from the ground up, in a universal way that is flexible and extendible to future user needs and technological advancements. Below we describe the most important aspects of TileDB Cloud that make it a universal data management system.
A universal data management system starts with universal storage.
We needed a single data format and storage engine that could handle all types of data. We observed that multi-dimensional arrays constitute a great candidate for that. We can prove that any data, from biomedical imaging to genomics to SAR to tables to anything, can be modeled very efficiently with (dense or sparse) multi-dimensional arrays. Also, multi-dimensional arrays are the currency of data analytics and machine learning, because a lot of advanced mathematical computations (e.g., using Linear Algebra) are applied to vectors, matrices and tensors -- in other words, multi-dimensional arrays!
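To make the dense versus sparse distinction concrete, here is a minimal plain-Python sketch. It does not use the TileDB API; the variable names, the sample values and the helpers `slice_dense` and `slice_sparse` are all hypothetical and only illustrate the data model: a dense array stores a value for every cell, while a sparse array stores only the non-empty cells together with their coordinates.

```python
# Dense array: every cell in the domain has a value,
# e.g. a tiny 3x4 grayscale image (values are made up).
image = [
    [0, 10, 20, 30],
    [40, 50, 60, 70],
    [80, 90, 100, 110],
]


def slice_dense(array, rows, cols):
    """Return the sub-array for half-open (start, stop) row and column ranges."""
    return [row[cols[0]:cols[1]] for row in array[rows[0]:rows[1]]]


# Sparse array: only non-empty cells are stored, keyed by their coordinates.
# Here a small table of point readings is indexed by (x, y) with one attribute.
points = {
    (2, 7): {"elevation": 12.5},
    (5, 1): {"elevation": 3.0},
    (9, 9): {"elevation": 41.2},
}


def slice_sparse(array, x_range, y_range):
    """Return the non-empty cells whose coordinates fall in the inclusive ranges."""
    return {
        (x, y): attrs
        for (x, y), attrs in array.items()
        if x_range[0] <= x <= x_range[1] and y_range[0] <= y <= y_range[1]
    }


dense_result = slice_dense(image, rows=(0, 2), cols=(1, 3))
sparse_result = slice_sparse(points, x_range=(0, 5), y_range=(0, 9))
print(dense_result)   # [[10, 20], [50, 60]]
print(sparse_result)  # {(2, 7): {'elevation': 12.5}, (5, 1): {'elevation': 3.0}}
```

In both cases a query is a range per dimension, which is why the same slicing interface can serve images, tables and point data alike.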
We developed an efficient multi-dimensional array data format, coupled with a powerful open-source storage engine called TileDB Open Source. In a nutshell, this engine supports:
Dense and sparse multi-dimensional array support
"Columnar" format and compression
Multi-threading and parallel IO
Cloud-optimized implementation
Rapid updates and array slicing
Data versioning and time traveling
Arbitrary metadata stored along with the array data
See the TileDB Open Source docs for more details.
TileDB Cloud relies on TileDB Open Source for data (and code) storage. All code written with TileDB Open Source can be used with TileDB Cloud by changing only a few configuration parameters. This allows the users to test their code locally, and transition to the cloud offering by changing 1-2 lines of code.
Once we can represent all data in a single format, we are well positioned to define a single layer of authentication and access control, regardless of data type and application domain.
TileDB authentication and access control work as follows:
The user stores their data in the TileDB Open Source array format on some scalable shared storage backend (e.g., AWS S3).
The user owns the data; TileDB Cloud does not do any hosting. The user only registers the array with TileDB Cloud, granting authentication keys to TileDB Cloud for accessing the data.
The user can create organizations, and share data and code with other organizations and users with various access policies. There is no bound on the number of users and organizations one can share data and code with. Users can collaborate with anyone within or beyond their organization.
When data and code is accessed, TileDB Cloud is responsible for securely checking and enforcing all the appropriate access policies.
There is no need to manage IAM roles or Apache Sentry/Ranger setups anymore. TileDB Cloud handles everything transparently.
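The flow above can be sketched as a small array-level policy check. This is an illustrative model only, not TileDB Cloud's implementation: the `RegisteredArray` class, the `is_allowed` function, the example bucket URI, and the rule that a public array grants read access to everyone are all assumptions made for the sketch.

```python
from dataclasses import dataclass, field

READ, WRITE = "read", "write"


@dataclass
class RegisteredArray:
    """A registered array; the data itself stays in the owner's storage."""
    uri: str
    owner: str
    public: bool = False
    # Maps a user or organization name to the set of actions it was granted.
    shares: dict = field(default_factory=dict)

    def share(self, grantee, *actions):
        """Grant one or more actions to a user or an organization."""
        self.shares.setdefault(grantee, set()).update(actions)


def is_allowed(array, user, action, memberships=()):
    """Return True if the user, or one of their organizations, may perform the action."""
    if user == array.owner:
        return True
    if array.public and action == READ:  # assumption: public means world-readable
        return True
    principals = (user, *memberships)
    return any(action in array.shares.get(p, set()) for p in principals)


arr = RegisteredArray(uri="s3://example-bucket/ridership", owner="alice")
arr.share("bob", READ)                # read-only policy for a single user
arr.share("acme-org", READ, WRITE)    # read/write policy for an organization

print(is_allowed(arr, "bob", READ))                               # True
print(is_allowed(arr, "bob", WRITE))                              # False
print(is_allowed(arr, "carol", WRITE, memberships=("acme-org",)))  # True
print(is_allowed(arr, "dave", READ))                              # False
```

Because the check is keyed on the array rather than on a storage path or a tool, every API that reaches the array goes through the same decision.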
In addition, TileDB Cloud allows users to make data and code public, attaching descriptions, metadata and arbitrary tags. The data and code can then be discovered and used by any other TileDB Cloud user on the planet.
All access to arrays and code is logged and can be viewed for auditing purposes. TileDB Cloud allows users to keep track of how their shared or public arrays are being used and gain valuable insights.
Once authentication, access control and logging are pushed all the way down to storage, all APIs and integrations can inherit them.
TileDB Cloud enjoys the extreme interoperability offered by TileDB Embedded (i.e., numerous language APIs and tool integrations). In addition, TileDB Cloud is constantly being extended to support more languages and tools for the added cloud features it provides (e.g., see serverless compute and Jupyter notebooks).
All the APIs and integrations (existing and future ones) inherit the authentication, access control and logging functionality we built directly on top of array storage. In other words, modeling all data universally as arrays allowed us to build a single layer for authentication, access control and logging, instead of building custom support for all the data types and APIs/tools used across different applications.
Building a universal data management system that can provide extreme multi-tenancy and scale requires building an entire distributed system infrastructure from scratch. Our implementation revealed additional capabilities that proved to be very valuable for scalable data analytics, ease of use, extracting monetary value from data and code and remaining relevant in the rapidly paced data technology space. Therefore, we gradually exposed these capabilities within TileDB Cloud, described below.
We outlined the following requirements around accessing arrays registered with TileDB Cloud:
Any user on the planet with appropriate access policies should be able to access data at any time.
There should be no limit on how many users can simultaneously access an array.
The user who accesses the array should not be responsible for spinning up dedicated machines.
The user who shares the array should not be responsible for spinning up dedicated machines.
The architecture we built to meet these requirements resulted in the following:
Totally "serverless" compute from the user's standpoint. Any request "just works" without reserving resources in advance.
TileDB Cloud uses an elastic compute infrastructure, which automatically expands and shrinks based on user demand.
The user is charged in a pay-as-you-go fashion, and only for compute and data egress they consume.
The compute is sent to the data, respecting geographical cloud storage regions to eliminate egress cloud provider costs and maximize performance.
Once a scalable, elastic and serverless compute infrastructure is built, the possibilities around offering computational capabilities are limitless.
Users wanted to do more than just slice arrays. For example, they wished to run advanced SQL queries and user-defined functions (UDFs), i.e., arbitrary code in Python, R or another language, potentially using external libraries and integrations, and manipulating the data efficiently, securely and inexpensively. In the most general scenario, users wanted to create task graphs, i.e., task workflows that can implement sophisticated distributed algorithms to take advantage of the computational power and ease of use of TileDB Cloud. This functionality could readily be provided by the infrastructure we had built. Therefore, we optimized it and exposed it.
We also took this one step further. SQL queries and UDFs are runnable code, but they are also shareable data. We stored UDFs as TileDB arrays (recall that arrays can model any data) and we unlocked all the TileDB Cloud capabilities even for UDFs, such as sharing, logging and exploration of public code.
TileDB Cloud unifies diverse data management with diverse analytics and collaboration in a single powerful platform.
A lot of data scientists find it convenient to use Jupyter notebooks for writing code and performing exploratory analysis. TileDB Cloud allows launching JupyterLab instances within the online console. The instances come with prepackaged images that include useful libraries, but the user can also install any library inside the Jupyter environment. TileDB Cloud handles this within its distributed infrastructure, without requiring the user to manually deploy servers.
Jupyter notebooks have become a standard tool for scientific analysis and reproducibility. Therefore, TileDB Cloud allows users to share notebooks in the same manner as arrays. Users can also make notebooks public, or explore notebooks that others have shared with the world.
TileDB Cloud provides an easy, efficient and inexpensive platform for scientific analysis, collaboration and reproducibility.
TileDB Cloud logs everything for auditing purposes: the tasks, duration, cost, resources used, user information, etc. As such, it had all the functionality needed to develop a full-fledged marketplace that allows users to monetize their code and data based on the usage from other users. TileDB Cloud integrates with Stripe and handles all billing and accounting for users that wish to sell or buy data and code in a pay-as-you-go fashion.
This is convenient for several reasons:
Sellers do not need to ship potentially huge quantities of data to buyers.
Sellers do not need to build their own infrastructure to serve data and code, as well as perform all billing and accounting.
Buyers can perform exploration and analysis on data from multiple vendors in a single platform.
Buyers do not need to download and host the potentially massive quantities they are purchasing.
A pay-as-you-go model is a more flexible alternative to the standard annual license model, and it may be more economical for both buyers (who pay only for what they use) and sellers (who benefit from scale).
A universal data management system can adapt to technological change.
So far we have described that TileDB Cloud is universal in the following respects:
Data: TileDB can manage any data type, and hence can support any future data type.
Storage backends: TileDB abstracts the backend layer and thus can easily add support for new backends (on the cloud, in memory, or other).
APIs and tools: TileDB is all about extreme interoperability and it is designed to easily add support for any new popular language and tool.
Deployment: TileDB is cloud and data center agnostic and therefore can be deployed anywhere.
Hardware: TileDB is being implemented in a way that can benefit from hardware accelerators, and boost performance in clusters with heterogeneous instances.
Algorithms: TileDB allows the development of any arbitrary distributed algorithm (from SQL to Linear Algebra to genomics pipelines), which can easily be shared and improved through collaboration.
To sum up, TileDB Cloud is flexible and can adapt to change throughout its lifetime in an organization's software stack. User requirements and creativity around data processing continually increase. TileDB Cloud remains valuable and relevant by evolving based on user feedback, rather than becoming obsolete.
TileDB Cloud allows you to run any lambda-like user-defined function (UDF). More specifically, you write the code on your laptop using the TileDB Cloud client (see Installation), and your function gets shipped to and executed on stateless TileDB Cloud REST workers. You get charged only for the time it took to run the function and the amount of data that got returned to your laptop. You do not need to worry about launching or managing any computational resources.
There are two types of supported UDFs:
Generic: These can include any code.
Array UDFs: These are UDFs that are applied to slices of one or more arrays.
Running UDFs is particularly useful if you want to perform reductions (such as a sum or an average), since the amount of data returned is very small regardless of how much data the UDF processes.
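The two UDF shapes, and the reason reductions keep the returned data small, can be illustrated locally. This sketch runs entirely on the client and does not call TileDB Cloud; the function names, the `ridership` attribute, the generated values, and the assumption that an array UDF receives its slice as a dict of attribute name to values are all hypothetical.

```python
import json
import statistics


def generic_udf(values):
    """Generic UDF: arbitrary code that does not depend on any array."""
    return statistics.median(values)


def array_udf(data):
    """Array UDF: receives the sliced cells and reduces them to one number.

    `data` is assumed to be a dict mapping an attribute name to its values.
    """
    return sum(data["ridership"]) / len(data["ridership"])


# Hypothetical stand-in for a slice of an array. On the platform the slice
# would be read next to the UDF instead of being built on the client.
array_slice = {"ridership": [float(i % 1000) for i in range(100_000)]}

mean = array_udf(array_slice)
slice_bytes = len(json.dumps(array_slice))  # size if the raw slice were returned
result_bytes = len(json.dumps(mean))        # size of the reduced result

print(mean)                        # 499.5
print(result_bytes < slice_bytes)  # True
```

The reduced result is a handful of bytes no matter how many cells the slice contains, which is what keeps the data returned to the client, and therefore the egress charge, small.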
TileDB Cloud currently supports only Python and R UDFs, but support for more languages will be added soon.
TileDB Cloud runs your UDF in a separate dedicated container for security. Any array access is executed in parallel on the same REST worker but in separate containers, and the results are sent to the UDF container using zero-copy techniques for performance.
We offer Python and R UDF images based on the following versions:
Python 3.9.15
R 4.3.2
Python 3.7.12 (Deprecated)
Python 3.7 is deprecated in User Defined Functions and is no longer updated as of January 31st, 2024. User Defined Functions registered under Python 3.7 will continue to be available for execution with the packages listed on this page until August 2024.
In the default environment in which the UDF runs, we include the following Python packages:
| Package | Python 3.9 | Python 3.7 (Deprecated) |
| --- | --- | --- |
| numpy | 1.23.5 | 1.21.6 |
| pandas | 1.5.3 | 1.3.5 |
| tensorflow | 2.11.0 | 1.14.0 |
| numexpr | 2.8.7 | 2.8.3 |
| numba | 0.59.1 | 0.56.3 |
| xarray | 2024.3.0 | 0.20.2 |
| tiledb | 0.30.2 | 0.23.4 |
| scipy | 1.13.1 | 1.7.3 |
| boto3 | 1.34.106 | 1.25.0 |
| tiledbvcf | 0.32.0 | 0.26.6 |
| tiledbsoma | 1.12.3 | 1.5.2 |
| cellxgene-census | 1.14.1 | 1.9.0 |
In the default environment in which the UDF runs, we include the following R packages:
| Package | Version |
| --- | --- |
| Rcpp | 1.0.12 |
| tiledb | 0.28.2 |
| tiledbsoma | 1.12.3 |
| curl | 5.2.1-1 |
| RcppSpdlog | 0.0.17-1 |
| jsonlite | 1.8.8-1 |
| base64enc | 0.1-3 |
| R6 | 2.5.1 |
| httr | 1.4.7 |
| mmap | 0.6-22 |
| remotes | 2.4.2.1 |
| SeuratObject | 5.0.2 |
| BiocManager | 1.30.22 |
| SingleCellExperiment | 1.26.0 |
The geospatial image (geo) is based on the Python images and includes the following packages:
| Package | Version |
| --- | --- |
| PDAL | 3.4.3 |
| rasterio | 1.4.a1 |
| fiona | 1.9.5 |
| geopandas | 0.14.4 |
| scikit-mobility | 1.1.2 |
| xarray | 2024.2.0 |
| tiledb-cf | 0.9.1 |
| tiledb-segy | 0.3.0 |
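The genomics image (genomics) is based on the Python images and includes the following packages:
| Package | Python 3.9 | Python 3.7 (Deprecated) |
| --- | --- | --- |
| bwa | 0.7.18 | 0.7.17 |
| java-jdk | 8.0.112 | 1.8.0.112 |
| picard | 3.0.0 | 3.0.0 |
| samtools | 1.19.0 | 1.16.1 |
| sra-tools | 3.1.1 | 3.0.9 |
| gatk4 | 4.3.0.0 | N/A |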
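The imaging image (imaging-dev) is based on the Python images and includes the following packages:
| Package | Python 3.9 | Python 3.7 (Deprecated) |
| --- | --- | --- |
| tiledb-bioimg | 0.2.11 | 0.2.7 |
| scikit-image | 0.22.0 | 0.22.0 |
| openslide | 3.4.1 | 3.4.1 |
| openslide-python | 1.3.1 | 1.3.1 |
| simpleitk | 1.19.0 | 1.16.1 |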
The vector search image (vectorsearch) is based on the Python images and includes the following packages:
| Package | Version |
| --- | --- |
| tiledb-vector-search | 0.7.0 |
| langchain | 0.1.20 |
| langchain-openai | 0.0.8 |
| huggingface_hub | 0.23.4 |
| openai | 1.14.3 |
| pypdf | 3.17.4 |
| beautifulsoup4 | 4.12.3 |
| tiktoken | 0.5.2 |
| PyMuPDF | 1.24.7 |
| transformers | 4.42.4 |
| orjsonl | 1.0.0 |
If you would like additional packages added to the UDF environment, please leave your suggestion on our feedback request board.
Each UDF can run with one of the following resource configurations:
| Name | CPUs | Memory |
| --- | --- | --- |
| standard (Default) | 2 | 2GB |
| large | 8 | 8GB |
In the future, TileDB Cloud will offer more flexibility in choosing the types of resources to run the UDF on.
By default, all UDFs time out after 15 minutes. This value can be configured when submitting a UDF by using the timeout parameter.
When it comes to scalable analytics, we observed the following challenges:
Spinning up and monitoring clusters on the cloud is cumbersome and can get expensive.
Users frequently do not know how many machines to provision in a cluster for a given workload. This results in either under provisioning that impacts performance, or over provisioning that leads to wasted cost due to idle compute.
When users slice array data from TileDB Cloud only to further process it in their own compute environment, (1) they get charged for egress, and (2) the performance is impacted by the extra network transmission cost that occurs between the TileDB Cloud machines and their own machines.
TileDB Cloud allows users to access and compute on arrays in a serverless manner from the user's standpoint, i.e., without thinking about provisioning for machines, and paying for idle compute or unnecessary egress. TileDB Cloud automatically parallelizes all tasks across thousands of machines and monitors their progress.
TileDB Cloud supports the following tasks:
Array writing or reading, e.g., basic ingestion and slicing.
SQL queries, from simple selections and filters, to aggregate queries and joins.
User-defined functions (UDFs), i.e., arbitrary code in Python, R or other languages.
Users can submit numerous such tasks concurrently and TileDB Cloud will process all of them in parallel, elastically expanding and shrinking its computational resources on demand without supervision by the user. That is, TileDB Cloud provides extreme multi-tenancy by default.
Serverless SQL and array UDFs (i.e., UDFs that are specifically applied to one or more TileDB arrays) have the additional benefit that they can minimize egress by reducing the returned results size, which is true especially in aggregation queries.
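The effect of aggregation on result size can be seen with any SQL engine. The sketch below uses Python's built-in `sqlite3` module purely as a local stand-in; it is not how TileDB Cloud executes SQL, and the `ridership` table, its columns and all of its values are made up for illustration.

```python
import sqlite3

# In-memory database with a small, made-up ridership table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ridership (month TEXT, mode TEXT, riders INTEGER)")
rows = [
    (f"2023-{m:02d}", mode, 1000 * m + offset)
    for m in range(1, 13)
    for mode, offset in (("bus", 0), ("rail", 500))
]
conn.executemany("INSERT INTO ridership VALUES (?, ?, ?)", rows)

# Plain selection: every matching row travels back to the client.
full = conn.execute("SELECT month, mode, riders FROM ridership").fetchall()

# Aggregation: the engine reduces the rows first and returns one row per mode.
agg = conn.execute(
    "SELECT mode, AVG(riders) FROM ridership GROUP BY mode ORDER BY mode"
).fetchall()

print(len(full))  # 24
print(agg)        # [('bus', 6500.0), ('rail', 7000.0)]
conn.close()
```

The aggregate query returns two rows regardless of how many months are stored, so the data that leaves the engine stays constant while the input grows.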
Any distributed algorithm can be modeled as a directed graph, where the nodes represent atomic tasks and the edges represent task dependencies (i.e., a task cannot begin its execution before all the tasks on its incoming edges have completed their execution). TileDB Cloud supports such task graphs, which can be programmatically created by the user and submitted to the platform. TileDB Cloud is responsible for parallelizing all tasks while respecting the dependencies, and for monitoring all progress. Task graphs are a powerful tool for creating any sophisticated algorithm and scaling it on TileDB Cloud.
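The scheduling rule described above (run independent tasks in parallel, start a task only once everything it depends on has finished) can be sketched with the Python standard library. This is a local illustration using threads, not the TileDB Cloud task graph API; `run_task_graph` and the task names are hypothetical.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait
from graphlib import TopologicalSorter


def run_task_graph(tasks, deps, max_workers=4):
    """Run every task once, starting it only after all its dependencies finish.

    tasks: dict mapping a task name to a callable that receives a dict of
           {dependency name: dependency result}.
    deps:  dict mapping a task name to the set of task names it depends on.
    """
    sorter = TopologicalSorter({name: deps.get(name, set()) for name in tasks})
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    results = {}
    running = {}  # future -> task name
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while sorter.is_active():
            # Submit every task whose dependencies have all completed.
            for name in sorter.get_ready():
                inputs = {d: results[d] for d in deps.get(name, set())}
                running[pool.submit(tasks[name], inputs)] = name
            # Block until at least one running task finishes.
            done, _ = wait(running, return_when=FIRST_COMPLETED)
            for future in done:
                name = running.pop(future)
                results[name] = future.result()
                sorter.done(name)  # unlocks tasks that depend on this one
    return results


# Two partial sums can run in parallel; "total" must wait for both.
tasks = {
    "sum_a": lambda _: sum(range(0, 50)),
    "sum_b": lambda _: sum(range(50, 100)),
    "total": lambda inputs: inputs["sum_a"] + inputs["sum_b"],
}
deps = {"total": {"sum_a", "sum_b"}}

results = run_task_graph(tasks, deps)
print(results["total"])  # 4950
```

The user only declares the nodes and edges; deciding what can run concurrently falls out of the dependency structure, which is the property that lets a platform parallelize the graph without the user defining a cluster.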
TileDB Cloud also provides automation for spinning up JupyterLab instances, so that users can run Jupyter notebooks without having to manually set up servers and deploy JupyterLab. This makes it very easy for users to kickstart their data analysis on TileDB Cloud.
Finally, TileDB Cloud treats UDFs and notebooks as data and thus allows users to share runnable code just as easily as they share data. This makes TileDB Cloud a powerful platform for collaboration and reproducibility of scientific results.
TileDB Cloud comes with two user-facing components:
Console: This is the TileDB Cloud UI you can use by signing up and logging into https://cloud.tiledb.com. On the console you can manage your arrays, see billing information, spin up a Jupyter notebook, etc. The Console Walkthrough tutorial is a good starting point for getting familiar with the TileDB Cloud console.
TileDB Cloud client: This is a library for programmatic API access of the various features of TileDB Cloud. You can use the client API to perform pretty much every action you can perform on the console (except for signing up and running Jupyter notebooks). You need to have a TileDB Cloud account to use the client, since it requires you to log in with your username/password or an API token.
It is generally faster to use API tokens in the TileDB Cloud client.
This page is currently under development and will be updated soon.
In this tutorial you will learn how to navigate and use the main components of TileDB Cloud, namely:
Almost everything you can do on TileDB Cloud, you can also do programmatically using the TileDB Cloud client (see Installation).
TileDB Cloud has a versatile namespace selector designed to enhance your experience in managing data and collaborations.
Upon signing up, each user is allocated a dedicated, private primary namespace. This namespace serves as your personal workspace, ensuring your data remains isolated and organized (until explicitly shared with other users or orgs).
In addition to your private namespace, TileDB offers the capability to create or join multiple organization namespaces. These spaces foster seamless teamwork by allowing users to collaborate on projects, share resources, and collectively manage data.
The namespace selector enables effortless movement between these private and organizational spaces, facilitating a smooth transition as you navigate between different projects and contexts.
You can receive notifications for various actions happening in TileDB Cloud. For instance, you can be notified when you're invited to join an organization or when someone shares an asset with you.
When you log in, the first page you see is Overview. Here you can see a summary of your assets, your current bill and your recent activity.
You can easily launch Jupyter notebook server instances within TileDB Cloud.
Launching a notebook server instance from TileDB Cloud can take around 30-40 seconds. This is due to the careful allocation of resources that underpins its performance and dependability.
You can catalogue and access a wide range of asset types in TileDB Cloud, from generic, fundamental assets like arrays, files, notebooks and UDFs to more sophisticated assets used across broader application verticals, including geospatial analysis, genomics research and machine learning.
This holistic approach ensures that whether you're working with traditional data types or delving into specialized domains, TileDB Cloud is your one-stop solution for streamlined asset management across diverse fields.
Each asset category has its own dedicated browser, where you can also filter and search for specific assets. In the asset browser you can navigate between:
My tab: Your registered assets
Shared tab: Assets that are shared with you
Public tab: Publicly available assets
Favorites tab: Assets that are marked as favorite
You can use keywords in the search field to search by name, tag or phrases included in the description of the public data and code.
Assets comprise data, code and data products that belong to you or an organization you are a member of, as well as data and code shared with you by other users and organizations.
The asset categories currently supported by TileDB Cloud are listed below. These assets can be registered and accessed in TileDB Cloud with various methods, described later in another section.
Arrays: Multi-dimensional arrays adapt to efficiently capture all data modalities, at any scale
Files: Securely manage and share any file, grouped and organized within your dataset
Notebooks: Collaborate on Jupyter notebooks, without having to move large datasets, for complete reproducibility
Dashboards: The same notebooks that power analysis can publish data visualizations for low-code analytics
UDFs: Move computations closer to your data, with cloud user-defined functions in Python, R and SQL
Task Graphs: Blend basic tasks, like slicing, and UDFs to build any distributed algorithm, plus options for GPUs
VCF: Scale genomic analyses. Ingest data in parallel and append new samples to solve the N + 1 problem
SOMA: Access and analyze large collections of single-cell experiments on object stores
Biomedical Imaging: Efficiently store and share multi-resolution microscopy images for Cloud-based visualization and analysis
ML Models: Store ML models alongside direct access to multi-modal datasets for training and prediction
Vector Search: Efficient similarity search for vector embeddings
Point Cloud: Combine millions of points, such as those from LiDAR and SONAR, in complex 3D space for analysis-ready cloud access.
Geometry: Spatial entities with precise shapes, such as point, line and polygon, for analysis in GIS and mapping applications.
Raster: Geospatial gridded data for advanced geospatial analysis.
You can preview various information from the overview tab of an asset: rich descriptions, tags, URIs, permissions and versioning information, along with some asset-specific information.
The preview tab displays important information about the asset contents.
Previews are not supported for every asset type yet, but we continue to expand the feature gradually.
Specifically for array assets, you can view detailed information regarding the schema of the array.
Most asset types come with metadata, either inherited from the asset type itself or defined by the user.
Any asset can be shared with explicit permissions via username or email. If the invited email address does not already belong to a TileDB account, the recipient is prompted to sign up first.
From the settings tab you can update your asset description and license, assign tags, rename or remove your asset, change the cloud credentials and make your asset publicly accessible.
Some assets have specific actions associated with them (highlighted by the blue buttons), such as the ability to download the asset, copy it to another namespace, launch a notebook or quickly add a description to the asset.
Adding assets to TileDB Cloud usually consists of two actions:
The creation or transformation of an existing or new asset to a multi-dimensional array. This can be done programmatically, via ingestion or straight from TileDB Cloud.
The registration of the asset from its original storage location (usually S3, Azure or another cloud storage provider) to TileDB Cloud. Again, this can happen programmatically, via ingestion or from TileDB Cloud.
It's pretty common for the creation and registration of an asset to happen simultaneously.
For example, when you upload a file from your computer to TileDB Cloud, it is automatically transformed into an array, registered in your preferred namespace and saved to your selected cloud storage provider. Voila! 🔮
The most common way to register assets is from popular cloud storage providers. To do that, you first need to set up your cloud credentials in your profile settings on TileDB Cloud.
TileDB doesn't host any of your assets on its own servers. Instead, it uses cloud-native practices to connect with all popular cloud storage providers such as S3, Azure and more.
You can view all the logged activity of assets you have access to.
You can view and edit your primary and organization profile, add cloud credentials, default storage paths, API tokens, and manage your billing.
For organizations, you can also manage your team members.
That was a quick product tour of TileDB Cloud. You can sign up for free and start using it today with $10 in free credits.
In this tutorial you will learn how to:
Sign up / sign in to TileDB Cloud
Access a public array using the TileDB-Py library
Access basic array information using the TileDB Cloud client
View the public array on the TileDB Cloud console
View a task on the TileDB Cloud console
View and edit your profile on the TileDB Cloud console
First, sign up and you will get $10 in free credit. If you run out of credits while you are still evaluating TileDB Cloud, feel free to contact us and we'll be happy to help out with more credits. Then simply sign in with the username and password you created.
It is extremely easy to access arrays registered with TileDB Cloud via the core TileDB Open Source APIs, with literally a single change: adding your username/password as configuration parameters. In this tutorial we will use TileDB-Py, TileDB-R and TileDB-Java. Let's first install them with:
To validate the installation, run:
For Python, also make sure you have already installed pandas and pyarrow, or alternatively run:
We have added a couple of public arrays that every user on TileDB Cloud can access, under our TileDB-Inc organization. Below we show an example for getting the schema of array tiledb://TileDB-Inc/quickstart_sparse, and slicing its contents:
Congratulations! You just performed your first array query to a public array in TileDB Cloud!
There are several TileDB Clients that allow you to perform pretty much any kind of task you would otherwise do via the TileDB Cloud online console (also described later in this tutorial). Let's first install it with:
Check that it installed properly as follows:
Let's get the description of the array we used above:
Great work! In the following we will see how to view useful information directly through the TileDB Cloud console.
You can view the information of public arrays by directly navigating to the array's URL on TileDB Cloud, even if you are not signed in (TileDB Cloud provides a "static view" of arrays). For example, you can click directly on https://cloud.tiledb.com/arrays/details/TileDB-Inc/quickstart_sparse/overview, which is the URL of the array we used in the previous sections. Once you do so, you will see:
Click on the Schema tab to see the array schema:
Next, let's sign in and see some activity logs. Click on Activity on the left-hand side menu, under Assets:
Here you can see the various tasks that you have performed (like slicing an array), along with other information, such as the time, cost, duration, etc. Clicking on a task provides further information (to be covered in other tutorials).
Finally, you can view your profile information by clicking Profile on the left-hand side menu.
Here you can edit your personal information, change your password, etc.
Are you interested in diving deeper into TileDB Cloud? Here is what we recommend you do next.
Quick recipes:
Create API tokens for faster programmatic login.
Set up AWS credentials so that you can access and share your arrays through TileDB Cloud.
Add billing info, so that you can use the service once you run out of credits.
Next tutorials:
Get a console walkthrough
Start learning about serverless compute
Familiarize yourself with the power of task graphs
Learn to use TileDB Cloud in specific use cases
When you sign up, you get $10 in free credits so that you can start getting acquainted with the platform.
TileDB Cloud charges in three ways:
$0.11 / CPU / hour for task time: This is the time a task takes from beginning to completion. Depending on the task and user parameters (see Array Access, Serverless SQL and Serverless UDFs), the number of CPUs used will be different. The price of $0.11 per CPU-hour is for 1 CPU; the task time is multiplied by the number of CPUs used in the task. At the end of the month, usage is summed up and rounded to the nearest second.
$0.14 / GB of data egress: This is the amount of data retrieved from the service to your client, not the data processed by a query. Egress is summed up and rounded to the GB at the end of the month.
$0.06 / CPU / hour for the notebook server duration: Notebooks are charged for how long the server is running and based on the size of the server. See Jupyter Notebooks for the instance types along with the number of CPUs for each type.
In TileDB Cloud you need to register your own data, which you must currently store in your own cloud buckets. TileDB Cloud does not offer storage hosting. You just register your already existing cloud-stored arrays (created with TileDB Open Source), and TileDB Cloud manages access and computation on those arrays without unnecessary copying and movement of the data.
All access via the web console, excluding notebook usage, is free and no charges are applied. TileDB Cloud bills you monthly.
The TileDB Cloud pricing suggests that you should minimize the data you request at your client, and instead try to either perform SQL / UDFs that reduce the data, or write the data back to cloud storage (which can be done with both SQL and UDFs).
You do not bear any extra charge for a public array or an array you shared with other users. Only the user that accesses the array gets charged for usage.
Suppose you run a UDF which takes five minutes and returns no data to your client outside TileDB Cloud. Assume also that the UDF uses 2 CPUs. You will then be charged:
Query time: (5 / 60) hours * 2 CPUs * $0.11 per CPU-hour = $0.0183
Data retrieved: $0.00
Total: $0.02
Suppose you slice 3 GB from a TileDB array using TileDB Cloud and the read access takes 45 seconds. Suppose the read query uses 16 CPUs. You will be charged:
Task time: (0.75 / 60) hours * 16 CPUs * $0.11 per CPU-hour = $0.022
Egress: 3 GB * $0.14 per GB = $0.42
Total: $0.022 + $0.42 = $0.44
Suppose you write 256 MB to a TileDB array and the write request takes 20 seconds. Suppose the write query uses 16 CPUs. You will be charged:
Query time: (0.33 / 60) hours * 16 CPUs * $0.11 per CPU-hour = $0.01
Data retrieved: 0 (no charge for writing, but see note below)
Total: $0.01
Each write may read back a few bytes (e.g., for acknowledgements), thus incurring some minimal egress. This is added to the final billing but it is negligible as compared to the query time.
Suppose you run a serverless SQL query which performs an aggregation on some TileDB array slice that takes 90 seconds. If the result is 1MB, you are charged:
Query time: (1.5 / 60) hours * 16 CPUs * $0.11 per CPU-hour = $0.044
Data retrieved: (1 / 1024) GB * $0.14 per GB = $0.00013671875
Total: $0.044
Observe that you pay only for the 1 MB you retrieved (which is negligible), regardless of how many bytes the SQL query processed. You can see here the benefit of reductions with serverless computations.
Suppose you run a small notebook instance (2 CPUs) and keep it active for 8 hours. You will be charged:
Notebook time: 8 hours * 2 CPUs * $0.06 per CPU-hour = $0.96
Suppose you run a large notebook (16 CPUs) and keep it active for 30 minutes. You will be charged:
Notebook time: 0.5 hours * 16 CPUs * $0.06 per CPU-hour = $0.48
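The arithmetic in the examples above can be reproduced with a few lines of Python. This is only a sketch of the per-task formulas as stated on this page; the function names are ours, and it ignores the end-of-month summing and rounding, so it should not be taken as the exact billing implementation.

```python
# Rates quoted in the pricing section above (USD).
TASK_RATE_PER_CPU_HOUR = 0.11
NOTEBOOK_RATE_PER_CPU_HOUR = 0.06
EGRESS_RATE_PER_GB = 0.14


def task_cost(seconds: float, cpus: int, egress_gb: float = 0.0) -> float:
    """Cost of one task (array access, SQL or UDF): CPU time plus egress."""
    cpu_hours = (seconds / 3600.0) * cpus
    return cpu_hours * TASK_RATE_PER_CPU_HOUR + egress_gb * EGRESS_RATE_PER_GB


def notebook_cost(hours: float, cpus: int) -> float:
    """Cost of keeping a notebook server with `cpus` CPUs running for `hours`."""
    return hours * cpus * NOTEBOOK_RATE_PER_CPU_HOUR


if __name__ == "__main__":
    print(f"UDF, 5 min, 2 CPUs:            ${task_cost(300, 2):.4f}")
    print(f"Slice, 45 s, 16 CPUs, 3 GB:    ${task_cost(45, 16, 3.0):.4f}")
    print(f"SQL, 90 s, 16 CPUs, 1 MB:      ${task_cost(90, 16, 1 / 1024):.4f}")
    print(f"Small notebook, 8 h, 2 CPUs:   ${notebook_cost(8, 2):.2f}")
```

Because egress is billed per GB while compute is billed per CPU-second, the slice example is dominated by the 3 GB it returns, whereas the SQL aggregation costs almost nothing beyond its compute time.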
Our pricing is experimental. Contact us to provide feedback or if you need more free credits for your evaluation.
If you do not belong to any organization, then all charges (array access, UDF, SQL, notebooks) are directed to your personal account. If you belong to one or more organizations, the following rules apply:
For SQL, UDFs and notebooks, TileDB Cloud charges the user's first organization (the one the user joined first).
For array access, TileDB Cloud determines how the user got access to the array. Suppose the user is called user and belongs to one or more organizations, but the "first" is called org. Here are the possible scenarios:
Array owned by user -> user is charged
Array owned by org -> org is charged
Array owned by another organization org2, shared with org -> org is charged
Array owned by org2, shared with org AND shared with user directly -> Error unless Default namespace to charge is configured (see below)
Array owned by another user user2 (not an organization) -> user is charged
TileDB Cloud also allows you to specify explicitly who to charge:
Every programmatic request has a namespace argument that allows you to specify the account to charge (e.g., see Selecting Who to Charge for SQL).
In your profile Settings, there is a field called Default namespace to charge.
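The array-access rules above can be written down as a small decision function. This is an illustrative sketch, not the service's actual logic: the function and parameter names are hypothetical, and two behaviors are our own assumptions because the text does not spell them out, namely that an explicit namespace argument always wins, and that the default namespace is consulted only to resolve the ambiguous "shared with both org and user" case.

```python
from typing import Optional, Set


def namespace_to_charge(
    user: str,
    first_org: Optional[str],
    owner: str,
    shared_with: Set[str],
    explicit_namespace: Optional[str] = None,
    default_namespace: Optional[str] = None,
) -> str:
    """Return the namespace billed for one array access (hypothetical helper)."""
    # Assumption: a namespace passed with the request overrides everything else.
    if explicit_namespace:
        return explicit_namespace
    if owner == user:
        return user
    if first_org is not None and owner == first_org:
        return first_org
    via_org = first_org is not None and first_org in shared_with
    via_user = user in shared_with
    if via_org and via_user:
        # Ambiguous: access was granted both through the org and directly.
        if default_namespace:
            return default_namespace
        raise ValueError("ambiguous access: set a default namespace to charge")
    if via_org:
        return first_org
    # Shared directly by another user, or a case the text does not cover.
    return user
```

For example, `namespace_to_charge("user", "org", "org2", {"org"})` returns `"org"`, while passing `{"org", "user"}` as the sharing set raises an error unless a default namespace is supplied.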
This page group provides details on the TileDB Cloud internal mechanics. You can navigate from the menu on the left, or through the following links:
This page describes the architecture of our TileDB Cloud SaaS offering.
Currently, TileDB Cloud (SaaS) runs on AWS, but in the future it will be deployed on other cloud providers. The principles around multiple cloud regions and cloud storage described in the architecture below are directly extendible to other settings (on the cloud or on premises).
Do you wish to run TileDB Cloud under your full control on premises or on the cloud? See .
The following figure outlines the TileDB Cloud architecture, which is comprised of the following components:
Automatic Redirection
Orchestration
UI Console
System State
REST Workers
Jupyter Notebooks
We explain each of those components below.
TileDB Cloud maintains compute clusters in multiple cloud regions, geographically distributed across the globe. The reason is that users may store their data in cloud buckets located in different regions, and it is always faster and more economical to send the compute to the data; that eliminates egress costs, reduces latency and increases network speeds. However, users may not know which region the array they are accessing is located in.
To facilitate sending the compute to the appropriate region, TileDB Cloud supports automatic redirection using the Cloudflare Workers service. This provides a scalable and serverless way to look up the region of the array being accessed (maintaining a fast key-value store that is always in sync with the System State) and issue a 302 temporary redirect to the HTTP request. TileDB Open Source and the TileDB Cloud client will honor the redirection and send the request to the TileDB Cloud service in the proper region (see Orchestration).
If your array lives in a cloud region unsupported by TileDB Cloud, the request is sent to us-east-1. We plan a future improvement to redirect to the nearest region instead.
Currently, automatic redirection is enabled by default, and the behavior can be controlled by using a configuration parameter. The user can also always dispatch any query directly to a specific region.
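The lookup-and-redirect behavior described above can be pictured with the following sketch. It is written in Python for illustration only (the real logic runs in Cloudflare Workers), the function name and the plain dictionary standing in for the key-value store are hypothetical, and treating an unknown array like an unsupported region is our assumption.

```python
from typing import Dict, Tuple

# Regions listed under "Currently supported regions" on this page.
SUPPORTED_REGIONS = {"us-east-1", "us-west-2", "eu-west-2", "ap-southeast-1"}
FALLBACK_REGION = "us-east-1"


def redirect_for(array_uri: str, array_regions: Dict[str, str]) -> Tuple[int, str]:
    """Return (HTTP status, target region) for a request on `array_uri`.

    `array_regions` stands in for the key-value store that maps each
    registered array to the cloud region of its bucket.
    """
    region = array_regions.get(array_uri)
    if region not in SUPPORTED_REGIONS:
        # Unsupported (or, by assumption, unknown) regions fall back to us-east-1.
        region = FALLBACK_REGION
    return 302, region


if __name__ == "__main__":
    index = {"tiledb://acme/a1": "eu-west-2", "tiledb://acme/a2": "sa-east-1"}
    print(redirect_for("tiledb://acme/a1", index))
    print(redirect_for("tiledb://acme/a2", index))
```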
In every cloud region, TileDB Cloud maintains a Kubernetes cluster that carries out all tasks, autoscaling and load balancing to match capacity with demand based upon several factors. We use the Kubernetes built-in metrics and monitoring toolchain to monitor pod memory usage and maintain an accurate picture of real-world workloads at all times.
Currently supported regions:
us-east-1
us-west-2
eu-west-2
ap-southeast-1
In each region we use a variety of compute EC2 instance types, predominantly from the m5, c5 and r5 classes.
The TileDB Cloud user interface console (https://cloud.tiledb.com) is a web app written in React that uses the REST Workers API through the same procedures and protocols as the clients. Many of the same routes are also used directly by one of the many clients, such as TileDB-Cloud-Py or TileDB-Cloud-R. The console web app autoscales based on the load, but currently it runs only inside the us-east-1 cluster.
TileDB Cloud maintains persistent state about user records, arrays, UDFs, billing, activity and more by using an always encrypted MariaDB instance. This instance is maintained in the us-east-1 region. In addition, this state is replicated and synced at all times with a read-only MariaDB instance maintained in every other supported region, in order to reduce latency for the queries executed in those regions.
TileDB Cloud's architecture is centered around a REST API Service. The service is a Go-based application which provides all of the base functionality used in TileDB Cloud, such as user management, authentication and access control, billing and monetization (via integration with Stripe), UDF execution, and serverless SQL orchestration. The REST Service is deployed in Kubernetes with a stateless design that allows for distributed orchestration and execution without the need for centralized coordination or locking.
The REST Service monitors resource usage and does its own bookkeeping in order to determine if it can service a request or if it should inform the client to retry later. By allowing the client to manage retries, and thanks to the high availability of the REST service architecture, TileDB Cloud is able to gracefully load balance and distribute the work across multiple instances.
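Client-managed retries of the kind described above usually follow an exponential backoff pattern. The following is a generic sketch of that pattern, not the retry policy of TileDB Open Source or the TileDB Cloud client: the `ServerBusy` exception, the function name and the default delays are all hypothetical.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class ServerBusy(Exception):
    """Hypothetical error meaning "the service asked the client to retry later"."""


def call_with_retries(
    request: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `request`, retrying with exponential backoff while the server is busy."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return request()
        except ServerBusy:
            if attempt == max_attempts:
                raise  # Out of attempts: surface the error to the caller.
            delay = base_delay * 2 ** (attempt - 1)
            # Jitter spreads out retries from many clients hitting the same service.
            sleep(delay + random.uniform(0, delay / 2))
    raise AssertionError("unreachable")
```

Since every worker is stateless, a retried request can land on any instance, which is what lets the client-side loop double as a simple load-spreading mechanism.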
The REST service handles the following types of serverless tasks, building upon the TileDB Open Source library:
TileDB Cloud offers hosted Jupyter notebooks by using Jupyter Docker Stacks for the base conda environments, and Jupyterhub / Zero to Jupyterhub K8S for the runtime environment. The notebooks are spawned inside Kubernetes using kubespawner to offer an isolated environment for each user with their own dedicated and persisted storage.
Currently, Jupyter notebooks can be spawned in the us-east-1 region, but soon TileDB Cloud will support multiple regions for notebooks.
TileDB Cloud runs over standard HTTP connectivity, using TCP ports 80 and 443. Connections made on port 80 are automatically redirected to HTTPS over port 443.
TileDB Cloud provides OpenID Connect support that can be used with any OpenID Connect compatible service. TileDB Cloud provides a fixed set of IP addresses used for the outbound requests that are part of the OpenID Connect sequence.
eu-west-2: 13.41.67.254, 18.134.194.194, 18.135.61.196
us-west-2: 35.81.95.218, 54.185.206.57, 54.189.31.204
us-east-1: 52.21.38.106, 54.87.160.2, 52.70.6.129
ap-southeast-1: 13.213.235.67, 54.255.255.186, 52.76.199.70
See Corporate SSO with TileDB Cloud SaaS if you are interested in enabling OIDC support for TileDB Cloud SaaS in your own environment.
In this tutorial you will learn:
How to ingest a LAS file as a 3D TileDB sparse array with PDAL
How to slice LiDAR data natively from a TileDB array
How to visualize the sliced data.
How to run SQL queries on LiDAR data directly from TileDB
We will use the well known Autzen point cloud dataset.
You can preview this tutorial as a TileDB Cloud notebook (no login is needed). You can also easily launch it within the TileDB Cloud UI console, but you will need to sign up / login to do so.
You can run all the commands of this notebook in your own client. The only changes required are:
In order to be able to create, register, and access arrays through the TileDB Cloud service, you need to set up access credentials. For S3 compatible object stores, TileDB Cloud supports both IAM Roles and Access Credential key pairs. TileDB Cloud securely stores all keys in an encrypted database and never grants your keys to any other user. TileDB Cloud uses your keys in containerized stateless workers, which are under TileDB's full control and inaccessible by any other user's code (e.g., SQL or UDFs).
Note: You can add multiple AWS keys to TileDB Cloud, register different arrays with different keys, select a key to be your default key, and revoke any key at any time.
You can add your AWS keys from the AWS credentials tab of Settings as follows:
An AWS AssumeRole policy solves the same problem that access keys solved before: it enables AWS cross-account access, so that a role in one account can access a bucket in a separate account.
When using AWS AssumeRole, temporary keys are created through the AWS Security Token Service (STS) and used by the deploying party (TileDB Cloud Console). This means that an organisation does not need to create an AWS IAM User and generate key pairs for every user logging into TileDB Cloud Console. Instead, after a User is authenticated, the AssumeRole functionality enables TileDB Cloud Console to access the bucket on behalf of the User, and the same credentials can be reused by multiple Users in the same organisation who need to access the same S3 buckets.
As an example, let's consider that the TileDB Cloud AWS account (Account A) needs to access bucket(s) in the User's AWS account (Account B). For that purpose, Account B has a bucket created. The most common setup is to create an IAM role for TileDB Cloud to use and then allow it to access a specific bucket with an AWS S3 bucket policy. Access to the bucket will be granted only to requests coming from our AWS account with our external ID.
Steps:
In TileDB Cloud Console navigate to Settings then select the tab Cloud Credentials
Click Add credentials, then select ARN Role and click Next and Next in the following step which is just a short description
Select tab Existing Role that presents the Account A ID as well as the External ID
Select tab New Role that proposes the JSON Account B can use to create the role. Please note Account A Account ID as well as the External ID
In Account B, User (or Admin) can create the bucket policy
In Account B User (or Admin) can create the role, using Account A ID and External ID
In Account B User (or Admin) has to attach the policy to the role
Obtain ARN for the new role
Then press Next in TileDB Console Add Credentials modal dialog and enter a name for the new AssumeRole Credentials and the ARN obtained in previous step
Test the connection
Example configurations have been detailed below:
AWS IAM Role:
Note: Both the AWS Principal and External ID will be provided when attempting to register the ARN role
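For orientation, a cross-account role of this kind carries a trust policy with the standard AWS shape shown below. This sketch only builds the JSON document; the account ID and external ID in the usage lines are placeholders, not real TileDB values, and you should use the ones displayed in the Add Credentials dialog. It is not a copy of the configuration that TileDB Cloud proposes.

```python
import json


def assume_role_trust_policy(principal_arn: str, external_id: str) -> dict:
    """Build an IAM trust policy that lets `principal_arn` assume the role,
    but only when it presents the agreed `external_id`."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"AWS": principal_arn},
                "Action": "sts:AssumeRole",
                "Condition": {"StringEquals": {"sts:ExternalId": external_id}},
            }
        ],
    }


if __name__ == "__main__":
    # Placeholder values for illustration only.
    policy = assume_role_trust_policy(
        "arn:aws:iam::111122223333:root", "example-external-id"
    )
    print(json.dumps(policy, indent=2))
```

The external ID condition is what prevents another party that knows the role ARN from assuming the role through the same principal.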
The target bucket may require encryption.
To enable KMS usage for the target bucket, edit the policy for the KMS key and add a statement that gives access to the role used previously.
An example configuration is provided below.
Array access refers to any read or write operation to an array registered with TileDB Cloud and referenced via its tiledb:// URI. Each array access is directed to a particular Kubernetes cluster in a specific cloud region as explained in . Then this request is assigned to a REST worker pod in an elastic and load-balanced manner. That worker uses 16 CPU cores and sets the total result buffer size for TileDB Open Source to 2GB RAM.
The REST worker performs authentication (looking up the system state), logs all activity, manages billing and monetization, and enforces the access policies. Most importantly, each REST worker is totally stateless, and requires no synchronization or locking, allowing TileDB Cloud to scale very gracefully and quickly recover from failure via retry policies.
TileDB Cloud allows you to perform any SQL query on your TileDB arrays in a serverless manner. No need to spin up or tear down any computational resources. You just need to have the TileDB Cloud client installed (see ). You get charged only for the time it took to run the SQL operation.
TileDB Cloud currently supports serverless SQL only through Python, R, and Java, but support for more languages will be added soon.
TileDB Cloud receives your SQL query and executes it on a stateless REST worker that runs a warm MariaDB instance using the storage engine. The results of the SQL query can be returned directly to you (when using TileDB-Cloud-Py version 0.4.0 or newer) or they can be written back to an S3 array of your choice (either existing or new). Any happens on the same REST instance running the SQL query to optimize performance.
When results are returned directly, they are sent to the client in either JSON or Apache Arrow format, and in Python they are converted into a Pandas dataframe. This is most suitable for small results, such as aggregations or limit queries.
Writing results to an S3 array is necessary to allow processing of SQL queries with large results, without overwhelming the user (who may be on their laptop). The user can always open the created TileDB array and fetch any desired data afterwards.
Each TileDB Cloud REST worker running a SQL query uses 16 CPUs and has a limit of 2GB RAM. Therefore, you must consider "sharding" a SQL query so that each result fits in 2GB of memory (see ). In the future, TileDB Cloud will offer flexibility in choosing the types of resources to run SQL on.
All SQL queries will time out after 15 minutes.
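One way to "shard" a query so that each piece stays within the memory and time limits above is to split a range predicate into consecutive half-open intervals and issue one query per interval. The sketch below only generates the SQL strings; the helper name, the array URI and the column name in the usage lines are hypothetical, and the COUNT(*) aggregate is just a stand-in for your own query.

```python
from datetime import datetime
from typing import List


def shard_sql_by_time(
    table: str, column: str, start: datetime, end: datetime, shards: int
) -> List[str]:
    """Split [start, end) into `shards` equal intervals and return one query each."""
    if shards < 1:
        raise ValueError("shards must be at least 1")
    step = (end - start) / shards
    queries = []
    for i in range(shards):
        lo = start + i * step
        # Pin the final bound to `end` so the shards cover the range exactly.
        hi = end if i == shards - 1 else start + (i + 1) * step
        queries.append(
            f"SELECT COUNT(*) FROM `{table}` "
            f"WHERE `{column}` >= '{lo:%Y-%m-%d %H:%M:%S}' "
            f"AND `{column}` < '{hi:%Y-%m-%d %H:%M:%S}'"
        )
    return queries


if __name__ == "__main__":
    for q in shard_sql_by_time(
        "tiledb://my_namespace/my_array",  # hypothetical array
        "pickup_time",                     # hypothetical column
        datetime(2019, 1, 1),
        datetime(2019, 1, 5),
        shards=4,
    ):
        print(q)
```

Each generated query can then be submitted as its own serverless SQL task, and the partial results combined on the client or in a follow-up task.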
TileDB Cloud allows you to build arbitrary (directed acyclic) task graphs to combine any number of different tasks into one workflow. You can combine serverless UDFs, SQL and array access along with even local execution of any function.
TileDB Cloud currently supports serverless task graphs only in Python, but support for more languages will be added soon.
The task graph is currently driven by the client. The client can be in a hosted notebook, your local laptop, or even a serverless UDF itself. The client manages the graph, and dispatches the execution of serverless jobs or local functions.
Currently, there is no node-to-node communication in a task graph. TileDB does offer server side passing of inputs and outputs without round tripping to a client. This provides the ability to efficiently pass data between stages of the task graph.
The local driver uses the Python ThreadPoolExecutor by default to drive the tasks. The default number of workers is 4 * #cores on the client machine. Python can run multiple serverless tasks concurrently, as they use asynchronous HTTP requests. Serverless tasks scale elastically: as you request more tasks to be run, TileDB Cloud launches more resources to accommodate them.
Local functions are subject to the Python Global Interpreter Lock (GIL) if the task graph uses the ThreadPoolExecutor (default).
This limits the concurrency of local functions; however, serverless functionality is minimally affected.
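To make the client-driven model concrete, here is a minimal, self-contained driver that runs a dependency graph of local functions on a ThreadPoolExecutor. It is not the TileDB Cloud task graph or Delayed API; the `Graph` type and `run_task_graph` function are hypothetical, and only the default worker count mirrors the 4 * #cores figure mentioned above.

```python
import os
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait
from typing import Any, Callable, Dict, List, Tuple

# Task name -> (function, names of the tasks whose results it takes as arguments).
Graph = Dict[str, Tuple[Callable[..., Any], List[str]]]


def run_task_graph(
    graph: Graph, max_workers: int = 4 * (os.cpu_count() or 1)
) -> Dict[str, Any]:
    """Run every task once all of its dependencies have finished."""
    results: Dict[str, Any] = {}
    pending = dict(graph)
    running = {}  # future -> task name
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while pending or running:
            # Submit every pending task whose dependencies are all done.
            ready = [
                name
                for name, (_, deps) in pending.items()
                if all(dep in results for dep in deps)
            ]
            for name in ready:
                func, deps = pending.pop(name)
                future = pool.submit(func, *(results[dep] for dep in deps))
                running[future] = name
            if not running:
                raise ValueError("cycle or missing dependency in task graph")
            # Block until at least one running task completes.
            done, _ = wait(running, return_when=FIRST_COMPLETED)
            for future in done:
                results[running.pop(future)] = future.result()
    return results


if __name__ == "__main__":
    graph: Graph = {
        "a": (lambda: 2, []),
        "b": (lambda: 3, []),
        "sum": (lambda x, y: x + y, ["a", "b"]),
        "square": (lambda s: s * s, ["sum"]),
    }
    print(run_task_graph(graph)["square"])
```

In this sketch, tasks "a" and "b" run concurrently, "sum" waits for both, and "square" waits for "sum". Threads suit tasks that mostly wait on HTTP responses, while CPU-bound local functions would contend for the interpreter lock.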
You can create an API token by navigating to API tokens in Settings, as shown below. You can also provide an expiration date for your token upon creation. You can create multiple tokens and revoke them any time.
TileDB Cloud allows users to monetize their data and code, with its full-fledged marketplace that integrates with Stripe.
You need to have a Stripe account in order to take advantage of the monetization feature.
The monetization feature is currently in beta and works for arrays, but not for UDFs and notebooks yet.
You can select any of the arrays you own and specify the following:
$ per CPU per hour: This is a cost that TileDB Cloud will apply on top of the CPU cost that it would otherwise charge the users when accessing that array (see ).
$ per GB of egress: This is a cost that TileDB Cloud will apply on top of the egress cost that it would otherwise charge the users when accessing that array (see ).
Users accessing an array that comes with extra pricing from its owner get charged for two separate costs, one from TileDB Cloud (see ) and one from the array owner. Sellers can see all their revenue details directly on the Stripe platform. Buyers can see a complete breakdown of their costs in their TileDB Cloud invoices.
This page group contains simple recipes for managing your account. You can find the contents below:
In this tutorial, you will learn:
How to use task graphs and specifically the Delayed API
How to scale your computation, significantly boosting performance, all serverless
How to eliminate egress costs
We will use public TileDB Cloud array , which stores the data from the for the year of 2019. The original data is in CSV format with a collective size of about 7GB, which is converted into a TileDB 1D sparse array compressed down to ~1GB. The selected sparse dimension is tpep_pickup_datetime, which means that the array supports very fast range slicing (and, therefore, also partitioning) on that column of the dataset.
You can preview this tutorial as a TileDB Cloud notebook (no login is needed). You can also easily launch it within the TileDB Cloud UI console, but you will need to sign up / login to do so.
You can run all the commands of this notebook in your own client. The only changes required are:
Congratulations! You have successfully completed your very first TileDB Cloud tutorial!
the TileDB Cloud client
using the TileDB Cloud client before running any notebook command
To create an array you will need to use TileDB Open Source. After you create the array, if you would like to make it "visible" to TileDB Cloud, you need to register the array in a subsequent step. However, TileDB Open Source allows you to create and register the array in a single step, with a very simple change in the way you would otherwise create the array:
Instead of using <array-uri> as you would typically in TileDB Open Source, you must use tiledb://<username>/<array-uri>. For example, if you wish to create an array at s3://my_bucket/my_array, you need to set the array URI to tiledb://my_username/s3://my_bucket/my_array and TileDB Open Source will instruct TileDB Cloud to automatically register the array as tiledb://my_username/my_array.
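The URI convention above amounts to simple string composition. The two helpers below are hypothetical and only reproduce the example from this page; in particular, deriving the registered name from the last path component is inferred from that single example rather than from a documented rule.

```python
def create_and_register_uri(namespace: str, storage_uri: str) -> str:
    """URI to pass to TileDB Open Source to create and register in one step."""
    return f"tiledb://{namespace}/{storage_uri}"


def registered_uri(namespace: str, storage_uri: str) -> str:
    """URI under which the array is expected to be registered afterwards.

    Assumption: the array name is the last path component of the storage URI.
    """
    array_name = storage_uri.rstrip("/").rsplit("/", 1)[-1]
    return f"tiledb://{namespace}/{array_name}"


if __name__ == "__main__":
    print(create_and_register_uri("my_username", "s3://my_bucket/my_array"))
    print(registered_uri("my_username", "s3://my_bucket/my_array"))
```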
You can access the array details programmatically via the TileDB Cloud API (e.g., see Utilities), or on the console as follows. You can see the array overview, which includes the physical path of the array, permissions, a description, etc. Moreover, you can see the array schema, activity logs, metadata, sharing properties, monetization and settings.
Deleting (deregistering) an array (in Settings
) does not physically delete your array from the physical cloud storage. It simply deregisters the array from TileDB Cloud. Your data will still be accessible by you outside of TileDB Cloud if you own the appropriate AWS access keys.
Renaming an array (in Settings
) is under the danger zone, because from that point onwards you (and all the users you shared the array with) will have to change your code to add the new array name. The array will still be shared and accessible by the other users, but they will need to add the new name to their code. In other words, TileDB Cloud does not currently support automatic redirection of array URIs upon renaming.
You can register a TileDB array that is already created in a cloud bucket under your control, either via the TileDB Cloud API, or in the TileDB Cloud console as follows. Make sure that you have already set up your AWS credentials, which are needed to access the array you are registering, as those will have to be selected in the drop down list in the pop up, after you click on the "add array" button in Assets -> Arrays
. You can also choose who will be the owner of the array (you or an organization you belong to). Your array will be accessible by tiledb://<namespace>/<array-name>
after it is registered, where namespace
is the owner of the array and <array-name>
is the name of the array that you can select during this registration process in the pop up.
There is absolutely no data movement upon registering an array with TileDB Cloud. Your data continues to remain in your bucket and you are the full owner of this data. TileDB Cloud just records the path and the AWS keys that can access it, so that it can govern access when you or the users you share this array with need to access it.
You can add your payment method and see your billing information, including your current balance breakdown and all invoices, by navigating to the Billing
tab of your Settings
:
This page group contains simple recipes for managing your arrays. You can find the contents below:
Adjust your account details
You can change any information in your profile from the Profile
tab in your Settings
:
When you create UDFs, notebooks, dashboards, groups, and ML models, TileDB Cloud stores each asset as a TileDB object (array or group). In order to do so, it requires you to provide a default storage path for storing those assets. You can do so from your profile settings as follows:
You can also make an array public, effectively sharing it with every user on TileDB Cloud. This can make your array discoverable by other users that wish to explore public arrays, or you can discover useful datasets that others wish to share with the world.
To make an array public, you just need to navigate to the Settings
of that array and click on Make public
as shown below. Similarly, you can always switch the array back to private mode at any time.
When making an array public, you do not get charged for the accesses that other users make on this array. You only get charged for the accesses that you make on public arrays.
TileDB Cloud allows you to extract monetary value from your arrays, by setting a dollar amount for egress (per GB of data read by another user) and CPU (per hour of compute spent when another user reads your array).
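As a rough illustration of the two pricing components, the sketch below adds an egress charge (per GB read) to a CPU charge (per hour of compute). The prices and the helper name are hypothetical, and how TileDB Cloud actually meters, rounds, or aggregates charges is not covered here.

```python
from decimal import Decimal


def access_charge(gb_read, cpu_hours, egress_price_per_gb, cpu_price_per_hour):
    """Charge for one access: egress component plus CPU component."""
    return gb_read * egress_price_per_gb + cpu_hours * cpu_price_per_hour


# Hypothetical prices set by the array owner (illustrative values only).
egress_price = Decimal("0.05")  # dollars per GB read by another user
cpu_price = Decimal("0.20")     # dollars per hour of compute

# Another user reads 10 GB and spends half an hour of compute.
charge = access_charge(Decimal("10"), Decimal("0.5"), egress_price, cpu_price)
print(charge)
```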
Before you start monetizing your arrays, you need to create an account with Stripe and connect it with your TileDB Cloud account:
Now you are ready to add pricing to your array from the array's Pricing
tab:
You can share a registered array with any other user on TileDB Cloud. Currently, you can specify array-wide policies, such as read, write and read/write. We plan to add finer-grained access policies soon. To share an array, find it on Assets -> Arrays
and either click on the sharing button located on the right end of the array card in the list, or click on the array card and navigate to the Sharing
tab. The added member will appear in the array members list, where you will be able to change the access policy or revoke access from the user. Users get notified by email when someone shares an array with them.
When sharing an array with other users, you do not get charged for the accesses that those users make. You only get charged for the accesses that you make on your arrays.
When sharing with a member, TileDB Cloud uses auto-complete to facilitate finding a username you are looking for. Similar to GitHub/GitLab, the usernames are considered public information (in contrast to full names and emails that are protected). Please email us at privacy@tiledb.com
if you wish your username to be excluded from auto-complete.
Note that the array URL when you are viewing its Overview
is shareable, and another user can view it on their browser if they have access to it. URLs of public arrays can be viewed by users, even if they are not logged in.
This page group contains simple recipes for managing your notebooks. You can find the contents below:
TileDB Cloud notebook servers include a persistent home directory for you to install custom packages.
TileDB Cloud supports pip
, conda
, cran
, and other methods of installing packages. When installing a package, you should install it in your home directory if you want it persisted after reboots.
For Python, you can use pip
to install packages into your home directory with the --user
option.
If a package from pip
is installed without the --user
flag, it will not be persisted and will not be available after a reboot.
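A minimal sketch of a persistent install from Python code: it builds the `pip install --user` command for the interpreter that is currently running. The helper name is hypothetical, and the package name is a placeholder.

```python
import sys


def pip_user_install_command(package):
    """Build a pip command that installs `package` into the user's home directory."""
    # Using sys.executable targets the same interpreter the notebook kernel runs.
    return [sys.executable, "-m", "pip", "install", "--user", package]


cmd = pip_user_install_command("numpy")
print(" ".join(cmd))
# To actually run it: subprocess.run(cmd, check=True)
```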
Both conda
and mamba
are available in the notebook environment to install any available packages. Creating a custom environment in your home directory will allow you to install and persist any packages.
If conda install
or mamba install
is used outside of a custom environment, your packages will not be persisted.
R packages can be installed from CRAN by setting the lib location to your home directory for persistence.
If package installations are done without setting the library path to your home directory, it's likely the installed packages will not be persisted.
To create a Jupyter notebook, you first need to navigate to the notebooks
tab from the sidebar under your assets
and then click the "plus" icon. That will prompt a dialog asking you to specify the notebook name, the path of the cloud storage space where the physical notebook will live and your cloud credentials that will allow TileDB Cloud to access that storage space. Once you create the notebook, you can launch it, edit and save it as you would otherwise do for any Jupyter notebook.
Alternatively, you can create a notebook from within a running JupyterLab notebook instance. Once you are in the Jupyter notebook environment, navigate to File -> New -> TileDB notebook
.
You can deregister a notebook by navigating to its Settings
and clicking on Deregister Notebook
.
Deleting (deregistering) a notebook does not physically delete your notebook from the physical cloud storage. It simply deregisters the notebook from TileDB Cloud. Your data will still be accessible by you outside of TileDB Cloud if you own the appropriate AWS access keys.
Renaming a notebook (in Settings
) is under the danger zone, because from that point onwards you (and all the users you shared the notebook with) will have to change your code to add the new notebook name. The notebook will still be shared and accessible by the other users, but they will need to add the new name to their code. In other words, TileDB Cloud does not currently support automatic redirection of notebook URIs upon renaming.
If you are using a notebook through TileDB Cloud's embedded Jupyter Lab environment, you can create any array or file inside your dedicated EBS volume. Just make sure you use ~/path/to/your/array
for array names, as the current working directory of the launched Jupyter environment might be different from your EBS home directory.
You can explore public notebooks, adding a variety of filters, from the Explore
page.
Jupyter notebooks are a common way to run Python code for data science or data exploration. You can find more information about how TileDB Cloud manages JupyterLab instances here.
You can launch a JupyterLab instance by clicking on Launch Notebook
from the left menu, selecting an image and the server size, and clicking on the Start
button. After about 30 seconds, you are all set to start writing some code. To shut down the notebook server, simply click on the Shut down
button (it also takes a few seconds).
The notebook will also automatically shut down after 30 minutes if you close the tab while the notebook server is still running.
You can share a registered notebook with any other user on TileDB Cloud. To share a notebook, find it on Assets -> Notebooks
and either click on the sharing button located on the right end of the notebook card in the list, or click on the notebook card and navigate to the Sharing
tab. The added member will appear in the notebook members list, where you will be able to change the access policy or revoke access from the user. Users get notified by email when someone shares a notebook with them.
When sharing a notebook with other users, you do not get charged when those users run the notebook. You only get charged when you run notebooks.
When sharing with a member, TileDB Cloud uses auto-complete to facilitate finding a username you are looking for. Similar to GitHub/GitLab, the usernames are considered public information (in contrast to full names and emails that are protected). Please email us at privacy@tiledb.com
if you wish your username to be excluded from auto-complete.
Note that the notebook URL when you are viewing its Overview
is shareable, and another user can view it on their browser if they have access to it. URLs of public notebooks can be viewed by users, even if they are not logged in.
You can also make a notebook public, effectively sharing it with every user on TileDB Cloud. This can make your notebook discoverable by other users that wish to explore public notebooks, or you can discover useful code that others wish to share with the world.
To make a notebook public, you just need to navigate to the Settings
of that notebook and click on Make public
as shown below. Similarly, you can always switch the notebook back to private mode at any time.
When making a notebook public, you do not get charged when other users run the notebook. You only get charged when you run public notebooks.
TileDB Cloud supports uploading notebooks from Git repositories. It is common to keep notebooks in source control, such as GitHub or GitLab, and to want them uploaded to TileDB Cloud automatically.
TileDB provides public GitHub Actions that let you upload a notebook.
Example usage includes:
GitLab CI does not currently offer a marketplace or public template support. Instead, use the example below to set up a GitLab CI pipeline.
An API token is required to access TileDB Cloud. It is recommended to store it as a secret in GitLab and expose it as the environment variable TILEDB_REST_TOKEN
.
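Inside a CI job, a script can read the token from the environment and fail early if it is missing. This is a minimal sketch; the helper name is hypothetical, and only the variable name `TILEDB_REST_TOKEN` comes from the text above.

```python
import os


def rest_token_from_env(environ=os.environ):
    """Return the API token from TILEDB_REST_TOKEN, or raise if it is not set."""
    token = environ.get("TILEDB_REST_TOKEN", "").strip()
    if not token:
        raise RuntimeError(
            "TILEDB_REST_TOKEN is not set; define it as a CI secret before running."
        )
    return token


# Demonstration with an explicit mapping instead of the real environment.
print(rest_token_from_env({"TILEDB_REST_TOKEN": "example-token"}))
```

The returned token can then be handed to whichever TileDB Cloud client the job uses.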
You can create, register and run UDFs only programmatically. See Serverless UDFs and Serverless Array UDFs for more information.
In order to be able to register a UDF you need to set up the default storage path for you and/or your organization.
Once you create and register a UDF using the TileDB Cloud client, you will be able to see the UDF on the UDFs
page in the left menu. After selecting a UDF from the list, you can see its description and basic information, preview its code, share the UDF with others and change its settings.
You can share a registered UDF with any other user on TileDB Cloud. To share a UDF, find it on Assets -> UDFs
and either click on the sharing button located on the right end of the UDF card in the list, or click on the UDF card and navigate to the Sharing
tab. The added member will appear in the UDF members list, where you will be able to change the access policy or revoke access from the user. Users get notified by email when someone shares a UDF with them.
When sharing a UDF with other users, you do not get charged for the accesses that those users make. You only get charged for the accesses that you make on your UDFs.
When sharing with a member, TileDB Cloud uses auto-complete to facilitate finding a username you are looking for. Similar to GitHub/GitLab, the usernames are considered public information (in contrast to full names and emails that are protected). Please email us at privacy@tiledb.com
if you wish your username to be excluded from auto-complete.
Note that the UDF URL when you are viewing its Overview
is shareable, and another user can view it on their browser if they have access to it. URLs of public UDFs can be viewed by users, even if they are not logged in.
This page group contains simple recipes for managing your UDFs. You can find the contents below:
To make a UDF public, you just need to navigate to the Settings
of that UDF and click on Make public
as shown below. Similarly, you can always switch the UDF back to private mode at any time.
When making a UDF public, you do not get charged for the accesses that other users make on this UDF. You only get charged for the accesses that you make on public UDFs.
You can explore public UDFs, adding a variety of filters, from the Explore
page.
This page group contains simple recipes for managing your organizations. You can find the contents below:
You can explore public arrays, adding a variety of filters, from the Explore
page.
You can add a user to your organization by clicking on Organizations
in the left menu and then selecting the organization you would like to add the user to. Click on the Members
tab, and then on the add button. On the pop up, you can choose to add a user with their TileDB Cloud username, or invite by email if the user has not signed up with TileDB Cloud yet.
Adding a member to your organization grants them access to the arrays of your organization, based on the policies you specified upon adding them.
Adding a member to your organization will affect the billing of this organization.
You can monitor and audit all activity on your organization arrays.
You can create a new organization by clicking on Organizations
in the left menu and then on the add organization button. Your new organization will appear on the screen. You can click on it to view its details, add a description, see its settings, etc.
Similarly to setting up AWS credentials for your account, you need to set up AWS keys to grant access to TileDB Cloud for the arrays that belong to your organizations. Recall that, when you register an array, you need to specify the owner of the array, which can be one of your organizations. If that is the case, you need to provide the corresponding AWS keys that you set up for that organization.
To add AWS keys to an organization, click on Organizations
in the left menu, then select the organization you are interested in, then click on AWS credentials
in the left side menu, and then on the add button. After entering the key information, the new credentials will appear in the list, where you will be able to edit or revoke them, or select the default key.
Your organization will get billed whenever:
A member slices or performs a serverless UDF on an organization array.
A member performs a serverless SQL query and specifies your organization to be billed.
As with your personal account, you can see the billing details of your organization by clicking on Organizations
in the left menu, then selecting the organization you are interested in from the list, and then the Billing
tab. On that page, you will be able to edit the billing information, and see the past invoices along with your current monthly balance.
To enjoy access to TileDB arrays registered with TileDB Cloud, you do not need an extra client. You can continue to use your favorite TileDB Open Source API or integration, tweak a couple of parameters and you are all set. See for details.
If you wish to use the TileDB Cloud serverless capabilities and perform TileDB Cloud console actions (such as viewing tasks and array descriptions, or sharing data and code) programmatically, you need to add one of our cloud clients. You can install the latest releases as follows:
The latest development version of TileDB-Cloud-Py can be installed directly from GitHub:
While the preferred method of running code samples and notebooks in this section is directly within TileDB Cloud (as all dependencies are installed for you), you can run most of the code samples and notebooks in this section locally. To run these code samples and notebooks locally, install the following dependencies:
Manage your TileDB Cloud Notebooks programmatically with the TileDB Cloud Python API
The TileDB Cloud Python API provides tools to access, manage, and share notebooks stored in the TileDB Cloud service.
Before accessing notebooks, you’ll need to log in to TileDB Cloud:
The TileDB Cloud Python API provides two ways to upload notebooks: you can provide a filename to upload, or you can provide the notebook’s contents as a string.
Likewise, notebooks can be downloaded and either saved as a file or kept as a string in memory:
Notebooks are stored as TileDB arrays, so sharing can be managed just like any other array.
Use array.list_shared_with
to see who has what kind of access to a notebook:
To share a notebook with another user, share the array. In most cases, you will want to provide a user or organization with read
or write
access to the array:
Use unshare_array
to revoke a user's or namespace's access to a notebook (“unshare”):
To make a notebook public, share it with the special public
namespace:
To make a public notebook private again, revoke access from the public
namespace:
For reads, writes, embedded SQL, any integration, and any API, you can use TileDB Open Source with only two changes:
Set the TileDB configuration parameters rest.username
and rest.password
with your TileDB Cloud username and password, or alternatively rest.token
with the API token you created.
Every array registered with TileDB Cloud must be accessed using a URI of the form tiledb://<namespace>/<array-name>
, where <namespace>
is the user or organization who owns the array and <array-name>
is the array name set by the owner upon array registration. This URI is displayed on the console when viewing the array details.
Accessing arrays by setting an API token is typically faster than using your username and password.
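The two changes can be sketched in Python as follows. The helper names are hypothetical; the configuration keys `rest.token`, `rest.username`, and `rest.password` are the ones named above, and `tiledb` refers to the TileDB Open Source Python package, which is imported lazily so that the configuration helper can be used on its own.

```python
def rest_config_params(token=None, username=None, password=None):
    """Build the REST configuration parameters, preferring the API token."""
    if token:
        return {"rest.token": token}
    if username and password:
        return {"rest.username": username, "rest.password": password}
    raise ValueError("provide either a token or both username and password")


def open_cloud_array(namespace, array_name, params, mode="r"):
    """Open tiledb://<namespace>/<array-name> with the given REST parameters."""
    import tiledb  # TileDB Open Source Python API (assumed to be installed)

    ctx = tiledb.Ctx(tiledb.Config(params))
    return tiledb.open(f"tiledb://{namespace}/{array_name}", mode=mode, ctx=ctx)


params = rest_config_params(token="my_api_token")
print(params)
# Usage (requires valid credentials and network access):
# with open_cloud_array("my_username", "my_array", params) as A:
#     print(A.schema)
```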
Here are some Python/R/Java examples, although the above changes will work with any TileDB API or integration:
You can create an array inside or outside TileDB Cloud. The benefit of creating an array with TileDB Cloud is that it will be logged for auditing purposes. Moreover, it will be registered automatically with your account upon creation.
To instruct TileDB Open Source that you are creating an array through the TileDB Cloud service, you just need a single change:
Instead of using <array-uri>
as you would typically in TileDB Open Source, you must use tiledb://<username>/<array-uri>
. For example, if you wish to create an array at s3://my_bucket/my_array
, you need to set the array URI to tiledb://my_username/s3://my_bucket/my_array
.
It is possible to programmatically register an existing array. To do that you will need to use one of our cloud clients. See: Installation
SQL queries use MyTile, our custom MariaDB storage engine, and can be executed with the TileDB Cloud client as follows.
Supposing that there is an array tiledb://user/array_name
that you have write permissions for, you can run a SQL query that writes the results to that array as follows:
You can run any SQL statement, including CTEs.
If the array does not exist, you will just need to create and pass an array schema that complies with the result schema.
We also provide an auxiliary function exec_and_fetch
, which executes a SQL query as above, but then opens the resulting output array so that it is ready for use.
If you are a member of an organization, then by default the organization is charged for your SQL query. If you would like to charge the SQL query to yourself, you just need to add one extra argument namespace
.
Each serverless SQL runs by default in an isolated environment with 2 CPUs and 2 GB of memory. You can choose an alternative runtime environment from the following list:
standard: 2 CPUs, 2 GB RAM
large: 8 CPUs, 8 GB RAM
Charges are based on the total number of CPUs selected, not on actual use.
To run a serverless SQL in a specific environment, set the resource_class
parameter to the name of the environment.
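The environment choice and its billing consequence can be sketched as a small lookup. The dictionary and helper below are illustrative only; the environment names and sizes are the ones listed above.

```python
# Environments listed above: name -> resources.
RESOURCE_CLASSES = {
    "standard": {"cpus": 2, "ram_gb": 2},
    "large": {"cpus": 8, "ram_gb": 8},
}


def billed_cpus(resource_class="standard"):
    """Number of CPUs charged for a task; charges follow the CPUs selected, not actual use."""
    try:
        return RESOURCE_CLASSES[resource_class]["cpus"]
    except KeyError:
        valid = ", ".join(sorted(RESOURCE_CLASSES))
        raise ValueError(f"unknown resource class {resource_class!r}; expected one of: {valid}")


print(billed_cpus())         # default environment
print(billed_cpus("large"))  # larger environment
```

A validated name would then be passed as the `resource_class` parameter of the serverless call.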
An asynchronous version of serverless SQL is available. The _async
version returns a future.
It is also possible to use SQL to create a new array.
See Retry Settings for more information on this topic.
The TileDB-Cloud JDBC driver provides seamless integration with popular Business Intelligence (BI) tools such as Tableau and Microsoft Power BI. With the driver, you can connect your BI tools directly to TileDB Cloud and leverage the powerful visualization and analytics capabilities of these tools.
TileDB Cloud monitors and logs all activity by you or your organizations, which you can see by clicking Assets -> Tasks
from the left menu. You can also apply various filters on them to see the ones you are interested in. For each task, you can see various useful information, such as the code associated with it, duration, cost, system logs, etc.
You can also see all logged activity for specific arrays or notebooks, by navigating to the Activity
tab of the array or notebook:
Finally, note that you can also access task information programmatically (e.g., see Listing Tasks).
When you register a UDF under one of your organizations, TileDB Cloud stores the UDF code as a 1D dense array. In order to do so, it requires you to provide a default storage path for storing that UDF array. You can do so from your organization's profile settings as follows:
TileDB Cloud supports dashboards written in Python (ipywidgets, panel) or R (shiny) via voila in Jupyter notebooks. Dashboards are built by first creating a notebook, then marking the notebook as a dashboard.
Any TileDB Cloud Notebook can be enabled as a dashboard by toggling in the notebook settings:
TileDB provides an R library, shinybg, which facilitates running R shiny applications inside Jupyter notebooks. This allows the shiny app to run as a background process and to be displayed for use as a dashboard.
Using a shiny app in TileDB works in a similar manner to a standalone shiny app. The main change is using the renderShinyApp
function provided by shinybg
to display it. Below is a reproduction of the old faithful 101 shiny app. This is also available as a dashboard and notebook in TileDB Cloud.
Ipywidgets can be used directly in a Jupyter notebook and rendered as a dashboard. Below is an example from the ipywidgets 2x2 tutorial that can be used directly as a dashboard in TileDB Cloud. This is also available as a dashboard and notebook in TileDB Cloud.
Similar to ipywidgets, panel can be used directly in a Jupyter notebook and rendered as a dashboard. Below is the example from "Build an app" in panel that can be used directly in TileDB Cloud as a dashboard. This is also available as a dashboard and notebook in TileDB Cloud.
The TileDB Cloud JDBC library is a type 4 Java Database Connectivity (JDBC) driver that allows you to connect to and interact with TileDB Cloud using the JDBC API. It provides a seamless integration between your Java applications and TileDB Cloud, enabling you to perform various database operations and execute SQL queries.
This documentation will guide you through the installation process, configuration options, and usage examples of the TileDB Cloud JDBC library.
Once you have established a connection, you can execute SQL queries against the TileDB Cloud database. To do so, create a Statement
object and call the executeQuery
method with your SQL query.
You can then handle the results like this:
To use the TileDB Cloud JDBC library, follow these steps:
Ensure you have Java Development Kit (JDK) version 11 or higher installed on your system.
Download the latest release of the TileDB Cloud JDBC library.
Add the tiledb-cloud-jdbc-x.x.x.jar
file to your project's classpath. Replace x.x.x
with the version number of the library.
To load the driver at runtime, call Class.forName("io.tiledb.TileDBCloudDriver")
The TileDB Cloud JDBC library can be configured programmatically in your Java code. The configuration options include:
apiKey(String)
: Your TileDB-Cloud API Token. (recommended)
username(String)
: Your TileDB-Cloud username
password(String)
: Your TileDB-Cloud password
rememberMe(boolean)
: Whether the JDBC driver will remember your login credentials in the future.
verifySSL(boolean)
: Whether the JDBC driver will use SSL
overwritePrevious(boolean)
: Whether the JDBC driver will overwrite existing credentials. This option can be combined with rememberMe.
Here's an example of configuring the driver where NAMESPACE
is your TileDB-Cloud namespace
You can also include your API Token in the connection String like this:
Tableau is a leading business intelligence and data visualization tool that allows users to create interactive and insightful visualizations, reports, and dashboards. Together with TileDB, Tableau empowers users to connect, explore, and visualize their data stored in TileDB Cloud, enabling seamless integration of advanced analytics and visualization capabilities into their data workflows.
To connect with Tableau, the TileDB-Cloud JDBC driver requires the use of our custom connector. Tableau has a built-in store for connectors; however, the TileDB connector is not currently available for download there and needs to be manually placed in the appropriate directory.
To do this, copy the connector
directory from the repo (ignoring the LICENSE and README files) to the following location:
MacOS
~/Documents/My\ Tableau\ Repository/Connectors
Windows
C:\Users\[Windows User]\Documents\My Tableau Repository\Connectors
In addition to the connector placement, ensure that you have also placed the TileDB-Cloud JDBC driver (.jar file)
in the appropriate directory. If the directory doesn't already exist, create it.
MacOS
~/Library/Tableau/Drivers
Windows
C:\Program Files\Tableau\Drivers
To launch Tableau, use the following commands:
MacOS
/Applications/Tableau\ Desktop\ [version].app/Contents/MacOS/Tableau -DConnectPluginsPath=/Users/<USER>/Documents/My\ Tableau\ Repository/Connectors
Windows
tableau.exe -DConnectPluginsPath=C:\Users\[Windows User]\Documents\My Tableau Repository\Connectors
Once Tableau launches, choose TileDB-Cloud JDBC, by TileDB
from the left sidebar and enter your credentials to log in.
Once you log in, choose All TileDB arrays
from the dropdown menu on the top left corner and you will be able to see all your owned and shared arrays. You can also add an array you have access to by using the Custom SQL Query
option.
Power BI, in conjunction with TileDB, offers a comprehensive business intelligence platform that enables users to connect, transform, and visualize data stored in TileDB Cloud. With Power BI's intuitive interface and robust analytics features, organizations can gain valuable insights, create interactive reports and dashboards, and make data-driven decisions effectively.
In order to use the JDBC driver for Power BI, a JDBC-to-ODBC bridge is required. We have used and tested one such bridge. Follow the instructions below:
Then, click OK and your bridge should be set up. Now, open Power BI Desktop and follow these steps:
Click on "Get Data" in the Home tab.
Select "More..." and search for "ODBC" in the data connectors list.
Choose the "ODBC" option and click "Connect".
In the ODBC dialog, select the bridged JDBC driver as a data source from the list.
By expanding the "Advanced options" section, you can insert a custom SQL query. Otherwise, click Next to see your owned, shared, and public arrays from TileDB-Cloud.
This connector is part of the TileDB-Cloud-Py package. To install it, run:
Query results are limited to 2 GB in size.
This JDBC driver uses our custom Mytile MariaDB storage engine, which comes with its own limitations as well. For more info see here.
TileDB-Cloud-Py package includes a Python DB API 2.0 connector, which aligns with PEP 249. It offers a convenient way for Python developers to connect to TileDB-Cloud and perform all necessary operations.
Usage example
To configure the connector with your credentials you just need to configure TileDB-Cloud-Py. For more details see here.
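Because the connector follows PEP 249, the usual cursor pattern applies. The sketch below demonstrates that pattern with Python's built-in `sqlite3` module as a stand-in, since it is also a PEP 249 connector and runs without credentials. The table and helper names are made up for illustration, and the TileDB connector's own connection call and parameter style are not shown here.

```python
import sqlite3


def fetch_all(connection, query, params=()):
    """Run a query through a PEP 249 connection and return all rows."""
    cursor = connection.cursor()
    try:
        cursor.execute(query, params)
        return cursor.fetchall()
    finally:
        cursor.close()


# Stand-in database: an in-memory SQLite table with two rows.
conn = sqlite3.connect(":memory:")
setup = conn.cursor()
setup.execute("CREATE TABLE readings (id INTEGER, a REAL)")
setup.executemany("INSERT INTO readings VALUES (?, ?)", [(1, 0.5), (2, 1.5)])
setup.close()

rows = fetch_all(conn, "SELECT id, a FROM readings ORDER BY id")
print(rows)
conn.close()
```

With a different PEP 249 connector, only the way `conn` is obtained should need to change; the placeholder style (`?` here) can vary between connectors.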
Below we show how to use Python UDFs in TileDB Cloud, with an example that uses numpy to compute the median of random numbers.
The UDF can receive any number of arguments, with keyword arguments supported as well.
An async version of UDFs is available, which returns a future.
If you are a member of an organization, then by default the organization is charged for your UDF. If you would like to charge the UDF task to yourself, you just need to add one extra argument namespace
.
Each UDF runs by default in an isolated environment with 2 CPUs and 2 GB of memory. You can choose an alternative runtime environment from the following list:
standard: 2 CPUs, 2 GB RAM
large: 8 CPUs, 8 GB RAM
Charges are based on the total number of CPUs selected, not on actual use.
To run a UDF in a specific environment, set the resource_class
parameter to the name of the environment.
You can register a UDF (similar to arrays) as follows:
Currently, registering a UDF is only possible via the Python or R client.
In order to be able to register a UDF you need to set up the default storage path for you and/or your organization.
See .
Below we show how to use Python/R UDFs in TileDB Cloud, with an example that computes the median on the values of attribute a
on a slice of a 2D dense array.
For Python, you just need to write your function (median
in this example) that takes as input an ordered dictionary of numpy arrays, i.e., in the form {"a" : <numpy-array>, "b" : <numpy-array>, ...}
, where the keys are attribute or dimension names of the array you are querying. The reason is that this function will be applied on an array slice; recall that the Python API of TileDB returns an ordered dictionary of numpy arrays on each attribute and dimension upon a read. Then you just use the apply
function of the TileDB Cloud client, which takes as input your function, a slice, and optionally a list of attributes (default is all attributes). Note that only the selected attributes appear in the ordered dictionary that is passed to your function.
For R, the story is similar: you just need to write your function that takes a data frame as input.
For Java, you can run an existing Python or R UDF. In this case we use a UDF that scales the values of the input argument.
All slices provided as input to the apply function are inclusive.
Multi-index queries are supported when applying an array UDF. You can pass any number of tuples or slices using a list of lists syntax (one per dimension).
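The list-of-lists shape and the inclusive bounds can be illustrated locally. The sketch below is not the client API: `expand_inclusive` is a hypothetical helper that only shows which coordinates a selection with inclusive (start, end) pairs would cover in each dimension.

```python
def expand_inclusive(ranges_per_dim):
    """Expand inclusive (start, end) pairs into coordinate lists, one per dimension."""
    expanded = []
    for ranges in ranges_per_dim:
        coords = []
        for start, end in ranges:
            coords.extend(range(start, end + 1))  # the end coordinate is included
        expanded.append(coords)
    return expanded


# One inner list per dimension; each entry is an inclusive (start, end) pair.
selection = [[(1, 2), (5, 5)], [(0, 3)]]
print(expand_inclusive(selection))  # [[1, 2, 5], [0, 1, 2, 3]]
```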
To execute an array UDF, it is not always necessary to have the array opened locally. For Python, an alternative function to apply the UDF on an array URI is provided.
An asynchronous version of the array UDFs is available.
If you are a member of an organization, then by default the organization is charged for your array UDF. If you would like to charge the array UDF to yourself, you just need to add an extra argument, namespace.
Each array UDF runs by default in an isolated environment with 2 CPUs and 2 GB of memory. You can choose an alternative runtime environment from the following list:
standard: 2 CPUs, 2 GB RAM
large: 8 CPUs, 8 GB RAM
Charges are based on the total number of CPUs selected, not on actual use.
To run an array UDF in a specific environment, set the resource_class parameter to the name of the environment.
You can register an array UDF (similar to arrays) as follows:
Currently, registering a UDF is only possible via the Python or R client.
In order to be able to register a UDF you need to set up the default storage path for you and/or your organization.
TileDB Cloud also provides multi-array UDFs, i.e., UDFs that are applied to more than one array.
You can register a multi-array UDF simply as follows:
Currently, registering a UDF is only possible via the Python or R client.
See Retry Settings.
Delayed objects can be combined into a task graph, which is typically a directed acyclic graph (DAG). The output from one function or query can be passed into another, and dependencies are automatically determined.
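The idea of a task graph can be sketched locally with the standard library's `graphlib` (Python 3.9+). This is not the Delayed API: the dependencies are declared explicitly here rather than inferred, everything runs sequentially on the local machine, and the task names and the `run_graph` helper are illustrative assumptions.

```python
from graphlib import TopologicalSorter


def run_graph(tasks, deps):
    """Run tasks in dependency order, feeding each task its parents' results."""
    results = {}
    for name in TopologicalSorter(deps).static_order():
        parent_results = [results[parent] for parent in deps.get(name, ())]
        results[name] = tasks[name](*parent_results)
    return results


tasks = {
    "load": lambda: [3, 1, 2],
    "sort": lambda values: sorted(values),
    "total": lambda values: sum(values),
    "report": lambda ordered, total: {"ordered": ordered, "total": total},
}
# Each node maps to the nodes whose output it consumes.
deps = {"load": (), "sort": ("load",), "total": ("load",), "report": ("sort", "total")}

results = run_graph(tasks, deps)
print(results["report"])  # {'ordered': [1, 2, 3], 'total': 6}
```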
The default mode of operation, realtime, is designed to return results directly to the client with an emphasis on low latency. Realtime task graphs are scheduled and executed immediately and are well suited for fast distributed workloads.
In contrast to realtime task graphs, batch task graphs are designed for large, resource-intensive asynchronous workloads. Batch task graphs are defined, uploaded, and scheduled for execution and are well suited for ingestion-style workloads.
The mode can be set for any of the APIs by passing in a mode parameter. Accepted values are BATCH or REALTIME.
Any Python/R function can be wrapped in a Delayed object, making the function executable as a future.
Besides arbitrary Python/R functions, serverless SQL queries and array UDFs can also be called with the delayed API.
It is also possible to include a generic Python function as delayed, but have it run locally instead of serverless on TileDB Cloud. This is useful for testing or for saving finalized results to your local machine, e.g., saving an image.
Any task graph created using the delayed API can be visualized with visualize(). The graph will be auto-updated by default as the computation progresses. If you wish to disable auto-updating, simply set auto_update=False as a parameter to visualize(). If you are inside a Jupyter notebook, the graph will render as a widget. If you are not in a notebook, you can set notebook=False as a parameter to render in a normal Python window.
If a function fails or you cancel it, you can manually retry the given node with the .retry method, or retry all failed nodes in a DAG with .retry_all(). Each retry call retries a node once.
If you have a task graph that is running, you can cancel it with the .cancel() function on the DAG or Delayed object.
There are cases where you might want one function to depend on another without using its results directly. A common case is when one function manipulates data stored somewhere else (on S3 or in a database). To facilitate this, we provide the function depends_on.
The above code, after the call to node_1.visualize(), produces a task graph similar to that shown below:
A lower-level Task Graph API is also provided, which gives full control over building arbitrary task graphs.
If you are a member of an organization, then by default the organization is charged for your Delayed tasks. If you would like to charge the task to yourself, you just need to add one extra argument, namespace.
You can also set who to charge for the entire task graph instead of individual Delayed objects. This is often useful when building a large task graph, to avoid having to set the extra parameter on every object. Taking the example above, you just pass namespace="my_username" to the compute call.
Batch task graphs can use a registered access credential inside a task to provide access to an object store. This is commonly used for ingestion and exporting. TileDB Cloud supports AWS IAM roles and Azure SAS tokens for this purpose. Your administrator needs to explicitly enable "allow in batch tasks" on the credential.
Realtime task graphs are driven by the client. The client dispatches each task as a separate request and may fetch and return the results. These requests run in parallel, and the maximum number of concurrent requests is controlled by how many threads are allowed to execute. This defaults to min(32, os.cpu_count() + 4) in Python. A function is provided to configure this globally, allowing a larger number of parallel requests and result downloads to the client.
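The default quoted above matches the default worker count of Python's `concurrent.futures.ThreadPoolExecutor`. The following sketch shows that calculation and how a larger thread limit permits more concurrent calls. It uses only the standard library; `default_max_workers` and `run_parallel` are illustrative names, not the client's configuration function.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def default_max_workers():
    """min(32, cpu_count + 4); os.cpu_count() may return None, so fall back to 1."""
    return min(32, (os.cpu_count() or 1) + 4)


def run_parallel(func, items, max_workers=None):
    """Apply func to each item using at most max_workers threads, keeping order."""
    workers = max_workers or default_max_workers()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(func, items))


print(default_max_workers())
print(run_parallel(lambda x: x * x, range(5), max_workers=2))  # [0, 1, 4, 9, 16]
```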
Batch task graphs allow you to specify resource requirements for CPU, memory and GPUs for every individual task. In TileDB Cloud SaaS, GPU tasks run on NVIDIA V100 GPUs.
Resources can be passed directly to any of the Delayed or Task Graph submission APIs.
The TileDB Cloud client offers several useful utilities. To use them, you must have the client installed.
TileDB Cloud allows you to log in (with your username/password or ) in such a way that the session token is cached, so you do not need to log in again for every program execution. This is done as follows:
After logging in for the first time, the TileDB Cloud client will store a session token in the configuration file $HOME/.tiledb/cloud.json, which it creates in your home directory.
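If you want to check whether a cached session file is present, a small standard-library sketch such as the following may help. Only the path comes from the text above; the helper names are hypothetical, and nothing is assumed about the contents of the file.

```python
from pathlib import Path


def session_config_path(home=None):
    """Location of the cached session file under the given (or current) home."""
    base = Path(home) if home is not None else Path.home()
    return base / ".tiledb" / "cloud.json"


def has_cached_session(home=None):
    """True if the configuration file exists as a regular file."""
    return session_config_path(home).is_file()


print(session_config_path())
print(has_cached_session())
```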
The TileDB Cloud clients have the ability to retry failed HTTP requests automatically. By default this is enabled for retrying when TileDB Cloud indicates there is not enough capacity for the request (HTTP 503 errors). For convenience we also offer the ability to disable retries or to enable more forceful retry settings.
In "forceful" mode it is possible that the client might retry requests which will always fail, such as when there is a syntax error in a SQL query. This mode should be used with care to avoid increased costs from retrying.
All built-in modes (besides disabled) will retry a request up to 10 times.
It is also possible to manually set retry conditions to suit your needs.
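The retry behaviour described above can be sketched generically as follows. This is not the client's implementation or configuration API: `HttpError` and `call_with_retries` are hypothetical names, and the sketch assumes that "up to 10 times" means up to 10 retries after the first attempt.

```python
import time


class HttpError(Exception):
    """Hypothetical error type carrying an HTTP status code."""

    def __init__(self, status):
        super().__init__(f"HTTP {status}")
        self.status = status


def call_with_retries(request, retry_statuses=(503,), max_retries=10, backoff=0.0):
    """Call request(); retry on the listed statuses, at most max_retries times."""
    attempt = 0
    while True:
        try:
            return request()
        except HttpError as err:
            # Statuses outside retry_statuses (e.g. a bad query) fail immediately.
            if err.status not in retry_statuses or attempt >= max_retries:
                raise
            attempt += 1
            time.sleep(backoff * attempt)


calls = {"count": 0}


def flaky():
    """Fails twice with 503 (no capacity), then succeeds."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise HttpError(503)
    return "ok"


print(call_with_retries(flaky))  # ok
```

Restricting `retry_statuses` to 503 mirrors the default described above; widening it corresponds to the more forceful behaviour, with the same risk of retrying requests that can never succeed.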
There are two helper functions that make it easy to create a tiledb config or context with the proper configuration needed for slicing arrays through TileDB Cloud.
You can see your user profile as follows:
You can list arrays from the cloud service, passing a variety of filters:
You can run the following to get basic information about the array, such as its description:
Array activity can be fetched programmatically as follows:
You can list tasks from the cloud service, passing a variety of filters:
For convenience, you can also see the last SQL or UDF task:
Or you can get a specific task with a given task ID (which can be found on the UI console):
In addition to registering S3-stored TileDB arrays with TileDB Cloud via the console, you can also do it programmatically as follows:
You can deregister an array as follows:
Deregistering an array will not physically delete it.
You can programmatically share a registered array, "unshare" a registered array (i.e., revoke access) and list array sharing information as follows:
recipients: can include any combination of TileDB usernames and email addresses.
actions: allowed values are READ, WRITE, EDIT, READ_ARRAY_LOGS, READ_ARRAY_INFO, READ_ARRAY_SCHEMA
You can cancel an invitation to an array as follows:
When accessing an array or group via the API, your request will be automatically routed to the instance closest to the data. If you already know the region, a compute region can be accessed directly with a configured parameter to manually bypass automatic redirection. Manually specifying the region can be helpful if you want to avoid the slight increase in latency that the redirection adds.
To access a region directly, use a domain of the form <region>.aws.api.tiledb.com.
The five domains we currently support are:
us-east-1.aws.api.tiledb.com
us-west-2.aws.api.tiledb.com
eu-west-1.aws.api.tiledb.com
eu-west-2.aws.api.tiledb.com
ap-southeast-1.aws.api.tiledb.com
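As a small illustration of the naming scheme, the sketch below builds the domain for a region and rejects regions outside the list above. The helper name is hypothetical, and the step that passes the domain to the client configuration is not shown.

```python
SUPPORTED_REGIONS = (
    "us-east-1",
    "us-west-2",
    "eu-west-1",
    "eu-west-2",
    "ap-southeast-1",
)


def region_domain(region):
    """Return <region>.aws.api.tiledb.com for one of the listed regions."""
    if region not in SUPPORTED_REGIONS:
        raise ValueError(f"unsupported region: {region!r}")
    return f"{region}.aws.api.tiledb.com"


print(region_domain("eu-west-1"))  # eu-west-1.aws.api.tiledb.com
```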
You can manually set the domain to send a request directly to a region as follows:
TileDB Cloud can convert files to and from the TileDB File representation. This allows you to store any arbitrary file as a one-dimensional dense array. Importing from and exporting to the original file format is supported directly through TileDB Cloud. The file arrays can be stored directly on an object store, such as S3.
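Conceptually, storing a file as a one-dimensional dense array means treating its bytes as consecutive cells. The sketch below illustrates only that idea with plain Python objects; it assumes one byte per cell, which is an assumption for illustration, and it does not reproduce the actual TileDB File schema or the import/export calls.

```python
def file_bytes_to_cells(data):
    """Map file contents to a 1D list of cell values (one byte per cell)."""
    return list(data)


def cells_to_file_bytes(cells):
    """Rebuild the original file contents from the 1D cell values."""
    return bytes(cells)


original = b"hello, array"
cells = file_bytes_to_cells(original)
print(len(cells))  # 12
print(cells_to_file_bytes(cells) == original)  # True
```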
In addition to registering S3-stored TileDB groups with TileDB Cloud via the console, you can also do it programmatically as follows:
You can deregister a group as follows:
Deregistering a group will not physically delete it.
You can list arrays from the cloud service, passing a variety of filters:
You can run the following to get basic information about the array, such as its description:
You can invite users to a group as follows:
recipients: can include any combination of TileDB usernames and email addresses.
array_actions: allowed values are READ, WRITE, EDIT, READ_ARRAY_LOGS, READ_ARRAY_INFO, READ_ARRAY_SCHEMA
group_actions: allowed values are READ, WRITE, EDIT
You can cancel an invitation to a group as follows:
You can invite users to an organization as follows:
recipients: can include any combination of TileDB usernames and email addresses.
role: can be one of the following values: OWNER, ADMIN, READ_WRITE, READ_ONLY
You can accept an invite by its ID as follows:
You can fetch a paginated list of invitations as follows:
organization: name or ID of the organization to filter by
array: URL-encoded name/URI of the array to filter by
group: name or ID of the group to filter by
start: start time for tasks to filter by
end: end time for tasks to filter by
page: pagination offset
per_page: pagination limit
type: invitation type, "ARRAY_SHARE" or "JOIN_ORGANIZATION"
status: filter to only return "PENDING" or "ACCEPTED"
orderby: sort by which field valid values include
Similar to , you can invite users to an array as follows:
invitation_id can be retrieved using