When it comes to scalable analytics, we observed the following challenges:
Spinning up and monitoring clusters on the cloud is cumbersome and can get expensive.
Users frequently do not know how many machines to provision in a cluster for a given workload. This results in either under-provisioning, which hurts performance, or over-provisioning, which wastes money on idle compute.
When users slice array data from TileDB Cloud only to process it further in their own compute environment, (1) they are charged for egress, and (2) performance suffers from the extra network transfer between the TileDB Cloud machines and their own machines.
TileDB Cloud allows users to access and compute on arrays in a manner that is serverless from the user's standpoint, i.e., without provisioning machines and without paying for idle compute or unnecessary egress. TileDB Cloud automatically parallelizes all tasks across thousands of machines and monitors their progress.
TileDB Cloud supports the following tasks:
Array writing or reading, e.g., basic ingestion and slicing.
SQL queries, from simple selections and filters, to aggregate queries and joins.
User-defined functions (UDFs), i.e., arbitrary code in Python, R or other languages.
Users can submit many such tasks concurrently, and TileDB Cloud will process all of them in parallel, elastically expanding and shrinking its computational resources on demand without supervision by the user. That is, TileDB Cloud provides extreme multi-tenancy by default.
Serverless SQL and array UDFs (i.e., UDFs that are applied specifically to one or more TileDB arrays) have the additional benefit that they can minimize egress by reducing the size of the returned results, especially for aggregation queries.
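To make the egress argument concrete, the following is a minimal, local sketch that compares the number of bytes returned by a raw slice with the number returned by an aggregate over the same slice. It does not call TileDB Cloud; the functions `slice_rows`, `mean_udf`, and `payload_bytes` are hypothetical names introduced only for illustration, and the in-memory list stands in for an array stored on the platform.

```python
import struct

def slice_rows(data, start, stop):
    """Return the raw slice, as a client that post-processes locally would download it."""
    return data[start:stop]

def mean_udf(data, start, stop):
    """Hypothetical array UDF: aggregate next to the data and return a single number."""
    window = data[start:stop]
    return sum(window) / len(window)

def payload_bytes(values):
    """Size of the values when serialized as little-endian 8-byte floats."""
    return len(struct.pack(f"<{len(values)}d", *values))

# Stand-in for a stored array (assumption: one million float64 cells).
data = [float(i) for i in range(1_000_000)]

raw = slice_rows(data, 0, 500_000)   # every cell leaves the platform
agg = mean_udf(data, 0, 500_000)     # only the aggregate leaves the platform

raw_bytes = payload_bytes(raw)
agg_bytes = payload_bytes([agg])
print(f"raw slice: {raw_bytes} bytes, aggregate: {agg_bytes} bytes")
```

In this sketch the aggregate result is a single 8-byte value regardless of how many cells it summarizes, whereas the raw slice grows linearly with the number of cells selected.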
Any distributed algorithm can be modeled as a directed graph, where the nodes represent atomic tasks and the edges represent task dependencies (i.e., a task cannot begin its execution before all the tasks on its incoming edges have completed their execution). TileDB Cloud supports such task graphs, which can be programmatically created by the user and submitted to the platform. TileDB Cloud is responsible for parallelizing all tasks while respecting the dependencies, and for monitoring all progress. Task graphs are a powerful tool for building sophisticated algorithms and scaling them on TileDB Cloud.
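The scheduling rule described above (a task starts only after all tasks on its incoming edges have finished) can be sketched locally with Python's standard thread pool. This is an illustrative toy scheduler, not the TileDB Cloud API: `run_task_graph`, the task names, and the `deps` mapping are all hypothetical, and a thread pool on one machine stands in for the platform's distributed workers.

```python
from concurrent.futures import ThreadPoolExecutor, wait, FIRST_COMPLETED

def run_task_graph(tasks, deps, max_workers=4):
    """Run tasks in parallel while respecting dependencies.

    tasks: dict mapping a task name to a callable that receives a dict
           of its dependencies' results.
    deps:  dict mapping a task name to the names that must finish first.
    """
    results = {}
    pending = set(tasks)
    running = {}  # future -> task name
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while pending or running:
            # A task is ready once every incoming edge has a result.
            ready = [n for n in pending
                     if all(d in results for d in deps.get(n, []))]
            for name in ready:
                pending.remove(name)
                inputs = {d: results[d] for d in deps.get(name, [])}
                running[pool.submit(tasks[name], inputs)] = name
            if not running:
                # Nothing is running and nothing can start.
                raise ValueError("cycle or missing dependency in task graph")
            done, _ = wait(list(running), return_when=FIRST_COMPLETED)
            for fut in done:
                results[running.pop(fut)] = fut.result()
    return results

# Two independent branches that join in a final task.
tasks = {
    "load_a": lambda _: [1, 2, 3],
    "load_b": lambda _: [4, 5, 6],
    "sum_a": lambda r: sum(r["load_a"]),
    "sum_b": lambda r: sum(r["load_b"]),
    "total": lambda r: r["sum_a"] + r["sum_b"],
}
deps = {
    "sum_a": ["load_a"],
    "sum_b": ["load_b"],
    "total": ["sum_a", "sum_b"],
}
results = run_task_graph(tasks, deps)
print(results["total"])
```

Here `load_a` and `load_b` have no incoming edges and can run concurrently, while `total` waits for both sums. A graph with a cycle can never make progress, so the sketch reports it as an error rather than waiting forever.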
TileDB Cloud also provides automation for spinning up JupyterLab instances, so that users can run Jupyter notebooks without having to manually set up servers and deploy JupyterLab. This makes it very easy for users to kickstart their data analysis on TileDB Cloud.
Finally, TileDB Cloud treats UDFs and notebooks as data and thus allows users to share runnable code just as easily as they share data. This makes TileDB Cloud a powerful platform for collaboration and reproducibility of scientific results.