Serverless SQL

How It Works

TileDB Cloud allows you to perform any SQL query on your TileDB arrays in a serverless manner. There is no need to spin up or tear down any computational resources; you just need to have the TileDB Cloud client installed (see Installation). You are charged only for the time it takes to run the SQL operation (see Pricing for more details).

TileDB Cloud currently supports serverless SQL only through Python, but support for more languages will be added soon.

TileDB Cloud receives your SQL query and executes it on a stateless TileDB Cloud worker that runs a warm MariaDB instance using the MyTile storage engine. The results of the SQL query can be returned directly to you (when using TileDB-Cloud-Py version 0.4.0 or newer) or they can be written back to an S3 array of your choice (either existing or new).

When results are returned directly, they are sent to the client as JSON and in Python they are converted into a Pandas dataframe. This is most suitable for small results, such as aggregations or limit queries.
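
For instance, a LIMIT query returns a handful of cells as a small dataframe directly to the client. A minimal sketch, using the same public quickstart_dense array as the examples below:

Python
import tiledb.cloud
tiledb.cloud.login(token="my_token")
# Small result: returned directly and converted to a Pandas dataframe
df = tiledb.cloud.sql.exec(
    query="select * from `tiledb://TileDB-Inc/quickstart_dense` LIMIT 5")
print(df)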

Writing results to an S3 array is necessary for processing SQL queries with large results, without overwhelming the user (who may be running on a laptop). You can always open the created TileDB array and fetch any desired data afterwards, as sketched below. To write the SQL results to an array you are permitted to access, you must first set up your AWS keys in order to use the serverless SQL functionality.
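
Once the results are written, the output array can be opened and sliced like any other TileDB array. A minimal sketch, assuming the results were written to tiledb://user/array_name (a hypothetical URI used throughout this page):

Python
import pandas
import tiledb
import tiledb.cloud
tiledb.cloud.login(token="my_token")
# Open the output array through TileDB Cloud and fetch the data on demand
with tiledb.open("tiledb://user/array_name", ctx=tiledb.cloud.Ctx()) as A:
    print(pandas.DataFrame(A[:]))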

Each TileDB Cloud worker running a SQL query has a limit of 2GB of RAM. Therefore, you must consider "sharding" your SQL query so that each result fits in 2GB of memory (see also Parallel Computing), as in the sketch after this paragraph. In the future, TileDB Cloud will offer flexibility in choosing the types of resources to run SQL on.
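
A minimal sketch of such sharding, assuming the query can be partitioned on the `rows` dimension and that the (hypothetical) range boundaries keep each partial result well under the limit:

Python
import pandas
import tiledb.cloud
tiledb.cloud.login(token="my_token")
# Hypothetical row ranges; choose boundaries so that each partial result fits in memory
shards = [(1, 2), (3, 4)]
partials = []
for lo, hi in shards:
    partials.append(tiledb.cloud.sql.exec(
        query=f"select `rows`, AVG(a) as avg_a "
              f"from `tiledb://TileDB-Inc/quickstart_dense` "
              f"where `rows` between {lo} and {hi} group by `rows`"))
# Combine the partial dataframes client-side
combined = pandas.concat(partials, ignore_index=True)
print(combined)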

Usage

Small results, such as an average, can easily be computed and returned as follows:

Python
import tiledb.cloud
tiledb.cloud.login(username="my_username", password="my_password")
# or tiledb.cloud.login(token="my_token")
# res is a Pandas dataframe
res = tiledb.cloud.sql.exec(
    query="select `rows`, AVG(a) as avg_a from `tiledb://TileDB-Inc/quickstart_dense` GROUP BY `rows`")

Supposing that there is an array tiledb://user/array_name for which you have write permissions, you can run a SQL query that writes the results to that array as follows:

Python
import tiledb.cloud
tiledb.cloud.login(username="my_username", password="my_password")
# or tiledb.cloud.login(token="my_token")
tiledb.cloud.sql.exec(
    query="select `rows`, AVG(a) as avg_a from `tiledb://TileDB-Inc/quickstart_dense` GROUP BY `rows`",
    output_uri="tiledb://user/array_name")

You can run any SQL statement, including CTEs.

Python
import tiledb.cloud
tiledb.cloud.login(username="my_username", password="my_password")
# or tiledb.cloud.login(token="my_token")
# INSERT INTO ...
tiledb.cloud.sql.exec(
    query="insert into `tiledb://user1/output1` select AVG(a) from `tiledb://user1/myarray1`")
# CTE
tiledb.cloud.sql.exec(query=
    """
    WITH t1 AS (select `dim`, `dim2`, AVG(a), AVG(a) OVER (partition by dim2) FROM `tiledb://user1/array1`)
    insert into `tiledb://user/array_name`
    select * from t1
    """)

If the output array does not exist, you just need to create and pass an array schema that complies with the schema of the result:

Python
import tiledb.cloud
import numpy as np
import pandas
tiledb.cloud.login(username="my_username", password="my_password")
# or tiledb.cloud.login(token="my_token")
# The output array has a single dimension
dom = tiledb.Domain(tiledb.Dim(name="rows", domain=(1, 4), tile=4, dtype=np.int32))
# The output array has a single attribute
attr = tiledb.Attr(name="avg_a", dtype=np.float32)
# The output array is sparse
schema = tiledb.ArraySchema(domain=dom, sparse=True, attrs=[attr])
# The following will create the output array and write the results to it
res = tiledb.cloud.sql.exec_and_fetch(
    query="select `rows`, AVG(a) as avg_a from `tiledb://TileDB-Inc/quickstart_dense` GROUP BY `rows`",
    output_schema=schema,
    output_uri="tiledb://user/s3://my_bucket/my_array")  # Notice the full path here
print(pandas.DataFrame(res[:]))

The auxiliary function exec_and_fetch (also used above) executes a SQL query just like exec, but then opens the resulting output array so that it is ready for use.

Python
import tiledb.cloud
import pandas
tiledb.cloud.login(username="my_username", password="my_password")
# or tiledb.cloud.login(token="my_token")
res = tiledb.cloud.sql.exec_and_fetch(
    query="select `rows`, AVG(a) as avg_a from `tiledb://TileDB-Inc/quickstart_dense` GROUP BY `rows`",
    output_uri="tiledb://user/array_name")
# `res` is an opened array for reading, ready for use
print(pandas.DataFrame(res[:]))

Charging an Organization

If you are a member of an organization and would like to charge the SQL query to that organization, you just need to add one extra argument, namespace.

Python
import tiledb.cloud
tiledb.cloud.login(username="my_username", password="my_password")
# or tiledb.cloud.login(token="my_token")
res = tiledb.cloud.sql.exec(
    namespace="<organization>",  # who to charge the query to
    query="select `rows`, AVG(a) as avg_a from `tiledb://TileDB-Inc/quickstart_dense` GROUP BY `rows`",
    output_uri="tiledb://user/array_name")