Serverless Array UDFs

How It Works

TileDB Cloud allows you to run lambda-like Python user-defined functions (UDFs) on array slices. More specifically, you write the code on your laptop using the TileDB Cloud client (see Installation), and your function is shipped to and executed on stateless TileDB Cloud workers. You are charged only for the time it takes to run the function and the amount of data returned to your laptop (see Pricing for more details). You do not need to launch or manage any computational resources. You can run UDFs on any array you have access to.

TileDB Cloud runs your UDF in a separate container from the one that performs the slicing from S3 using AWS keys, and the two containers communicate only via REST. Therefore, the UDF has no way to compromise security in TileDB Cloud.

Running UDFs is particularly useful if you want to perform reductions (such as a sum or an average), since the amount of data returned is very small regardless of how much data you process on TileDB Cloud.

TileDB Cloud currently supports only Python UDFs, but support for more languages will be added soon.

Each TileDB Cloud worker uses 2 CPUs and up to 2 GB of RAM for your function. Therefore, you should slice your arrays so that each slice fits in 2 GB of memory (see also Parallel Computing). In the future, TileDB Cloud will offer flexibility in choosing the types of resources to run the UDF on.
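One way to respect the memory limit is to plan the slices before submitting any UDF. The following is a minimal sketch, not part of the TileDB Cloud client: the helper name plan_row_slices, the bytes-per-cell estimate, and the choice of a budget below 2 GB (to leave room for the UDF's own intermediates) are all assumptions made for illustration. It splits an inclusive row range into inclusive sub-ranges whose estimated size stays within the budget.

```python
def plan_row_slices(row_range, col_range, bytes_per_cell, budget_bytes):
    """Split an inclusive 2D range into row bands that fit a memory budget.

    Hypothetical helper for illustration only; it is not a TileDB Cloud API.
    Returns a list of [(row_lo, row_hi), (col_lo, col_hi)] inclusive slices.
    """
    row_lo, row_hi = row_range
    col_lo, col_hi = col_range
    bytes_per_row = (col_hi - col_lo + 1) * bytes_per_cell
    # Always take at least one row per band, even if a single row is over budget.
    rows_per_band = max(1, budget_bytes // bytes_per_row)

    slices = []
    lo = row_lo
    while lo <= row_hi:
        hi = min(lo + rows_per_band - 1, row_hi)
        slices.append([(lo, hi), (col_lo, col_hi)])
        lo = hi + 1
    return slices


# Example: a 100,000 x 10,000 region of 8-byte cells, with an assumed 1 GiB budget.
bands = plan_row_slices((1, 100_000), (1, 10_000), bytes_per_cell=8,
                        budget_bytes=1024 ** 3)
print(len(bands), bands[0], bands[-1])
```

Each returned slice could then be passed to a separate UDF call. The estimate counts only the bytes of a single attribute, so it should be scaled up when several attributes are selected.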

Cloud API Access inside UDF

For convenience and security, array UDFs (like generic UDFs and serverless SQL) receive a temporary access token, which is set through environment variables for TileDB to use. TileDB supports reading configuration parameters from the environment, so TILEDB_REST_TOKEN and TILEDB_REST_SERVER_ADDRESS are set in the UDF container. This lets you access any API functionality, including running your own array slices, without passing in any API tokens or credentials.

When the UDF is finished, the temporary token is revoked and deleted. The temporary token also has a timeout of 30 minutes for additional security. If you need to run a UDF for longer than 30 minutes, please contact us.
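To confirm what the UDF container sees, you can inspect the two environment variables named above. The following is a minimal sketch: the helper name rest_settings_from_env is hypothetical, and only the variable names TILEDB_REST_TOKEN and TILEDB_REST_SERVER_ADDRESS come from this page. It deliberately reports only whether a token is present, rather than returning the token itself.

```python
import os


def rest_settings_from_env(environ=None):
    """Summarize the REST settings found in the environment.

    Hypothetical helper for illustration; it reads the two variables that
    the UDF container is described as setting and never returns the token.
    """
    if environ is None:
        environ = os.environ
    return {
        "token_present": bool(environ.get("TILEDB_REST_TOKEN")),
        "server_address": environ.get("TILEDB_REST_SERVER_ADDRESS"),
    }


# On a laptop, both entries are typically unset; inside a UDF they should be populated.
print(rest_settings_from_env())
```

Since TileDB reads these values from the environment on its own, a UDF normally does not need to touch them; the helper is only useful for debugging.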

Packages included in UDF environment

The packages available to array UDFs are the same as those for serverless UDFs.

Usage

Below we show how to use Python UDFs in TileDB Cloud, with an example that computes the median of the values of attribute a on a slice of a 2D dense array. You write a function (median in this example) that takes as input an ordered dictionary of numpy arrays, i.e., of the form {"a" : <numpy-array>, "b" : <numpy-array>, ...}, where the keys are attribute or dimension names of the array you are querying. The function takes this form because it is applied to an array slice; recall that the TileDB Python API returns an ordered dictionary of numpy arrays, one per attribute and dimension, upon a read. You then call the apply function of the TileDB Cloud client, which takes as input your function, a slice, and optionally a list of attributes (the default is all attributes). Note that only the selected attributes appear in the ordered dictionary passed to your function.

Python
import tiledb, tiledb.cloud, numpy

def median(numpy_ordered_dictionary):
    # The UDF receives the slice as an ordered dictionary of numpy arrays
    return numpy.median(numpy_ordered_dictionary["a"])

tiledb.cloud.login(username="my_username", password="my_password")
# or tiledb.cloud.login(token="my_token")

with tiledb.DenseArray("tiledb://TileDB-Inc/quickstart_dense", ctx=tiledb.cloud.Ctx()) as A:
    # apply on subarray [1,4]x[1,4]
    res = A.apply(median, [(1,4), (1,4)], attrs = ["a"])
    print(res)
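Because the UDF is an ordinary Python function, you can check it locally before shipping it to TileDB Cloud. The following is a minimal sketch that calls median on a hand-built ordered dictionary. The array values are made up for illustration and are not the contents of quickstart_dense.

```python
from collections import OrderedDict

import numpy


def median(numpy_ordered_dictionary):
    return numpy.median(numpy_ordered_dictionary["a"])


# Stand-in for what a [1,4]x[1,4] slice with attrs=["a"] would hand to the UDF.
# The values 1..16 are invented for this local check.
fake_slice = OrderedDict(a=numpy.arange(1, 17).reshape(4, 4))

local_result = median(fake_slice)
print(local_result)  # 8.5, the median of 1..16
```

A local check like this catches mistakes such as a misspelled attribute name before any remote execution time is billed.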

Multi-Index Usage

Multi-index queries are supported when applying a UDF. You can pass any number of tuples or slices using a list-of-lists syntax, with one inner list per dimension.

Python
import tiledb, tiledb.cloud, numpy

def median(numpy_ordered_dictionary):
    return numpy.median(numpy_ordered_dictionary["a"])

tiledb.cloud.login(username="my_username", password="my_password")
# or tiledb.cloud.login(token="my_token")

with tiledb.DenseArray("tiledb://TileDB-Inc/quickstart_dense", ctx=tiledb.cloud.Ctx()) as A:
    # apply on subarrays [1,2]x[1,4] and [4,4]x[1,4]
    res = A.apply(median, [[(1,2), 4], [slice(1,4)]], attrs = ["a"])
    print(res)

All slices provided as input to the apply function are inclusive.
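To make the inclusive semantics concrete, the following sketch expands one dimension's selection into the coordinates it covers. It only illustrates how the ranges should be read and is not how TileDB resolves them; the helper name expand_inclusive is hypothetical, and it assumes integer coordinates and slices with explicit start and stop and no step.

```python
def expand_inclusive(selection):
    """Expand one dimension's selection into a sorted list of coordinates.

    Hypothetical helper for illustration. Each item may be a scalar, a
    (start, end) tuple, or a slice(start, end); both ends are inclusive.
    """
    coords = set()
    for item in selection:
        if isinstance(item, slice):
            lo, hi = item.start, item.stop
        elif isinstance(item, tuple):
            lo, hi = item
        else:
            lo = hi = item
        coords.update(range(lo, hi + 1))  # +1 because the end is included
    return sorted(coords)


rows = expand_inclusive([(1, 2), 4])    # [1, 2, 4]
cols = expand_inclusive([slice(1, 4)])  # [1, 2, 3, 4]
print(rows, cols)
```

Note the difference from ordinary Python slicing, where slice(1, 4) would stop at 3.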

Apply Without Opening The Array

To execute an array UDF, you do not need to open the array locally. You can instead apply the UDF directly to an array URI.

Python
import tiledb, tiledb.cloud, numpy

def median(numpy_ordered_dictionary):
    return numpy.median(numpy_ordered_dictionary["a"])

tiledb.cloud.login(username="my_username", password="my_password")
# or tiledb.cloud.login(token="my_token")

# apply on subarrays [1,2]x[1,4] and [4,4]x[1,4]
res = tiledb.cloud.array.apply("tiledb://TileDB-Inc/quickstart_dense", median, [[(1,2), 4], [slice(1,4)]], attrs = ["a"])
print(res)

Asynchronous Execution

Similar to generic UDFs and serverless SQL, an asynchronous version of array UDFs is available. The _async variant returns a future.

Python
import tiledb, tiledb.cloud, numpy

def median(numpy_ordered_dictionary):
    # An array UDF must accept the ordered dictionary produced by the slice
    return numpy.median(numpy_ordered_dictionary["a"])

tiledb.cloud.login(username="my_username", password="my_password")
# or tiledb.cloud.login(token="my_token")

with tiledb.DenseArray("tiledb://TileDB-Inc/quickstart_dense", ctx=tiledb.cloud.Ctx()) as A:
    # apply on subarrays [1,2]x[1,4] and [4,4]x[1,4]
    # res will be a future
    res = A.apply_async(median, [[(1,2), 4], [slice(1,4)]], attrs = ["a"])
    # call res.get() to block on the results
    print(res.get())
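Futures are most useful when several calls are in flight at once and their small results are combined afterwards. The following sketch shows that submit-then-gather pattern locally. It uses ThreadPool.apply_async from the Python standard library purely as a stand-in, chosen because its result object also exposes get(); it is not the TileDB Cloud API, and the function names are hypothetical. Each task returns a (sum, count) pair so that an overall mean can be assembled from per-chunk reductions.

```python
from multiprocessing.pool import ThreadPool


def partial_sum_and_count(values):
    """Reduce one chunk to a small (sum, count) pair."""
    return sum(values), len(values)


def mean_over_chunks(chunks):
    """Submit one task per chunk, then block on each future and combine."""
    with ThreadPool(processes=4) as pool:
        # Submit everything first so the tasks can run concurrently.
        futures = [pool.apply_async(partial_sum_and_count, (chunk,))
                   for chunk in chunks]
        # get() blocks until the corresponding task has finished.
        partials = [future.get() for future in futures]
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count


overall_mean = mean_over_chunks([[1, 2, 3], [4, 5], [6]])
print(overall_mean)  # 3.5
```

Returning (sum, count) rather than a per-chunk mean keeps the combination exact when chunks have different sizes. A median, by contrast, cannot be combined this way from per-chunk medians.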