Serverless Array UDFs

Basic Usage

Below we show how to use Python/R UDFs in TileDB Cloud, with an example that computes the median on the values of attribute a on a slice of a 2D dense array.

For Python, write a function (median in this example) that takes as input an ordered dictionary of numpy arrays, i.e., of the form {"a" : <numpy-array>, "b" : <numpy-array>, ...}, where the keys are attribute or dimension names of the array you are querying. The function is applied to an array slice; recall that the TileDB Python API returns an ordered dictionary with one numpy array per attribute and dimension upon a read. Then call the apply function of the TileDB Cloud client, which takes as input your function, a slice, and optionally a list of attributes (the default is all attributes). Note that only the selected attributes appear in the ordered dictionary passed to your function.

For R, the story is similar: you just need to write your function that takes a data frame as input.

For Java, you can run an existing Python or R UDF. In this case we use a UDF that scales the values of the input argument.

import tiledb, tiledb.cloud, numpy

def median(numpy_ordered_dictionary):
  return numpy.median(numpy_ordered_dictionary["a"])

tiledb.cloud.login(username="my_username", password="my_password")
# or tiledb.cloud.login(token="my_token")

with tiledb.open("tiledb://TileDB-Inc/quickstart_dense", ctx=tiledb.cloud.Ctx()) as A:
    # apply on subarray [1,2]x[1,2]
    res = A.apply(median, [(1,2), (1,2)], attrs = ["a"])
    print(res)

All slices provided as input to the apply function are inclusive.
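Because the UDF is an ordinary Python function, it can be checked locally before it is submitted. The following is a minimal sketch, assuming the input has the ordered-dictionary shape described above; the sample values are made up for illustration and are not read from any TileDB array.

```python
from collections import OrderedDict

import numpy

def median(numpy_ordered_dictionary):
    return numpy.median(numpy_ordered_dictionary["a"])

# Hand-built stand-in for the slice result. Only attribute "a" is present,
# mirroring a call that passes attrs=["a"]. The values are illustrative.
fake_slice = OrderedDict(a=numpy.array([[1, 2], [3, 4]]))

local_result = median(fake_slice)
print(local_result)
```

A UDF that reads a key not listed in attrs would raise a KeyError on this kind of input, so a local check like this is a cheap way to catch a mismatch early.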

Multi-Index Usage

Multi-index queries are supported when applying an array UDF. You can pass any number of scalars, tuples, or slices using a list-of-lists syntax, with one inner list per dimension.

import tiledb, tiledb.cloud, numpy

def median(numpy_ordered_dictionary):
  return numpy.median(numpy_ordered_dictionary["a"])

tiledb.cloud.login(username="my_username", password="my_password")
# or tiledb.cloud.login(token="my_token")

with tiledb.open("tiledb://TileDB-Inc/quickstart_dense", ctx=tiledb.cloud.Ctx()) as A:
    # apply on subarrays [1,2]x[1,4] and [4,4]x[1,4]
    res = A.apply(median, [[(1,2), 4], [slice(1,4)]], attrs = ["a"])
    print(res)

All slices provided as input to the apply function are inclusive.
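To make the inclusive semantics concrete, the sketch below emulates the selection locally with plain numpy. It assumes a 4x4 array whose domain starts at 1 on both dimensions and whose cells hold the values 1 to 16 in row-major order; both the data and the helper inclusive_indices are hypothetical and only illustrate which cells the multi-index covers, not how TileDB evaluates it.

```python
import numpy as np

def inclusive_indices(ranges, lower_bound=1):
    """Expand scalars, (start, end) tuples, and slices into 0-based positions,
    treating both ends of every range as inclusive (hypothetical helper)."""
    positions = []
    for r in ranges:
        if isinstance(r, slice):
            start, stop = r.start, r.stop
        elif isinstance(r, tuple):
            start, stop = r
        else:
            start = stop = r
        positions.extend(range(start - lower_bound, stop - lower_bound + 1))
    return positions

# Illustrative data: values 1..16 laid out row by row.
data = np.arange(1, 17).reshape(4, 4)

rows = inclusive_indices([(1, 2), 4])    # rows 1, 2 and 4
cols = inclusive_indices([slice(1, 4)])  # columns 1 through 4, end included
selected = data[np.ix_(rows, cols)]

print(selected)
print(np.median(selected))
```

Note that slice(1, 4) covers four columns here, unlike standard Python slicing, where the end would be excluded.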

Apply Without Opening The Array

To execute an array UDF, it is not always necessary to have the array opened locally. For Python, an alternative function to apply the UDF on an array URI is provided.

import tiledb, tiledb.cloud, numpy

def median(numpy_ordered_dictionary):
  return numpy.median(numpy_ordered_dictionary["a"])

tiledb.cloud.login(username="my_username", password="my_password")
# or tiledb.cloud.login(token="my_token")

# apply on subarrays [1,2]x[1,4] and [4,4]x[1,4]
res = tiledb.cloud.array.apply("tiledb://TileDB-Inc/quickstart_dense", median, [[(1,2), 4], [slice(1,4)]], attrs = ["a"])
print(res)

Asynchronous Execution

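An asynchronous version of the array UDFs is available. It returns a future right away instead of blocking; call get() on the future to wait for the result.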

import tiledb, tiledb.cloud, numpy

def median(numpy_ordered_dictionary):
  return numpy.median(numpy_ordered_dictionary["a"])

tiledb.cloud.login(username="my_username", password="my_password")
# or tiledb.cloud.login(token="my_token")

with tiledb.open("tiledb://TileDB-Inc/quickstart_dense", ctx=tiledb.cloud.Ctx()) as A:
    # apply on subarrays [1,2]x[1,4] and [4,4]x[1,4]
    # res will be a future
    res = A.apply_async(median, [[(1,2), 4], [slice(1,4)]], attrs = ["a"])

    # call res.get() to block on the results
    print(res.get())
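The submit-then-collect pattern is most useful when several UDFs are in flight at once. The sketch below illustrates that pattern locally with the standard library's ThreadPool, whose apply_async also returns an object with a get() method. It is a local stand-in under that assumption, not the TileDB Cloud client, and the input values are made up.

```python
from collections import OrderedDict
from multiprocessing.pool import ThreadPool

import numpy

def median(numpy_ordered_dictionary):
    return numpy.median(numpy_ordered_dictionary["a"])

# Illustrative stand-ins for the results of two different slices.
slices = [
    OrderedDict(a=numpy.array([1, 2, 3])),
    OrderedDict(a=numpy.array([10, 20, 30, 40])),
]

with ThreadPool(processes=2) as pool:
    # Submit everything first; each call returns immediately.
    futures = [pool.apply_async(median, (s,)) for s in slices]
    # Block only when the results are actually needed.
    results = [f.get() for f in futures]

print(results)
```

Submitting all tasks before calling get() lets them overlap; calling get() right after each submission would serialize them.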

Selecting Who to Charge

If you are a member of an organization, then by default the organization is charged for your array UDF. To charge the array UDF to yourself instead, pass the extra argument namespace.

import tiledb, tiledb.cloud, numpy

def median(numpy_ordered_dictionary):
  return numpy.median(numpy_ordered_dictionary["a"])

tiledb.cloud.login(username="my_username", password="my_password")
# or tiledb.cloud.login(token="my_token")

with tiledb.open("tiledb://TileDB-Inc/quickstart_dense", ctx=tiledb.cloud.Ctx()) as A:
    # apply on subarray [1,2]x[1,2]
    res = A.apply(median, [(1,2), (1,2)], attrs = ["a"], namespace="my_username")
    print(res)

Resource Classes

Each Array UDF runs by default in an isolated environment with 2 CPUs and 2 GB of memory. You can choose an alternative runtime environment from the following list:

Name        Description
standard    2 CPUs, 2 GB RAM
large       8 CPUs, 8 GB RAM

Charges are based on the total number of CPUs selected, not on actual use.

To run an array UDF in a specific environment, set the resource_class parameter to the name of the environment.

import tiledb, tiledb.cloud, numpy

def median(numpy_ordered_dictionary):
  return numpy.median(numpy_ordered_dictionary["a"])

tiledb.cloud.login(username="my_username", password="my_password")
# or tiledb.cloud.login(token="my_token")

with tiledb.open("tiledb://TileDB-Inc/quickstart_dense", ctx=tiledb.cloud.Ctx()) as A:
    # apply on subarray [1,2]x[1,2]
    res = A.apply(median, [(1,2), (1,2)], attrs = ["a"], resource_class="large")
    print(res)
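Since charges follow the number of CPUs selected, it can help to pick the smallest class that meets a job's needs. The sketch below encodes the two classes listed above in a dictionary and selects one; pick_resource_class and RESOURCE_CLASSES are hypothetical names introduced for illustration and are not part of the TileDB Cloud client.

```python
# Resource classes as listed above: name -> (CPUs, RAM in GB).
RESOURCE_CLASSES = {
    "standard": (2, 2),
    "large": (8, 8),
}

def pick_resource_class(min_cpus=1, min_ram_gb=1):
    """Return the listed class with the fewest CPUs that meets both
    requirements (hypothetical helper)."""
    candidates = [
        (cpus, name)
        for name, (cpus, ram_gb) in RESOURCE_CLASSES.items()
        if cpus >= min_cpus and ram_gb >= min_ram_gb
    ]
    if not candidates:
        raise ValueError("no listed resource class satisfies the request")
    return min(candidates)[1]

small = pick_resource_class(min_cpus=2, min_ram_gb=2)
big = pick_resource_class(min_ram_gb=4)
print(small, big)
```

The returned name can then be passed as the resource_class argument shown in the example above.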

Registering Array UDFs

You can register an array UDF (similar to arrays) as follows:

import tiledb, tiledb.cloud, numpy

def median(numpy_ordered_dictionary):
  return numpy.median(numpy_ordered_dictionary["a"])

tiledb.cloud.login(username="my_username", password="my_password")
# or tiledb.cloud.login(token="my_token")

tiledb.cloud.udf.register_single_array_udf(median, name="median_test", namespace="my_username")

To register a UDF, you must first set up the default storage path for yourself and/or your organization.

Multi-Array UDFs

TileDB Cloud also provides multi-array UDFs, i.e., UDFs that are applied to more than one array.

import numpy as np
import tiledb
import tiledb.cloud

array_1 = "tiledb://TileDB-Inc/array_1"
array_2 = "tiledb://TileDB-Inc/array_2"

def median(numpy_ordered_dictionary_list):
    # When you have multiple arrays, the parameter
    # we pass in is actually a list of ordered dictionaries.
    # The list is in the order of the arrays you asked for.
    return np.median(
        numpy_ordered_dictionary_list[0]["a"] + numpy_ordered_dictionary_list[1]["a"]
    )

# The following creates the list of arrays that take part
# in the multi-array UDF. Each entry takes the array name,
# a multi-index for slicing, and a list of attributes to subselect on.
array_list = tiledb.cloud.array.ArrayList()
array_list.add(array_1, [(1, 4), (1, 4)], ["a"])
array_list.add(array_2, [(1, 2), (1, 4)], ["a"])

tiledb.cloud.login(username="my_username", password="my_password")
# or tiledb.cloud.login(token="my_token")

# This will execute `median` using as input the result of the
# slicing and subselection for each of the arrays in `array_list` 
res = tiledb.cloud.array.exec_multi_array_udf(median, array_list)

print("Median Multi-array UDF:\n{}\n".format(res))
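A multi-array UDF can be checked locally in the same way as a single-array one, by passing it a hand-built list of ordered dictionaries. This is a minimal sketch with made-up values; it assumes only the list-of-ordered-dictionaries input shape described in the comments above.

```python
from collections import OrderedDict

import numpy as np

def median(numpy_ordered_dictionary_list):
    # One ordered dictionary per array, in the order the arrays were added.
    return np.median(
        numpy_ordered_dictionary_list[0]["a"] + numpy_ordered_dictionary_list[1]["a"]
    )

# Illustrative stand-ins for the slices of two arrays.
first = OrderedDict(a=np.array([1, 2, 3, 4]))
second = OrderedDict(a=np.array([10, 20, 30, 40]))

local_result = median([first, second])
print(local_result)
```

The + here is element-wise numpy addition, so the two "a" arrays must have shapes that broadcast against each other; slices of different sizes would raise an error at this step.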

You can register a multi-array UDF simply as follows:

import tiledb, tiledb.cloud
import numpy as np

def median(numpy_ordered_dictionary_list):
    # When you have multiple arrays, the parameter
    # we pass in is actually a list of ordered dictionaries.
    # The list is in the order of the arrays you asked for.
    return np.median(
        numpy_ordered_dictionary_list[0]["a"] + numpy_ordered_dictionary_list[1]["a"]
    )

tiledb.cloud.login(username="my_username", password="my_password")
# or tiledb.cloud.login(token="my_token")

tiledb.cloud.udf.register_multi_array_udf(median, name="median_multi_array", namespace="my_username")

Retry Settings

See Retry Settings.
