Below we show how to use Python/R UDFs in TileDB Cloud, with an example that computes the median of the values of attribute a over a slice of a 2D dense array.
For Python, you just need to write a function (median in this example) that takes as input an ordered dictionary of numpy arrays, i.e., of the form {"a" : <numpy-array>, "b" : <numpy-array>, ...}, where the keys are attribute or dimension names of the array you are querying. The reason is that this function is applied to an array slice; recall that, upon a read, the Python API of TileDB returns an ordered dictionary with one numpy array per attribute and dimension. You then call the apply function of the TileDB Cloud client, which takes as input your function, a slice, and optionally a list of attributes (the default is all attributes). Note that only the selected attributes appear in the ordered dictionary passed to your function.
For R, the story is similar: you just need to write your function that takes a data frame as input.
For Java, you can run an existing Python or R UDF. In this case we use a UDF that scales the values of the input argument.
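Because a Python UDF is just a function over an ordered dictionary, it can be exercised locally before it is sent to TileDB Cloud. The following is a minimal sketch of that idea, not part of the TileDB API: nested Python lists stand in for the numpy arrays that a real read returns, `statistics.median` stands in for `numpy.median`, and the values in `fake_slice` are made up.

```python
from collections import OrderedDict
from statistics import median as list_median


def median(ordered_dictionary):
    # Flatten the 2D slice of attribute "a" and take its median.
    # Nested lists stand in for the 2D numpy array a real read returns.
    rows = ordered_dictionary["a"]
    flat = [value for row in rows for value in row]
    return list_median(flat)


# Hand-built stand-in for the slice [1,2]x[1,2]; the values are made up.
fake_slice = OrderedDict([("a", [[1, 2], [5, 6]])])
result = median(fake_slice)
print(result)
```

Testing against a hand-built dictionary keeps the feedback loop short, since no login or network round trip is involved.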
import tiledb, tiledb.cloud, numpy
def median(numpy_ordered_dictionary):
    return numpy.median(numpy_ordered_dictionary["a"])
tiledb.cloud.login(username="my_username", password="my_password")
# or tiledb.cloud.login(token="my_token")
with tiledb.open("tiledb://TileDB-Inc/quickstart_dense", ctx=tiledb.cloud.Ctx()) as A:
    # apply on subarray [1,2]x[1,2]
    res = A.apply(median, [(1,2), (1,2)], attrs = ["a"])
    print(res)
// Login by using a TileDBLogin object
TileDBClient tileDBClient = new TileDBClient(
    new TileDBLogin(null,
        null,
        "<TILEDB_API_TOKEN>",
        true,
        true,
        true));
// Create a TileDBUDF object
TileDBUDF tileDBUDF = new TileDBUDF(tileDBClient, "TileDB-Inc");
// add the query ranges
ArrayList<BigDecimal> range1 = new ArrayList<>();
range1.add(BigDecimal.valueOf(1));
range1.add(BigDecimal.valueOf(4));
ArrayList<BigDecimal> range2 = new ArrayList<>();
range2.add(BigDecimal.valueOf(1));
range2.add(BigDecimal.valueOf(4));
QueryRanges queryRanges = new QueryRanges();
queryRanges.addRangesItem(range1);
queryRanges.addRangesItem(range2);
// add the arguments for the UDF
HashMap<String,Object> argumentsForArrayUDF = new HashMap<>();
argumentsForArrayUDF.put("attr", "rows");
argumentsForArrayUDF.put("scale", 9);
// Create an array udf
MultiArrayUDF multiArrayUDF = new MultiArrayUDF();
multiArrayUDF.setUdfInfoName("TileDB-Inc/array-udf");
multiArrayUDF.setRanges(queryRanges);
// execute. Could also use: executeSingleArrayArrow(), executeSingleArrayJSON(), executeSingleArrayJSONArray()
System.out.println(tileDBUDF.executeSingleArray(multiArrayUDF, argumentsForArrayUDF, "tiledb://TileDB-Inc/quickstart_sparse", "TileDB-Inc"));
All slices provided as input to the apply function are inclusive.
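To make the inclusive convention concrete, here is a small sketch that counts the cells selected by one (start, end) pair per dimension. The helper name `cells_in_subarray` is hypothetical and exists only for illustration.

```python
def cells_in_subarray(ranges):
    """Number of cells selected by one inclusive (start, end) pair per dimension."""
    count = 1
    for start, end in ranges:
        count *= end - start + 1  # +1 because both endpoints are included
    return count


# The subarray [1,2]x[1,2] covers cells (1,1), (1,2), (2,1) and (2,2).
print(cells_in_subarray([(1, 2), (1, 2)]))

# For contrast, Python's half-open range(1, 2) contains a single value.
print(len(range(1, 2)))
```

This differs from ordinary Python slicing, where the end of a range is excluded.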
Multi-Index Usage
Multi-index queries are supported when applying an array UDF. You can pass any number of tuples or slices using a list of lists syntax (one per dimension).
import tiledb, tiledb.cloud, numpy
def median(numpy_ordered_dictionary):
    return numpy.median(numpy_ordered_dictionary["a"])
tiledb.cloud.login(username="my_username", password="my_password")
# or tiledb.cloud.login(token="my_token")
with tiledb.open("tiledb://TileDB-Inc/quickstart_dense", ctx=tiledb.cloud.Ctx()) as A:
    # apply on subarrays [1,2]x[1,4] and [4,4]x[1,4]
    res = A.apply(median, [[(1,2), 4], [slice(1,4)]], attrs = ["a"])
    print(res)
All slices provided as input to the apply function are inclusive.
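The multi-index `[[(1,2), 4], [slice(1,4)]]` used above can be read as a cross product of the per-dimension selections. The sketch below spells that reading out in plain Python. It is an illustration of the semantics described in the code comment, not the client's actual implementation, and the helper names are hypothetical.

```python
from itertools import product


def to_inclusive_range(item):
    """Normalize a tuple, slice, or scalar into an inclusive (start, end) pair."""
    if isinstance(item, slice):
        return (item.start, item.stop)  # stop is treated as inclusive here
    if isinstance(item, tuple):
        return item
    return (item, item)  # a scalar selects a single coordinate


def expand_multi_index(multi_index):
    """Return every subarray selected by a list-of-lists multi-index."""
    per_dimension = [[to_inclusive_range(i) for i in dim] for dim in multi_index]
    return list(product(*per_dimension))


subarrays = expand_multi_index([[(1, 2), 4], [slice(1, 4)]])
print(subarrays)  # [((1, 2), (1, 4)), ((4, 4), (1, 4))]
```

The two resulting entries correspond to the subarrays [1,2]x[1,4] and [4,4]x[1,4] named in the example.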
Apply Without Opening The Array
To execute an array UDF, it is not always necessary to have the array opened locally. For Python, an alternative function to apply the UDF on an array URI is provided.
import tiledb, tiledb.cloud, numpy
def median(numpy_ordered_dictionary):
    return numpy.median(numpy_ordered_dictionary["a"])
tiledb.cloud.login(username="my_username", password="my_password")
# or tiledb.cloud.login(token="my_token")
# apply on subarrays [1,2]x[1,4] and [4,4]x[1,4]
res = tiledb.cloud.array.apply("tiledb://TileDB-Inc/quickstart_dense", median, [[(1,2), 4], [slice(1,4)]], attrs = ["a"])
print(res)
Asynchronous Execution
An asynchronous version of the array UDFs is available.
import tiledb, tiledb.cloud, numpy
def median(numpy_ordered_dictionary):
    return numpy.median(numpy_ordered_dictionary["a"])
tiledb.cloud.login(username="my_username", password="my_password")
# or tiledb.cloud.login(token="my_token")
with tiledb.open("tiledb://TileDB-Inc/quickstart_dense", ctx=tiledb.cloud.Ctx()) as A:
    # apply on subarrays [1,2]x[1,4] and [4,4]x[1,4]
    # res will be a future
    res = A.apply_async(median, [[(1,2), 4], [slice(1,4)]], attrs = ["a"])
    # call res.get() to block on the results
    print(res.get())
library(tiledbcloud)
myfunc <- function(df) {
  median(as.vector(df[["a"]]))
}
tiledbcloud::login(username="my_username", password="my_password")
# or tiledbcloud::login(token="my_token")
res <- delayed_array_udf(
  array="TileDB-Inc/quickstart_dense",
  udf=myfunc,
  selectedRanges=list(cbind(1,2), cbind(1,2)),
  attrs=c("a")
)
# call compute(res) to block on the results
o <- compute(res)
print(o)
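The submit-then-block pattern in the examples above is the same one that Python's standard library offers through futures. The sketch below is a local analogy only: it uses `concurrent.futures` rather than TileDB Cloud, and `future.result()` plays the role that `res.get()` plays for `apply_async`.

```python
from concurrent.futures import ThreadPoolExecutor
from statistics import median

with ThreadPoolExecutor(max_workers=1) as executor:
    # submit() returns immediately with a handle to the pending result
    future = executor.submit(median, [1, 2, 3, 4])
    # ... other work can happen here while the task runs ...
    result = future.result()  # blocks until the result is ready

print(result)
```

Submitting several calls before blocking on any of them is what lets independent UDF invocations overlap.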
Selecting Who to Charge
If you are a member of an organization, then by default the organization is charged for your array UDF. If you would like to charge the array UDF to yourself, you just need to add an extra argument, namespace.
import tiledb, tiledb.cloud, numpy
def median(numpy_ordered_dictionary):
    return numpy.median(numpy_ordered_dictionary["a"])
tiledb.cloud.login(username="my_username", password="my_password")
# or tiledb.cloud.login(token="my_token")
with tiledb.open("tiledb://TileDB-Inc/quickstart_dense", ctx=tiledb.cloud.Ctx()) as A:
    # apply on subarray [1,2]x[1,2]
    res = A.apply(median, [(1,2), (1,2)], attrs = ["a"], namespace="my_username")
    print(res)
// Login by using a TileDBLogin object
TileDBClient tileDBClient = new TileDBClient(
    new TileDBLogin(null,
        null,
        "<TILEDB_API_TOKEN>",
        true,
        true,
        true));
// Create a TileDBUDF object. The second param is the namespace to be charged
TileDBUDF tileDBUDF = new TileDBUDF(tileDBClient, "TileDB-Inc");
// add the query ranges
ArrayList<BigDecimal> range1 = new ArrayList<>();
range1.add(BigDecimal.valueOf(1));
range1.add(BigDecimal.valueOf(4));
ArrayList<BigDecimal> range2 = new ArrayList<>();
range2.add(BigDecimal.valueOf(1));
range2.add(BigDecimal.valueOf(4));
QueryRanges queryRanges = new QueryRanges();
queryRanges.addRangesItem(range1);
queryRanges.addRangesItem(range2);
// add the arguments for the UDF
HashMap<String,Object> argumentsForArrayUDF = new HashMap<>();
argumentsForArrayUDF.put("attr", "rows");
argumentsForArrayUDF.put("scale", 9);
// Create an array udf
MultiArrayUDF multiArrayUDF = new MultiArrayUDF();
multiArrayUDF.setUdfInfoName("TileDB-Inc/array-udf");
multiArrayUDF.setRanges(queryRanges);
// execute. Could also use: executeSingleArrayArrow(), executeSingleArrayJSON(), executeSingleArrayJSONArray()
System.out.println(tileDBUDF.executeSingleArray(multiArrayUDF, argu
Resource Classes
Each Array UDF runs by default in an isolated environment with 2 CPUs and 2 GB of memory. You can choose an alternative runtime environment from the following list:
Name        Description
standard    2 CPUs, 2 GB RAM
large       8 CPUs, 8 GB RAM
Charges are based on the total number of CPUs selected, not on actual use.
To run an array UDF in a specific environment, set the resource_class parameter to the name of the environment.
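Since charges follow the number of CPUs selected rather than actual use, it can make sense to request the smallest class that fits the job. The sketch below encodes the two classes listed above and picks one by memory requirement. The `RESOURCE_CLASSES` dictionary and the `pick_resource_class` helper are hypothetical and not part of the TileDB Cloud client.

```python
# The two classes listed above, written out as data.
RESOURCE_CLASSES = {
    "standard": {"cpus": 2, "memory_gb": 2},
    "large": {"cpus": 8, "memory_gb": 8},
}


def pick_resource_class(required_memory_gb):
    """Return the listed class with the fewest CPUs that has enough memory."""
    candidates = [
        (spec["cpus"], name)
        for name, spec in RESOURCE_CLASSES.items()
        if spec["memory_gb"] >= required_memory_gb
    ]
    if not candidates:
        raise ValueError("no listed resource class has enough memory")
    return min(candidates)[1]


print(pick_resource_class(1.5))  # standard
print(pick_resource_class(6))    # large
```

The returned name is what would be passed as the resource_class argument.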
import tiledb, tiledb.cloud, numpy
def median(numpy_ordered_dictionary):
    return numpy.median(numpy_ordered_dictionary["a"])
tiledb.cloud.login(username="my_username", password="my_password")
# or tiledb.cloud.login(token="my_token")
with tiledb.open("tiledb://TileDB-Inc/quickstart_dense", ctx=tiledb.cloud.Ctx()) as A:
    # apply on subarray [1,2]x[1,2]
    res = A.apply(median, [(1,2), (1,2)], attrs = ["a"], resource_class="large")
    print(res)
// Login by using a TileDBLogin object
TileDBClient tileDBClient = new TileDBClient(
    new TileDBLogin(null,
        null,
        "<TILEDB_API_TOKEN>",
        true,
        true,
        true));
// Create a TileDBUDF object
TileDBUDF tileDBUDF = new TileDBUDF(tileDBClient, "TileDB-Inc");
// add the query ranges
ArrayList<BigDecimal> range1 = new ArrayList<>();
range1.add(BigDecimal.valueOf(1));
range1.add(BigDecimal.valueOf(4));
ArrayList<BigDecimal> range2 = new ArrayList<>();
range2.add(BigDecimal.valueOf(1));
range2.add(BigDecimal.valueOf(4));
QueryRanges queryRanges = new QueryRanges();
queryRanges.addRangesItem(range1);
queryRanges.addRangesItem(range2);
// add the arguments for the UDF
HashMap<String,Object> argumentsForArrayUDF = new HashMap<>();
argumentsForArrayUDF.put("attr", "rows");
argumentsForArrayUDF.put("scale", 9);
// Create an array udf
MultiArrayUDF multiArrayUDF = new MultiArrayUDF();
multiArrayUDF.setUdfInfoName("TileDB-Inc/array-udf");
multiArrayUDF.setRanges(queryRanges);
// Set resource class
multiArrayUDF.setResourceClass("large");
// execute. Could also use: executeSingleArrayArrow(), executeSingleArrayJSON(), executeSingleArrayJSONArray()
System.out.println(tileDBUDF.executeSingleArray(multiArrayUDF, argumentsForArrayUDF, "tiledb://TileDB-Inc/quickstart_sparse", "TileDB-Inc"));
Registering Array UDFs
You can register an array UDF (similar to arrays) as follows:
Currently, registering a UDF is only possible via the Python or R client.
In order to be able to register a UDF you need to set up the default storage path for you and/or your organization.
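A minimal Python sketch of registering the median function from the earlier examples follows. It rests on an assumption: that the client exposes `tiledb.cloud.udf.register_single_array_udf` as the single-array counterpart of the `register_multi_array_udf` call shown later on this page, with the same `name` and `namespace` arguments. The wrapper `register_median` and the default name `"median"` are hypothetical.

```python
def median(numpy_ordered_dictionary):
    # numpy is imported inside the function so that defining the UDF
    # does not require numpy to be installed locally.
    import numpy
    return numpy.median(numpy_ordered_dictionary["a"])


def register_median(namespace, name="median"):
    # Assumption: register_single_array_udf mirrors register_multi_array_udf.
    import tiledb.cloud
    tiledb.cloud.udf.register_single_array_udf(median, name=name, namespace=namespace)


# Usage, after tiledb.cloud.login(...):
# register_median("my_username")
```

The registration call is left commented out because it needs a logged-in session and a configured default storage path.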
Multi-Array UDFs
TileDB Cloud also provides multi-array UDFs, i.e., UDFs that are applied to more than one array.
import numpy as np
import tiledb
import tiledb.cloud
array_1 = "tiledb://TileDB-Inc/array_1"
array_2 = "tiledb://TileDB-Inc/array_2"
def median(numpy_ordered_dictionary_list):
    # When you have multiple arrays, the parameter
    # we pass in is actually a list of ordered dictionaries.
    # The list is in the order of the arrays you asked for.
    return (
        np.median(numpy_ordered_dictionary_list[0]["a"] + numpy_ordered_dictionary_list[1]["a"])
    )
# The following will create the list of arrays to take part
# in the multi-array UDF. Each has as input the array name,
# a multi-index for slicing and a list of attributes to subselect on.
array_list = tiledb.cloud.array.ArrayList()
array_list.add(array_1, [(1, 4), (1, 4)], ["a"])
array_list.add(array_2, [(1, 2), (1, 4)], ["a"])
tiledb.cloud.login(username="my_username", password="my_password")
# or tiledb.cloud.login(token="my_token")
# This will execute `median` using as input the result of the
# slicing and subselection for each of the arrays in `array_list`
res = tiledb.cloud.array.exec_multi_array_udf(median, array_list)
print("Median Multi-array UDF:\n{}\n".format(res))
library(tiledbcloud)
myfunc <- function(df1, df2) {
  median(as.vector(df1[["a"]])) + median(as.vector(df2[["a"]]))
}
# The following will create the list of arrays to take part in
# the multi-array UDF. Each has as input the array name, a
# multi-index for slicing and a list of attributes to subselect on.
details1 <- tiledbcloud::UDFArrayDetails$new(
  uri="tiledb://TileDB-Inc/quickstart_dense",
  ranges=QueryRanges$new(
    layout=Layout$new('row-major'),
    ranges=list(cbind(1,4),cbind(1,4))
  ),
  buffers=list("a")
)
details2 <- tiledbcloud::UDFArrayDetails$new(
  uri="tiledb://TileDB-Inc/quickstart_sparse",
  ranges=QueryRanges$new(
    layout=Layout$new('row-major'),
    ranges=list(cbind(1,2),cbind(1,4))
  ),
  buffers=list("a")
)
tiledbcloud::login(username="my_username", password="my_password")
# or tiledbcloud::login(token="my_token")
# This will execute `myfunc` using as input the result of the
# slicing and subselection for each of the arrays in `array_list`
res <- tiledbcloud::execute_multi_array_udf(
  array_list=list(details1, details2),
  udf=myfunc
)
print(res)
// Login by using a TileDBLogin object
TileDBClient tileDBClient = new TileDBClient(
    new TileDBLogin(null,
        null,
        "<TILEDB_API_TOKEN>",
        true,
        true,
        true));
// Create a TileDBUDF object
TileDBUDF tileDBUDF = new TileDBUDF(tileDBClient, "TileDB-Inc");
// add query ranges
ArrayList<BigDecimal> range1 = new ArrayList<>();
range1.add(BigDecimal.valueOf(1));
range1.add(BigDecimal.valueOf(4));
ArrayList<BigDecimal> range2 = new ArrayList<>();
range2.add(BigDecimal.valueOf(1));
range2.add(BigDecimal.valueOf(4));
QueryRanges queryRanges = new QueryRanges();
queryRanges.addRangesItem(range1);
queryRanges.addRangesItem(range2);
// set name of the udf to use
MultiArrayUDF multiArrayUDF = new MultiArrayUDF();
multiArrayUDF.setUdfInfoName("TileDB-Inc/multi-array-udf");
// create a list of the arrays to participate in the udf
List<UDFArrayDetails> arrays = new ArrayList<>();
//array1
UDFArrayDetails array1 = new UDFArrayDetails();
array1.setUri("tiledb://TileDB-Inc/dense-array");
array1.setRanges(queryRanges);
array1.setBuffers(Arrays.asList("rows", "cols", "a1"));
arrays.add(array1);
//array2
UDFArrayDetails array2 = new UDFArrayDetails();
array2.setUri("tiledb://TileDB-Inc/quickstart_dense");
array2.setRanges(queryRanges);
array2.setBuffers(Arrays.asList("rows", "cols", "a"));
arrays.add(array2);
multiArrayUDF.setArrays(arrays);
// add arguments
HashMap<String,Object> arguments = new HashMap<>();
arguments.put("attr1", "a1");
arguments.put("attr2", "a");
// print result. Could also use: executeMultiArrayJSON(), executeMultiArrayJSONArray(), executeMultiArrayArrow()
System.out.println(tileDBUDF.executeMultiArray(multiArrayUDF, arguments));
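The input contract of a Python multi-array UDF, a list with one ordered dictionary per array in the order the arrays were added, can also be checked locally. In the sketch below, flat Python lists stand in for numpy arrays, the data is made up, and the function name `median_of_sum` is hypothetical. Element-wise addition, which `+` performs on numpy arrays, only works when the two slices have compatible shapes, so the sketch checks the lengths explicitly.

```python
from collections import OrderedDict
from statistics import median as list_median


def median_of_sum(ordered_dictionary_list):
    # One ordered dictionary per array, in the order the arrays were listed.
    first = ordered_dictionary_list[0]["a"]
    second = ordered_dictionary_list[1]["a"]
    if len(first) != len(second):
        raise ValueError("slices must have the same number of cells")
    # Element-wise sum, mirroring what `+` does on numpy arrays.
    summed = [x + y for x, y in zip(first, second)]
    return list_median(summed)


# Hand-built stand-ins for the slices of two arrays; the values are made up.
slices = [
    OrderedDict([("a", [1, 2, 3, 4])]),
    OrderedDict([("a", [10, 20, 30, 40])]),
]
result = median_of_sum(slices)
print(result)
```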
You can register a multi-array UDF simply as follows:
import tiledb, tiledb.cloud
import numpy as np
def median(numpy_ordered_dictionary_list):
    # When you have multiple arrays, the parameter
    # we pass in is actually a list of ordered dictionaries.
    # The list is in the order of the arrays you asked for.
    return (
        np.median(numpy_ordered_dictionary_list[0]["a"] + numpy_ordered_dictionary_list[1]["a"])
    )
tiledb.cloud.login(username="my_username", password="my_password")
# or tiledb.cloud.login(token="my_token")
tiledb.cloud.udf.register_multi_array_udf(median, name="median_multi_array", namespace="my_username")