Below we show how to use Python/R UDFs in TileDB Cloud, with an example that computes the median on the values of attribute a on a slice of a 2D dense array.
For Python, you just need to write your function (median in this example) so that it takes as input an ordered dictionary of numpy arrays, i.e., in the form {"a" : <numpy-array>, "b" : <numpy-array>, ...}, where the keys are attribute or dimension names of the array you are querying. The reason is that this function will be applied on an array slice; recall that, upon a read, the Python API of TileDB returns an ordered dictionary with one numpy array per attribute and dimension. Then you just use the apply function of the TileDB Cloud client, which takes as input your function, a slice, and optionally a list of attributes (default is all attributes). Note that only the selected attributes appear in the ordered dictionary passed to your function, so the function should reference only those.
For R, the story is similar: you just need to write your function that takes a data frame as input.
For Java, you can run an existing Python or R UDF. In this case we use a UDF that scales the values of the input argument.
import tiledb, tiledb.cloud, numpy

def median(numpy_ordered_dictionary):
    return numpy.median(numpy_ordered_dictionary["a"])

tiledb.cloud.login(username="my_username", password="my_password")
# or tiledb.cloud.login(token="my_token")

with tiledb.open("tiledb://TileDB-Inc/quickstart_dense", ctx=tiledb.cloud.Ctx()) as A:
    # apply on subarray [1,2]x[1,2]
    res = A.apply(median, [(1,2), (1,2)], attrs = ["a"])
    print(res)

// Login by using a TileDBLogin object
TileDBClient tileDBClient = new TileDBClient(
    new TileDBLogin(null, null, "<TILEDB_API_TOKEN>", true, true, true));

// Create a TileDBUDF object
TileDBUDF tileDBUDF = new TileDBUDF(tileDBClient, "TileDB-Inc");

// add the query ranges
ArrayList<BigDecimal> range1 = new ArrayList<>();
range1.add(BigDecimal.valueOf(1));
range1.add(BigDecimal.valueOf(4));
ArrayList<BigDecimal> range2 = new ArrayList<>();
range2.add(BigDecimal.valueOf(1));
range2.add(BigDecimal.valueOf(4));
QueryRanges queryRanges = new QueryRanges();
queryRanges.addRangesItem(range1);
queryRanges.addRangesItem(range2);

// add the arguments for the UDF
HashMap<String,Object> argumentsForArrayUDF = new HashMap<>();
argumentsForArrayUDF.put("attr", "rows");
argumentsForArrayUDF.put("scale", 9);

// Create an array udf
MultiArrayUDF multiArrayUDF = new MultiArrayUDF();
multiArrayUDF.setUdfInfoName("TileDB-Inc/array-udf");
multiArrayUDF.setRanges(queryRanges);

// execute. Could also use: executeSingleArrayArrow(), executeSingleArrayJSON(), executeSingleArrayJSONArray()
System.out.println(tileDBUDF.executeSingleArray(multiArrayUDF, argumentsForArrayUDF, "tiledb://TileDB-Inc/quickstart_sparse", "TileDB-Inc"));
All slices provided as input to the apply function are inclusive.
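To make the input contract and the inclusive-slice rule concrete, here is a minimal local sketch that does not touch TileDB Cloud at all. The 4x4 data, the read_inclusive helper, and the use of plain nested lists in place of numpy arrays are assumptions made purely for illustration, so that the sketch runs with the standard library only.

```python
from collections import OrderedDict
from statistics import median as stdlib_median

# Hypothetical 4x4 dense array with 1-based coordinates; values are made up.
DATA = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
]

def read_inclusive(data, row_range, col_range):
    """Mimic an inclusive read: (1, 2) selects coordinates 1 and 2."""
    r0, r1 = row_range
    c0, c1 = col_range
    # Convert 1-based inclusive bounds to Python's 0-based half-open slices.
    cells = [row[c0 - 1:c1] for row in data[r0 - 1:r1]]
    # Only the selected attribute ("a") appears in the dictionary.
    return OrderedDict([("a", cells)])

def median(ordered_dictionary):
    """Same role as the UDF above, flattening the nested lists by hand."""
    flat = [value for row in ordered_dictionary["a"] for value in row]
    return stdlib_median(flat)

result = median(read_inclusive(DATA, (1, 2), (1, 2)))
print(result)
```

Running a UDF against a hand-built dictionary like this is a cheap way to check its logic before submitting it for remote execution.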
Multi-Index Usage
Multi-index queries are supported when applying an array UDF. You can pass any number of tuples or slices using a list-of-lists syntax (one inner list per dimension).
import tiledb, tiledb.cloud, numpy

def median(numpy_ordered_dictionary):
    return numpy.median(numpy_ordered_dictionary["a"])

tiledb.cloud.login(username="my_username", password="my_password")
# or tiledb.cloud.login(token="my_token")

with tiledb.open("tiledb://TileDB-Inc/quickstart_dense", ctx=tiledb.cloud.Ctx()) as A:
    # apply on subarrays [1,2]x[1,4] and [4,4]x[1,4]
    res = A.apply(median, [[(1,2), 4], [slice(1,4)]], attrs = ["a"])
    print(res)
All slices provided as input to the apply function are inclusive.
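The following sketch shows one way to read the multi-index syntax: each inner list holds the selectors for one dimension, and the selected subarrays are the combinations of those selectors across dimensions. The to_range and expand helpers are hypothetical names introduced here for illustration; they are not part of the TileDB Cloud client, and treating slice(1, 4) as inclusive simply mirrors the note above.

```python
from itertools import product

def to_range(selector):
    """Normalize one selector to an inclusive (start, end) pair."""
    if isinstance(selector, slice):
        return (selector.start, selector.stop)  # treated as inclusive here
    if isinstance(selector, tuple):
        return selector
    return (selector, selector)  # a scalar selects a single coordinate

def expand(multi_index):
    """Return every subarray as a tuple of per-dimension inclusive ranges."""
    per_dim = [[to_range(s) for s in dim] for dim in multi_index]
    return list(product(*per_dim))

subarrays = expand([[(1, 2), 4], [slice(1, 4)]])
print(subarrays)
# [((1, 2), (1, 4)), ((4, 4), (1, 4))]
```

The two tuples correspond to the subarrays [1,2]x[1,4] and [4,4]x[1,4] named in the comment of the example above.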
Apply Without Opening The Array
To execute an array UDF, it is not always necessary to have the array opened locally. For Python, an alternative function to apply the UDF on an array URI is provided.
import tiledb, tiledb.cloud, numpy

def median(numpy_ordered_dictionary):
    return numpy.median(numpy_ordered_dictionary["a"])

tiledb.cloud.login(username="my_username", password="my_password")
# or tiledb.cloud.login(token="my_token")

# apply on subarrays [1,2]x[1,4] and [4,4]x[1,4]
res = tiledb.cloud.array.apply("tiledb://TileDB-Inc/quickstart_dense", median, [[(1,2), 4], [slice(1,4)]], attrs = ["a"])
print(res)
Asynchronous Execution
An asynchronous version of the array UDFs is available.
import tiledb, tiledb.cloud, numpy, random

def median(numpy_ordered_dictionary):
    return numpy.median(numpy_ordered_dictionary["a"])

tiledb.cloud.login(username="my_username", password="my_password")
# or tiledb.cloud.login(token="my_token")

with tiledb.open("tiledb://TileDB-Inc/quickstart_dense", ctx=tiledb.cloud.Ctx()) as A:
    # apply on subarrays [1,2]x[1,4] and [4,4]x[1,4]
    # res will be a future
    res = A.apply_async(median, [[(1,2), 4], [slice(1,4)]], attrs = ["a"])
    # call res.get() to block on the results
    print(res.get())

library(tiledbcloud)

myfunc <- function(df) {
  median(as.vector(df[["a"]]))
}

tiledbcloud::login(username="my_username", password="my_password")
# or tiledbcloud::login(token="my_token")

res <- delayed_array_udf(
  array="TileDB-Inc/quickstart_dense",
  udf=myfunc,
  selectedRanges=list(cbind(1,2), cbind(1,2)),
  attrs=c("a")
)

# call compute(res) to block on the results
o <- compute(res)
print(o)
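For readers unfamiliar with the submit-then-block pattern, the sketch below reproduces it locally with ThreadPool from Python's standard library, whose apply_async and get happen to have the same names as the calls used above. This is only an analogy: nothing here runs on TileDB Cloud, and slow_median with its artificial delay is a hypothetical stand-in for a remote UDF.

```python
from multiprocessing.pool import ThreadPool
from statistics import median
import time

def slow_median(values):
    time.sleep(0.1)  # pretend the work happens somewhere else
    return median(values)

with ThreadPool(processes=1) as pool:
    # apply_async returns immediately with a future-like object
    res = pool.apply_async(slow_median, ([1, 2, 5, 6],))
    # other local work can overlap with the pending computation
    overlapped = sum(range(10))
    # get() blocks until the result is available
    value = res.get()

print(overlapped, value)
```

The point of the asynchronous form is the middle step: the caller is free to do other work, or submit further UDFs, before blocking on any single result.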
Selecting Who to Charge
If you are a member of an organization, then by default the organization is charged for your array UDF. To charge the array UDF to yourself instead, add the extra argument namespace.
import tiledb, tiledb.cloud, numpy

def median(numpy_ordered_dictionary):
    return numpy.median(numpy_ordered_dictionary["a"])

tiledb.cloud.login(username="my_username", password="my_password")
# or tiledb.cloud.login(token="my_token")

with tiledb.open("tiledb://TileDB-Inc/quickstart_dense", ctx=tiledb.cloud.Ctx()) as A:
    # apply on subarray [1,2]x[1,2]
    res = A.apply(median, [(1,2), (1,2)], attrs = ["a"], namespace="my_username")
    print(res)

// Login by using a TileDBLogin object
TileDBClient tileDBClient = new TileDBClient(
    new TileDBLogin(null, null, "<TILEDB_API_TOKEN>", true, true, true));

// Create a TileDBUDF object. The second param is the namespace to be charged
TileDBUDF tileDBUDF = new TileDBUDF(tileDBClient, "TileDB-Inc");

// add the query ranges
ArrayList<BigDecimal> range1 = new ArrayList<>();
range1.add(BigDecimal.valueOf(1));
range1.add(BigDecimal.valueOf(4));
ArrayList<BigDecimal> range2 = new ArrayList<>();
range2.add(BigDecimal.valueOf(1));
range2.add(BigDecimal.valueOf(4));
QueryRanges queryRanges = new QueryRanges();
queryRanges.addRangesItem(range1);
queryRanges.addRangesItem(range2);

// add the arguments for the UDF
HashMap<String,Object> argumentsForArrayUDF = new HashMap<>();
argumentsForArrayUDF.put("attr", "rows");
argumentsForArrayUDF.put("scale", 9);

// Create an array udf
MultiArrayUDF multiArrayUDF = new MultiArrayUDF();
multiArrayUDF.setUdfInfoName("TileDB-Inc/array-udf");
multiArrayUDF.setRanges(queryRanges);

// execute. Could also use: executeSingleArrayArrow(), executeSingleArrayJSON(), executeSingleArrayJSONArray()
System.out.println(tileDBUDF.executeSingleArray(multiArrayUDF, argu
Resource Classes
Each array UDF runs by default in an isolated environment with 2 CPUs and 2 GB of memory. You can choose an alternative runtime environment from the following list:
Name        Description
standard    2 CPUs, 2 GB RAM
large       8 CPUs, 8 GB RAM
Charges are based on the total number of CPUs selected, not on actual use.
To run an array UDF in a specific environment, set the resource_class parameter to the name of the environment.
import tiledb, tiledb.cloud, numpy

def median(numpy_ordered_dictionary):
    return numpy.median(numpy_ordered_dictionary["a"])

tiledb.cloud.login(username="my_username", password="my_password")
# or tiledb.cloud.login(token="my_token")

with tiledb.open("tiledb://TileDB-Inc/quickstart_dense", ctx=tiledb.cloud.Ctx()) as A:
    # apply on subarray [1,2]x[1,2]
    res = A.apply(median, [(1,2), (1,2)], attrs = ["a"], resource_class="large")
    print(res)

// Login by using a TileDBLogin object
TileDBClient tileDBClient = new TileDBClient(
    new TileDBLogin(null, null, "<TILEDB_API_TOKEN>", true, true, true));

// Create a TileDBUDF object
TileDBUDF tileDBUDF = new TileDBUDF(tileDBClient, "TileDB-Inc");

// add the query ranges
ArrayList<BigDecimal> range1 = new ArrayList<>();
range1.add(BigDecimal.valueOf(1));
range1.add(BigDecimal.valueOf(4));
ArrayList<BigDecimal> range2 = new ArrayList<>();
range2.add(BigDecimal.valueOf(1));
range2.add(BigDecimal.valueOf(4));
QueryRanges queryRanges = new QueryRanges();
queryRanges.addRangesItem(range1);
queryRanges.addRangesItem(range2);

// add the arguments for the UDF
HashMap<String,Object> argumentsForArrayUDF = new HashMap<>();
argumentsForArrayUDF.put("attr", "rows");
argumentsForArrayUDF.put("scale", 9);

// Create an array udf
MultiArrayUDF multiArrayUDF = new MultiArrayUDF();
multiArrayUDF.setUdfInfoName("TileDB-Inc/array-udf");
multiArrayUDF.setRanges(queryRanges);

// Set resource class
multiArrayUDF.setResourceClass("large");

// execute. Could also use: executeSingleArrayArrow(), executeSingleArrayJSON(), executeSingleArrayJSONArray()
System.out.println(tileDBUDF.executeSingleArray(multiArrayUDF, argumentsForArrayUDF, "tiledb://TileDB-Inc/quickstart_sparse", "TileDB-Inc"));
Registering Array UDFs
You can register an array UDF (similar to arrays) as follows:
import tiledb, tiledb.cloud, numpy

def median(numpy_ordered_dictionary):
    return numpy.median(numpy_ordered_dictionary["a"])

tiledb.cloud.login(username="my_username", password="my_password")
# or tiledb.cloud.login(token="my_token")

tiledb.cloud.udf.register_single_array_udf(median, name="median_test", namespace="my_username")

library(tiledbcloud)

myfunc <- function(df) {
  median(as.vector(df[["a"]]))
}

tiledbcloud::login(username="my_username", password="my_password")
# or tiledbcloud::login(token="my_token")

tiledbcloud::register_udf(namespace="my_username", name="median_test", type="single_array", func=myfunc)
Currently, registering a UDF is only possible via the Python or R client.
To register a UDF, you first need to set up the default storage path for yourself and/or your organization.
Multi-Array UDFs
TileDB Cloud also provides multi-array UDFs, i.e., UDFs that are applied to more than one array.
import numpy as np
import tiledb
import tiledb.cloud

array_1 = "tiledb://TileDB-Inc/array_1"
array_2 = "tiledb://TileDB-Inc/array_2"

def median(numpy_ordered_dictionary_list):
    # When you have multiple arrays, the parameter
    # we pass in is actually a list of ordered dictionaries.
    # The list is in the order of the arrays you asked for.
    return np.median(
        numpy_ordered_dictionary_list[0]["a"] + numpy_ordered_dictionary_list[1]["a"]
    )

# The following will create the list of arrays to take part
# in the multi-array UDF. Each has as input the array name,
# a multi-index for slicing and a list of attributes to subselect on.
array_list = tiledb.cloud.array.ArrayList()
array_list.add(array_1, [(1, 4), (1, 4)], ["a"])
array_list.add(array_2, [(1, 2), (1, 4)], ["a"])

tiledb.cloud.login(username="my_username", password="my_password")
# or tiledb.cloud.login(token="my_token")

# This will execute `median` using as input the result of the
# slicing and subselection for each of the arrays in `array_list`
res = tiledb.cloud.array.exec_multi_array_udf(median, array_list)
print("Median Multi-array UDF:\n{}\n".format(res))

library(tiledbcloud)

myfunc <- function(df1, df2) {
  median(as.vector(df1[["a"]])) + median(as.vector(df2[["a"]]))
}

# The following will create the list of arrays to take part in
# the multi-array UDF. Each has as input the array name, a
# multi-index for slicing and a list of attributes to subselect on.
details1 <- tiledbcloud::UDFArrayDetails$new(
  uri="tiledb://TileDB-Inc/quickstart_dense",
  ranges=QueryRanges$new(
    layout=Layout$new('row-major'),
    ranges=list(cbind(1,4),cbind(1,4))
  ),
  buffers=list("a")
)
details2 <- tiledbcloud::UDFArrayDetails$new(
  uri="tiledb://TileDB-Inc/quickstart_sparse",
  ranges=QueryRanges$new(
    layout=Layout$new('row-major'),
    ranges=list(cbind(1,2),cbind(1,4))
  ),
  buffers=list("a")
)

tiledbcloud::login(username="my_username", password="my_password")
# or tiledbcloud::login(token="my_token")

# This will execute `myfunc` using as input the result of the
# slicing and subselection for each of the arrays in the list
res <- tiledbcloud::execute_multi_array_udf(
  array_list=list(details1, details2),
  udf=myfunc
)
print(res)

// Login by using a TileDBLogin object
TileDBClient tileDBClient = new TileDBClient(
    new TileDBLogin(null, null, "<TILEDB_API_TOKEN>", true, true, true));

// Create a TileDBUDF object
TileDBUDF tileDBUDF = new TileDBUDF(tileDBClient, "TileDB-Inc");

// add query ranges
ArrayList<BigDecimal> range1 = new ArrayList<>();
range1.add(BigDecimal.valueOf(1));
range1.add(BigDecimal.valueOf(4));
ArrayList<BigDecimal> range2 = new ArrayList<>();
range2.add(BigDecimal.valueOf(1));
range2.add(BigDecimal.valueOf(4));
QueryRanges queryRanges = new QueryRanges();
queryRanges.addRangesItem(range1);
queryRanges.addRangesItem(range2);

// set name of the udf to use
MultiArrayUDF multiArrayUDF = new MultiArrayUDF();
multiArrayUDF.setUdfInfoName("TileDB-Inc/multi-array-udf");

// create a list of the arrays to participate in the udf
List<UDFArrayDetails> arrays = new ArrayList<>();

// array1
UDFArrayDetails array1 = new UDFArrayDetails();
array1.setUri("tiledb://TileDB-Inc/dense-array");
array1.setRanges(queryRanges);
array1.setBuffers(Arrays.asList("rows", "cols", "a1"));
arrays.add(array1);

// array2
UDFArrayDetails array2 = new UDFArrayDetails();
array2.setUri("tiledb://TileDB-Inc/quickstart_dense");
array2.setRanges(queryRanges);
array2.setBuffers(Arrays.asList("rows", "cols", "a"));
arrays.add(array2);

multiArrayUDF.setArrays(arrays);

// add arguments
HashMap<String,Object> arguments = new HashMap<>();
arguments.put("attr1", "a1");
arguments.put("attr2", "a");

// print result. Could also use: executeMultiArrayJSON(), executeMultiArrayJSONArray(), executeMultiArrayArrow()
System.out.println(tileDBUDF.executeMultiArray(multiArrayUDF, arguments));
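As with single-array UDFs, the input contract of a multi-array UDF can be exercised locally before anything is submitted. The sketch below assumes the function receives a list with one ordered dictionary per array, in the order the arrays were requested, as described in the Python example's comments. The sample values are made up, plain lists stand in for numpy arrays, and this variant pools the values of both arrays rather than adding them element-wise, so it is a different computation from the example above.

```python
from collections import OrderedDict
from statistics import median as stdlib_median

def median(ordered_dictionary_list):
    """Median of attribute "a" over all array results pooled together."""
    pooled = []
    for result in ordered_dictionary_list:  # one entry per array, in request order
        pooled.extend(result["a"])
    return stdlib_median(pooled)

# Hypothetical read results for two arrays, each restricted to attribute "a".
results = [
    OrderedDict([("a", [1, 2, 3, 4])]),
    OrderedDict([("a", [10, 20])]),
]

combined = median(results)
print(combined)
```

Pooling avoids any requirement that the two slices have matching shapes, which an element-wise operation would impose.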
You can register a multi-array UDF simply as follows:
import tiledb, tiledb.cloud
import numpy as np

def median(numpy_ordered_dictionary_list):
    # When you have multiple arrays, the parameter
    # we pass in is actually a list of ordered dictionaries.
    # The list is in the order of the arrays you asked for.
    return np.median(
        numpy_ordered_dictionary_list[0]["a"] + numpy_ordered_dictionary_list[1]["a"]
    )

tiledb.cloud.login(username="my_username", password="my_password")
# or tiledb.cloud.login(token="my_token")

tiledb.cloud.udf.register_multi_array_udf(median, name="median_multi_array", namespace="my_username")

library(tiledbcloud)

myfunc <- function(df1, df2) {
  median(as.vector(df1[["a"]])) + median(as.vector(df2[["a"]]))
}

tiledbcloud::login(username="my_username", password="my_password")
# or tiledbcloud::login(token="my_token")

tiledbcloud::register_udf(namespace="my_username", name="median_multi_array", type='multi_array', func=myfunc)
Currently, registering a UDF is only possible via the Python or R client.