Below we show how to use Python/R UDFs in TileDB Cloud, with an example that computes the median of the values of attribute a over a slice of a 2D dense array.
For Python, you just need to write your function (median in this example) so that it takes as input an ordered dictionary of numpy arrays, i.e., of the form {"a" : <numpy-array>, "b" : <numpy-array>, ...}, where the keys are attribute or dimension names of the array you are querying. The reason is that this function will be applied on an array slice; recall that, upon a read, the Python API of TileDB returns an ordered dictionary with one numpy array per attribute and dimension. Then you just use the apply function of the TileDB Cloud client, which takes as input your function, a slice, and optionally a list of attributes (default is all attributes). Note that the ordered dictionary passed to your function contains only the selected attributes.
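To make the input shape concrete, here is a minimal local sketch that needs no TileDB Cloud connection. It builds by hand the kind of ordered dictionary described above and passes it to the same median function. The sample values are made up for illustration, and only attribute a is included, mimicking a call that selects just that attribute.

```python
from collections import OrderedDict

import numpy


def median(numpy_ordered_dictionary):
    # The UDF only sees the attributes that were selected for the query.
    return numpy.median(numpy_ordered_dictionary["a"])


# Hand-built stand-in for the result of a 2x2 slice (values are illustrative).
fake_slice = OrderedDict([("a", numpy.array([[1, 2], [5, 6]]))])

local_result = median(fake_slice)
print(local_result)
```

Testing a UDF this way before submitting it can catch key-name or shape mistakes without incurring any cloud execution.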
For R, the story is similar: you just need to write your function that takes a data frame as input.
For Java, you can run an existing Python or R UDF. In this case we use a UDF that scales the values of the input argument.
import tiledb, tiledb.cloud, numpy

def median(numpy_ordered_dictionary):
    return numpy.median(numpy_ordered_dictionary["a"])

tiledb.cloud.login(username="my_username", password="my_password")
# or tiledb.cloud.login(token="my_token")

with tiledb.open("tiledb://TileDB-Inc/quickstart_dense", ctx=tiledb.cloud.Ctx()) as A:
    # apply on subarray [1,2]x[1,2]
    res = A.apply(median, [(1,2), (1,2)], attrs = ["a"])
    print(res)
// Login by using a TileDBLogin object
TileDBClient tileDBClient = new TileDBClient(
    new TileDBLogin(null, null, "<TILEDB_API_TOKEN>", true, true, true));

// Create a TileDBUDF object
TileDBUDF tileDBUDF = new TileDBUDF(tileDBClient, "TileDB-Inc");

// add the query ranges
ArrayList<BigDecimal> range1 = new ArrayList<>();
range1.add(BigDecimal.valueOf(1));
range1.add(BigDecimal.valueOf(4));
ArrayList<BigDecimal> range2 = new ArrayList<>();
range2.add(BigDecimal.valueOf(1));
range2.add(BigDecimal.valueOf(4));
QueryRanges queryRanges = new QueryRanges();
queryRanges.addRangesItem(range1);
queryRanges.addRangesItem(range2);

// add the arguments for the UDF
HashMap<String,Object> argumentsForArrayUDF = new HashMap<>();
argumentsForArrayUDF.put("attr", "rows");
argumentsForArrayUDF.put("scale", 9);

// Create an array udf
MultiArrayUDF multiArrayUDF = new MultiArrayUDF();
multiArrayUDF.setUdfInfoName("TileDB-Inc/array-udf");
multiArrayUDF.setRanges(queryRanges);

// execute. Could also use: executeSingleArrayArrow(), executeSingleArrayJSON(), executeSingleArrayJSONArray()
System.out.println(tileDBUDF.executeSingleArray(multiArrayUDF, argumentsForArrayUDF, "tiledb://TileDB-Inc/quickstart_sparse", "TileDB-Inc"));
All slices provided as input to the apply function are inclusive.
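Inclusive ranges differ from Python's usual half-open slicing, so it can help to see the mapping spelled out. The sketch below is purely local: select_inclusive is a hypothetical helper (not part of the TileDB API), and the 4x4 array with values 1 to 16 and a domain starting at 1 is an assumed stand-in for a small dense array.

```python
import numpy

# Assumed stand-in data: a 4x4 array whose domain is [1,4] x [1,4].
data = numpy.arange(1, 17).reshape(4, 4)


def select_inclusive(array, ranges, domain_start=1):
    """Hypothetical helper: cut out inclusive (start, end) ranges, one per dimension."""
    index = tuple(
        slice(start - domain_start, end - domain_start + 1)  # +1 keeps `end` in the result
        for start, end in ranges
    )
    return array[index]


sub = select_inclusive(data, [(1, 2), (1, 2)])
print(sub.shape)  # both endpoints are kept, so the slice is 2x2
print(numpy.median(sub))
```

With half-open semantics the same ranges would select a single cell, which is why the distinction matters when sizing a slice.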
Multi-Index Usage
Multi-index queries are supported when applying an array UDF. You can pass any number of tuples or slices using a list of lists syntax (one per dimension).
import tiledb, tiledb.cloud, numpy

def median(numpy_ordered_dictionary):
    return numpy.median(numpy_ordered_dictionary["a"])

tiledb.cloud.login(username="my_username", password="my_password")
# or tiledb.cloud.login(token="my_token")

with tiledb.open("tiledb://TileDB-Inc/quickstart_dense", ctx=tiledb.cloud.Ctx()) as A:
    # apply on subarrays [1,2]x[1,4] and [4,4]x[1,4]
    res = A.apply(median, [[(1,2), 4], [slice(1,4)]], attrs = ["a"])
    print(res)
All slices provided as input to the apply function are inclusive.
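The list-of-lists syntax can be read as one list of selectors per dimension, where each selector is a scalar, a tuple, or a slice. As a rough local illustration, the hypothetical helper to_inclusive_ranges below (not part of the TileDB API) normalizes each selector into an inclusive (start, end) pair. The selection is then evaluated against an assumed 4x4 stand-in array with values 1 to 16 and a domain starting at 1; how TileDB itself lays out the returned cells is not modeled here.

```python
import numpy


def to_inclusive_ranges(selectors):
    """Hypothetical helper: turn scalars, tuples and slices into (start, end) pairs."""
    ranges = []
    for sel in selectors:
        if isinstance(sel, slice):
            ranges.append((sel.start, sel.stop))  # slice bounds are treated as inclusive
        elif isinstance(sel, tuple):
            ranges.append(sel)
        else:
            ranges.append((sel, sel))  # a scalar selects a single coordinate
    return ranges


def coordinates(ranges, domain_start=1):
    """Expand inclusive ranges into zero-based indices for the stand-in array."""
    return [i - domain_start for start, end in ranges for i in range(start, end + 1)]


multi_index = [[(1, 2), 4], [slice(1, 4)]]
per_dim = [to_inclusive_ranges(dim) for dim in multi_index]
print(per_dim)

# Assumed stand-in data with domain [1,4] x [1,4].
data = numpy.arange(1, 17).reshape(4, 4)
rows, cols = (coordinates(r) for r in per_dim)
selected = data[numpy.ix_(rows, cols)]
print(numpy.median(selected))
```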
Apply Without Opening The Array
To execute an array UDF, it is not always necessary to have the array opened locally. For Python, an alternative function to apply the UDF on an array URI is provided.
import tiledb, tiledb.cloud, numpy

def median(numpy_ordered_dictionary):
    return numpy.median(numpy_ordered_dictionary["a"])

tiledb.cloud.login(username="my_username", password="my_password")
# or tiledb.cloud.login(token="my_token")

# apply on subarrays [1,2]x[1,4] and [4,4]x[1,4]
res = tiledb.cloud.array.apply("tiledb://TileDB-Inc/quickstart_dense", median, [[(1,2), 4], [slice(1,4)]], attrs = ["a"])
print(res)
Asynchronous Execution
An asynchronous version of the array UDFs is available.
import tiledb, tiledb.cloud, numpy, random

def median(numpy_ordered_dictionary):
    return numpy.median(numpy_ordered_dictionary["a"])

tiledb.cloud.login(username="my_username", password="my_password")
# or tiledb.cloud.login(token="my_token")

with tiledb.open("tiledb://TileDB-Inc/quickstart_dense", ctx=tiledb.cloud.Ctx()) as A:
    # apply on subarrays [1,2]x[1,4] and [4,4]x[1,4]
    # res will be a future
    res = A.apply_async(median, [[(1,2), 4], [slice(1,4)]], attrs = ["a"])
    # call res.get() to block on the results
    print(res.get())
library(tiledbcloud)

myfunc <- function(df) {
  median(as.vector(df[["a"]]))
}

tiledbcloud::login(username="my_username", password="my_password")
# or tiledbcloud::login(token="my_token")

res <- delayed_array_udf(
  array="TileDB-Inc/quickstart_dense",
  udf=myfunc,
  selectedRanges=list(cbind(1,2), cbind(1,2)),
  attrs=c("a")
)

# call compute(res) to block on the results
o <- compute(res)
print(o)
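The future-based pattern can be tried locally with Python's standard concurrent.futures module as a stand-in for the cloud call. This is only an analogy: the standard-library future exposes result(), whereas the apply_async example above blocks with get(), and the input dictionary here is made-up sample data.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy


def median(numpy_ordered_dictionary):
    return numpy.median(numpy_ordered_dictionary["a"])


# Made-up sample input standing in for an array slice.
sample = {"a": numpy.array([[1, 2], [5, 6]])}

with ThreadPoolExecutor(max_workers=1) as pool:
    # submit() returns immediately with a future, much like apply_async.
    future = pool.submit(median, sample)
    # Other work could happen here while the task runs.
    async_result = future.result()  # blocks until the task has finished

print(async_result)
```

The useful property in both cases is that submission and retrieval are separate steps, so several UDFs can be launched before blocking on any of them.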
Selecting Who to Charge
If you are a member of an organization, then by default the organization is charged for your array UDF. If you would like to charge the array UDF to yourself, you just need to add an extra argument, namespace.
import tiledb, tiledb.cloud, numpy

def median(numpy_ordered_dictionary):
    return numpy.median(numpy_ordered_dictionary["a"])

tiledb.cloud.login(username="my_username", password="my_password")
# or tiledb.cloud.login(token="my_token")

with tiledb.open("tiledb://TileDB-Inc/quickstart_dense", ctx=tiledb.cloud.Ctx()) as A:
    # apply on subarray [1,2]x[1,2]
    res = A.apply(median, [(1,2), (1,2)], attrs = ["a"], namespace="my_username")
    print(res)
// Login by using a TileDBLogin object
TileDBClient tileDBClient = new TileDBClient(
    new TileDBLogin(null, null, "<TILEDB_API_TOKEN>", true, true, true));

// Create a TileDBUDF object. The second param is the namespace to be charged
TileDBUDF tileDBUDF = new TileDBUDF(tileDBClient, "TileDB-Inc");

// add the query ranges
ArrayList<BigDecimal> range1 = new ArrayList<>();
range1.add(BigDecimal.valueOf(1));
range1.add(BigDecimal.valueOf(4));
ArrayList<BigDecimal> range2 = new ArrayList<>();
range2.add(BigDecimal.valueOf(1));
range2.add(BigDecimal.valueOf(4));
QueryRanges queryRanges = new QueryRanges();
queryRanges.addRangesItem(range1);
queryRanges.addRangesItem(range2);

// add the arguments for the UDF
HashMap<String,Object> argumentsForArrayUDF = new HashMap<>();
argumentsForArrayUDF.put("attr", "rows");
argumentsForArrayUDF.put("scale", 9);

// Create an array udf
MultiArrayUDF multiArrayUDF = new MultiArrayUDF();
multiArrayUDF.setUdfInfoName("TileDB-Inc/array-udf");
multiArrayUDF.setRanges(queryRanges);

// execute. Could also use: executeSingleArrayArrow(), executeSingleArrayJSON(), executeSingleArrayJSONArray()
System.out.println(tileDBUDF.executeSingleArray(multiArrayUDF, argu
Resource Classes
Each Array UDF runs by default in an isolated environment with 2 CPUs and 2 GB of memory. You can choose an alternative runtime environment from the following list:
Charges are based on the total number of CPUs selected, not on actual use.
To run an array UDF in a specific environment, set the resource_class parameter to the name of the environment.
import tiledb, tiledb.cloud, numpy

def median(numpy_ordered_dictionary):
    return numpy.median(numpy_ordered_dictionary["a"])

tiledb.cloud.login(username="my_username", password="my_password")
# or tiledb.cloud.login(token="my_token")

with tiledb.open("tiledb://TileDB-Inc/quickstart_dense", ctx=tiledb.cloud.Ctx()) as A:
    # apply on subarray [1,2]x[1,2]
    res = A.apply(median, [(1,2), (1,2)], attrs = ["a"], resource_class="large")
    print(res)