1 of 4

Dask

Quickstart

Dask is a great library for parallel computing in Python. It can work on your laptop with multiple threads and processes, or on a large cluster. We will take advantage of two very appealing Dask features:

Dynamic task scheduling. We can create arbitrarily complex task graphs using dask.delayed and let Dask execute them in parallel in our cluster.
Parallel arrays and dataframes. dask.array and dask.dataframe work similar to numpy arrays and Pandas dataframes, respectively, but they are extended to work for datasets larger than the main memory and perform computations in a distributed manner by multiple processes and machines.

TileDB currently integrates only with Dask arrays, but we are working on adding support for Dask dataframes. See our roadmap for updates.

Our examples focus only on a single machine, but will work on an arbitrary Dask cluster. Describing how to deploy a Dask cluster though is out of the scope of these docs.

Installation

You can install TileDB and Dask as follows:

conda install -c conda-forge tiledb tiledb-py dask

TileDB With Dask Array

TileDB integrates very well with dask.array. We demonstrate with an example below where attribute attr stores an int32 value per cell:

import dask
import dask.array as da

array = da.from_tiledb('s3://my-bucket/my-dense-array',
                        attribute='attr',
                        storage_options={'vfs.s3.aws_access_key_id': 'mykeyid',
                                         'vfs.s3.aws_secret_access_key': 'mysecret'})
print(array.mean().compute())

You can add any TileDB configuration parameter in storage_options. Moreover, storage_options accepts an additional key option, where you can pass an encryption key if your array is encrypted (see Encryption).

You can also set array chunking similar to Dask's chunking. For example, you can do the following:

import dask
import dask.array as da

array = da.from_tiledb('s3://my-bucket/my-dense-array',
                        attribute='attr',
                        chunks=10, # or chunks=(10,) 
                        storage_options={'vfs.s3.aws_access_key_id': 'mykeyid',
                                         'vfs.s3.aws_secret_access_key': 'mysecret'})
print(array.mean().compute())

You can also write a Dask array into TileDB as follows:

import tiledb
import numpy as np
import dask, dask.array as da

array = da.random.random((25,25))

array.to_tiledb("s3://my-bucket/my-uri", storage_options={'vfs.s3.aws_access_key_id': 'mykeyid',
                                      'vfs.s3.aws_secret_access_key': 'mysecret'})

Note that the TileDB array does not need to exist. The above function call will create it if it does not by inferring the schema from the Dask array. To write to an existing array, you should open the array for writing as follows, which will create new fragment(s):

import numpy as np
import dask, dask.array as da

array = da.random.random((25,25))

with tiledb.open(s3://my-bucket/my-uri, 'w') as A:
    array.to_tiledb(A)

Using an existing Array object allows extra customization of the array schema beyond what is possible with the automatic array creation shown earlier. For example, to create an array with a compression filter applied to the attribute, create the schema and array first, then write to the open Array:

import tiledb, dask, dask.array as da, numpy as np
from tiledb import Domain, Dim, Attr

uri = "/path/to/array"

# create schema with Zstd compression applied to the attribute:
schema = tiledb.ArraySchema(
    Domain([Dim("x", domain=(0,99), tile=100, dtype=np.uint64),
            Dim("y", domain=(0,99), tile=100, dtype=np.uint64)]),
    attrs=[Attr("v", dtype=np.float64, filters=[tiledb.ZstdFilter(1)])],
    sparse=False
)

# create empty array from the schema above:
tiledb.Array.create(uri, schema)

# write a 100x100 array of 1s to array created above:
with tiledb.open(uri, "w") as tdb_array:
    da.ones((100,100)).to_tiledb(tdb_array)

TileDB With Dask Delayed

dask.delayed is a powerful feature of Dask that allows you to create arbitrary task graphs and submit them to Dask's scheduler for execution. You can be truly creative with that functionality and implement sophisticated out-of-core computations (i.e., on larger than RAM datasets) and handle highly distributed workloads.

There is no special integration needed with TileDB, as dask.delayed is quite generic and can work with any user-defined task. We just point out here that you can use TileDB array slicing in a delayed task, which allows you to process truly large TileDB arrays on your laptop or on a large cluster.

We include a very simple example below, stressing though that one can implement much more complex algorithms on arbitrarily large TileDB arrays.

import tiledb
import numpy as np
import dask, dask.array

uri = "<array-uri>"
ctx = tiledb.Ctx()

# Create a simple 1D array with 1000 elements
def write_array():
    dom = tiledb.Domain(tiledb.Dim(name="x",
                                   domain=(0, 999),
                                   tile=10,
                                   dtype=np.uint64),
         	                       ctx=ctx)

    attrs = [tiledb.Attr(name="attr", dtype=np.float64, ctx=ctx),]

    schema = tiledb.ArraySchema(domain=dom, sparse=False,
                                attrs=attrs,
                                ctx=ctx)
    tiledb.DenseArray.create(uri, schema)

    with tiledb.DenseArray(uri, 'w') as A:
        A[:] = np.arange(1000,dtype=np.float64)

# Create the array only if it does not already exist
if not tiledb.VFS().is_dir(uri):
    write_array()

# This produces an array slice
def slice_tiledb(path, slc):
    with tiledb.DenseArray(path) as A:
        return A[slc]['attr']

# Partition the array into 50 delayed slices
partition = 50
delayed_slices = list(
    dask.delayed(slice_tiledb)(uri, slice(start, start+partition)) for 
                               start in 
                               np.arange(0,1001-partition,step=partition))

# This creates a Dask array from the delayed slices
darray = dask.array.concatenate(
    dask.array.from_delayed(x,
                            shape=(partition,), dtype=np.float64)
                            for x in delayed_slices)

#Everything up until here is lazy - nothing is really computed

# This triggers the entire computation
mean = darray.mean().compute()
print(mean)