Dask is a great library for parallel computing in Python. It can work on your laptop with multiple threads and processes, or on a large cluster. We will take advantage of two very appealing Dask features:
Dynamic task scheduling. We can create arbitrarily complex task graphs using
dask.delayed and let Dask execute them in parallel in our cluster.
Parallel arrays and dataframes.
dask.dataframe work similar to numpy arrays and Pandas dataframes, respectively, but they are extended to work for datasets larger than the main memory and perform computations in a distributed manner by multiple processes and machines.
Our examples focus only on a single machine, but will work on an arbitrary Dask cluster. Describing how to deploy a Dask cluster though is out of the scope of these docs.
You can install TileDB and Dask as follows:
conda install -c conda-forge tiledb tiledb-py dask