Dask is a great library for parallel computing in Python. It can work on your laptop with multiple threads and processes, or on a large cluster. We will take advantage of two very appealing Dask features:
Dynamic task scheduling. We can create arbitrarily complex task graphs using dask.delayed
and let Dask execute them in parallel in our cluster.
Parallel arrays and dataframes. dask.array
and dask.dataframe
work similar to numpy arrays and Pandas dataframes, respectively, but they are extended to work for datasets larger than the main memory and perform computations in a distributed manner by multiple processes and machines.
TileDB currently integrates only with Dask arrays, but we are working on adding support for Dask dataframes. See our roadmap for updates.
Our examples focus only on a single machine, but will work on an arbitrary Dask cluster. Describing how to deploy a Dask cluster though is out of the scope of these docs.
You can install TileDB and Dask as follows: