Quickstart
Last updated
Was this helpful?
Last updated
Was this helpful?
is a great library for parallel computing in Python. It can work on your laptop with multiple threads and processes, or on a large cluster. We will take advantage of two very appealing Dask features:
Dynamic task scheduling. We can create arbitrarily complex task graphs using dask.delayed
and let Dask execute them in parallel in our cluster.
Parallel arrays and dataframes. dask.array
and dask.dataframe
work similar to numpy arrays and Pandas dataframes, respectively, but they are extended to work for datasets larger than the main memory and perform computations in a distributed manner by multiple processes and machines.
Our examples focus only on a single machine, but will work on an arbitrary Dask cluster. Describing how to deploy a Dask cluster though is out of the scope of these docs.
You can install TileDB and Dask as follows: