Parallel Processing with Dask

Hasan Özdemir
4 min read · Jul 3, 2022


Hello everyone! Today we will be covering two main topics: parallel processing (also called parallel computing) and Dask. Let’s get started.

What is parallel computing?

Parallel processing means executing multiple tasks at the same time, using multiple processors in the same machine. We mostly face two types of flow in our daily code. The first is sequential, where each task depends on the others and must be completed one after another. There is also a type of flow that is not sequential, where a task does not need to wait for, or depend on, the previous or next piece of code.

The question of parallel computing comes up when your project allows it. If the tasks are independent and we still apply a sequential approach, we are making our code slow with our own hands. The solution is to apply a parallel processing approach and use libraries that support parallel processing, so that multiple steps are computed simultaneously.
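To make this concrete, here is a minimal sketch using dask.delayed (the slow_increment function is just a made-up stand-in for any independent task):

import dask
from time import sleep

def slow_increment(x):
    sleep(1)  # simulate an expensive, independent task
    return x + 1

# Sequential approach: each call waits for the previous one (~3 seconds)
sequential = [slow_increment(i) for i in range(3)]

# Parallel approach: dask.delayed builds a task graph instead of running
# immediately, and dask.compute runs the independent tasks together (~1 second)
lazy = [dask.delayed(slow_increment)(i) for i in range(3)]
parallel = dask.compute(*lazy)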

Dask

Dask accelerates the existing Python ecosystem. Libraries like NumPy, pandas, and scikit-learn are popular today because they combine high-level, usable, and flexible APIs with high-performance implementations. However, these libraries were not originally designed to scale beyond a single CPU or to data that doesn’t fit in memory. This often ends in memory errors, or in switching libraries, when you run into larger datasets. This is what Dask can help fix.

Dask is an open-source library that provides advanced parallelization for analytics, especially when you are working with large data. Dask is also a platform for building distributed applications.
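As a small illustration of that, you can start a local cluster with the distributed scheduler (a minimal sketch; it assumes the distributed extra is installed, e.g. pip install "dask[distributed]"):

from dask.distributed import Client

# Start a local cluster of worker processes on this machine;
# the same API later scales out to a multi-machine cluster
client = Client()
print(client)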

Dask Code Examples

Reading data with Dask

Dask works very well with all the popular libraries, so after installing Dask you can handle almost all operations without changing your code much. Dask also gives you new options for setting up your operations.
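For instance, reading a CSV file looks like this (a minimal sketch; the salaries.csv file is a hypothetical example):

import dask.dataframe as dd

# read_csv is lazy: at this point Dask only builds a task graph
salaries_df = dd.read_csv("salaries.csv")

print(salaries_df.npartitions)  # number of pandas DataFrames backing the collection
print(salaries_df.head())  # computes only the first partition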

Let’s look at the above code example. A Dask DataFrame is composed of multiple partitions, each of which is a pandas DataFrame. Dask splits the data into multiple pandas DataFrames on purpose, so that operations can be performed on multiple slices of the data in parallel.

The size of each partition depends on a parameter you can pass to the read_csv method.

salaries_df = dd.read_csv("salaries.csv", blocksize="16MB")

In the above example we set blocksize to 16MB, so each partition will be 16MB. If you do not supply this parameter, it will be set automatically based on your number of cores and available memory.
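If you want to see how the data was actually split, you can inspect the partitions (a small sketch, reusing the hypothetical salaries_df from above):

# Row count per partition; map_partitions applies a function to each
# underlying pandas DataFrame in parallel
print(salaries_df.map_partitions(len).compute())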

Specifying Data Types

CSV is a text-based file format and does not contain metadata about the columns or their data types. If you want to make the process more efficient, you can set the types as shown below.

salaries_df = dd.read_csv(
    "salaries.csv",
    blocksize="16MB",
    dtype={"company_name": "string[pyarrow]"},  # string[pyarrow] requires the pyarrow package
)

Text Processing Pipeline with Dask

Installing Packages

pip install dask_ml
pip install dask

Importing Packages

from dask_ml.preprocessing import Categorizer, OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
import pandas as pd
import dask.dataframe as dd

dask_ml.preprocessing contains some scikit-learn-style transformers that can be used in pipelines to perform various data transformations as part of the model-fitting process. The transformers work very nicely on Dask or pandas collections (dask.dataframe, dask.array, pandas DataFrames and Series).
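As a small standalone illustration of one such transformer (with made-up data, just for demonstration):

from dask_ml.preprocessing import Categorizer
import pandas as pd

toy = pd.DataFrame({"B": ["a", "b", "a", "c"]})

# Categorizer converts object/string columns to the pandas categorical dtype
print(Categorizer().fit_transform(toy).dtypes)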

Creating a dataframe

You can create a Dask DataFrame from various data storage formats like CSV, HDF, Apache Parquet, and others. In our example, to keep it simple and understandable for everyone, we will create one not from a data source but from a Python dictionary.

# A small pandas DataFrame with a numeric and a string column
df = pd.DataFrame({"A": [1, 2, 3, 4, 5, 6], "B": ["a", "b", "c", "d", "e", "f"]})

# Split it into a Dask DataFrame with two partitions
x = dd.from_pandas(df, npartitions=2)

# Target labels, one per row of x
y = dd.from_pandas(pd.Series([0, 1, 1, 0, 1, 0]), npartitions=2)

Creating a pipeline

This pipeline passes the data through the given transformers, fitting each of them in turn, and then fits the final estimator on the transformed data.

pipeline = make_pipeline(
    Categorizer(),  # convert object columns to the categorical dtype
    OneHotEncoder(),  # expand categorical columns into indicator columns
    LogisticRegression(solver='lbfgs')
)

pipeline.fit(x, y)
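Assuming the fit succeeds, the fitted pipeline can be used for predictions as usual. Note that the plain scikit-learn LogisticRegression step materializes the Dask collections in memory when it runs, so this pattern fits best when the preprocessing is the heavy part:

predictions = pipeline.predict(x)
print(predictions)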

I will be providing more advanced tutorials about Dask and parallel computing. Thank you for being patient and reading my blog.

Further Reading

  1. docs.dask.org
  2. scikit-learn.org
  3. Distributed Computing @ University of Cambridge
