
Dask: setting index on a big dataframe results in high disk space usage during processing

I am working with a large dataset (220 000 000 rows, ~25 GB) stored as a handful of CSV files.

I have already managed to read these CSV files with Dask and save the data as a parquet file with the following:

import pandas as pd
from dask.distributed import Client
import dask.dataframe as dd
client = Client()

init_fields = {
# definition of csv fields
}

raw_data_paths = [
# filenames with their path
]

read_csv_kwargs = dict(
    sep=";",
    header=None,
    names=list(init_fields.keys()),      
    dtype=init_fields, 
    parse_dates=['date'],    
)

ddf = dd.read_csv(
    raw_data_paths,
    **read_csv_kwargs,
)
ddf.to_parquet(persist_path / 'raw_data.parquet')

It works like a charm and completes within minutes. I get a parquet file holding a Dask DataFrame with 455 partitions, which I can use without any problem.
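
As a quick sanity check, I can read the result back and confirm the partition count (a sketch; persist_path is the same directory as above):

# Read the persisted parquet back and check the partition count.
ddf = dd.read_parquet(persist_path / 'raw_data.parquet')
print(ddf.npartitions)   # 455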

However, this dataframe consists of a huge list of client orders, which I would like to index by date for further processing.

When I try to run the code with the adjustment below:

ddf = dd.read_csv(
    raw_data_paths,
    **read_csv_kwargs,
).set_index('date')
ddf.to_parquet(persist_path / 'raw_data.parquet')

the processing gets really long, with 26 000+ tasks (I can understand that, that's a lot of data to sort), but workers start dying after a while from using too much memory.

[Screenshot of the Dask dashboard]

With each worker death, some progress is lost and it seems that the processing will never complete.

I have noticed that the worker deaths are related to my machine's disk filling up, and whenever a worker dies some space is freed. At the beginning of the processing, I have about 37 GB of free disk space.

I am quite new to Dask, so I have a few questions about this:

  • Is setting the index before dumping to a parquet file a good idea? I have several groupby-by-date operations coming in the next steps (a rough sketch of them follows this list), and as per the Dask documentation using this field as the index seemed like a good idea.
  • If I manage to set the index before dumping to a parquet file, will the parquet file be sorted, so that my further processing requires no more shuffling?
  • Does the behaviour described above (high disk usage leading to memory errors) seem normal, or is something odd in my setup or use of Dask? Are there some parameters that I could tweak?
  • Or do I really need more disk space, because sorting that much data requires it? What would be an estimate of the total disk space required?
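
For context, the upcoming steps look roughly like this sketch (the daily count is a made-up placeholder for my real aggregations, and it assumes the parquet ends up indexed and sorted by date):

import dask.dataframe as dd

# Hypothetical downstream step: with a sorted datetime index, time-based
# aggregations can work largely partition by partition instead of shuffling.
# (Depending on the dask version, read_parquet may need to be told to
# recover the index divisions from the parquet metadata.)
ddf = dd.read_parquet(persist_path / 'raw_data.parquet')
orders_per_day = ddf.resample('1D').count().compute()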

Thanks in advance!

EDIT: I finally managed to set the index by:

  • adding disk space on my machine
  • tweaking the client parameters to have more memory per worker

The parameters I used were:

client = Client(
    n_workers=1,
    threads_per_worker=8,
    processes=True,
    memory_limit='31GB'
)

I am now less convinced that disk usage was the root cause of my workers dying from lack of memory, because increasing disk space alone did not allow the processing to complete. It also required extending the memory per worker, which I achieved by creating a single worker with the whole memory of my machine.

However, I am quite surprised that so much memory was needed. I thought that one of the aims of Dask (and other big data tools) was to enable "out of core" processing. Am I doing something wrong here, or does setting an index require a large amount of memory no matter what?

Regards,

Here's how I understand things, but I might be missing some important points.

Let's start with a nice indexed dataset to have a reproducible example.

import dask
import dask.dataframe as dd

df = dask.datasets.timeseries(start='2000-01-01', end='2000-01-2', freq='2h', partition_freq='12h')

print(len(df), df.npartitions)
# 12 2

So we are dealing with a tiny dataset, just 12 rows, split across 2 partitions. Since this dataframe is indexed, merges on it will be very fast, because dask knows which partitions contain which (index) values.
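
This knowledge is recorded in the dataframe's divisions, which we can inspect directly (just a quick check of what dask knows up front):

print(df.known_divisions)   # True: the index boundaries of each partition are known
print(df.divisions)         # a tuple of timestamps delimiting the two partitions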

%%time
_ = df.merge(df, how='outer', left_index=True, right_index=True).compute()
#CPU times: user 25.7 ms, sys: 4.23 ms, total: 29.9 ms
#Wall time: 27.7 ms

Now, if we try to merge on a non-index column, dask does not know which partition contains which values, so it has to exchange information between workers and shuffle data among them.

%%time
_ = df.merge(df, how='outer', on=['name']).compute()
#CPU times: user 82.3 ms, sys: 8.19 ms, total: 90.4 ms
#Wall time: 85.4 ms

This might not seem like much on such a small dataset, but compare it to the time pandas would take:

%%time
_ = df.compute().merge(df.compute(), how='outer', on=['name'])
#CPU times: user 18.9 ms, sys: 3.39 ms, total: 22.3 ms
#Wall time: 19.7 ms

Another way to see this is with the DAGs: compare the DAG for the merge on the indexed column to the DAG for the merge on a non-indexed column. The first one is nicely parallel:

[DAG for the merge on the indexed column]

The second one (using the non-indexed column) is a lot more complex:

[DAG for the merge on the non-indexed column]

So what happens as the size of the data grows is that it becomes more and more expensive to perform operations on non-indexed columns. This is especially true for columns that contain many unique values (e.g. strings). You can experiment with increasing the number of partitions in the dataframe df constructed above, and you will observe how the non-indexed case becomes more and more complex, while the DAG for the indexed case remains scalable.
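
Here is a sketch of that experiment (partition_freq='2h' is an arbitrary choice to get more partitions, and the .visualize() calls assume graphviz is installed):

import dask

# Same toy timeseries, but cut into 12 single-row partitions instead of 2.
df_many = dask.datasets.timeseries(start='2000-01-01', end='2000-01-02',
                                   freq='2h', partition_freq='2h')

# The indexed merge keeps a flat, partition-wise parallel graph...
df_many.merge(df_many, how='outer', left_index=True, right_index=True).visualize('merge_indexed.png')

# ...while the non-indexed merge produces a much denser shuffle graph.
df_many.merge(df_many, how='outer', on=['name']).visualize('merge_non_indexed.png')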

Going back to your specific case: you are starting with a non-indexed dataframe, which after indexing is going to be a pretty complex entity. You can see the DAG for the indexed dataframe with .visualize(), and from experience I can guess it will not look pretty.

So when you save to parquet (or initiate any other computation of the dataframe), workers begin to shuffle data around, which eats up memory quickly (especially if there are many columns and/or many partitions and/or columns with a lot of unique values). Once a worker's memory limit is close, it starts spilling data to disk (if it is allowed to), which is why you were able to complete your task by increasing both memory and available disk space.
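
The thresholds at which workers spill, pause and get killed are configurable through the standard distributed.worker.memory settings; here is a sketch (the fractions are arbitrary example values, and the Client arguments mirror the ones from your edit):

import dask
from dask.distributed import Client

# Fractions of the per-worker memory limit (arbitrary example values).
dask.config.set({
    'distributed.worker.memory.target': 0.60,     # try to stay below this by spilling to disk
    'distributed.worker.memory.spill': 0.70,      # spill based on total process memory
    'distributed.worker.memory.pause': 0.80,      # pause execution of new tasks
    'distributed.worker.memory.terminate': 0.95,  # the nanny kills the worker
})

client = Client(n_workers=1, threads_per_worker=8, memory_limit='31GB')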

In a situation where neither of those options is possible, you might need to implement a custom workflow that uses the delayed API (or futures for dynamic graphs), so that the workflow makes use of some information that is not explicitly available to dask. For example, if the original csv files were partitioned by a column of interest, you might want to process these csv files in separate batches rather than ingesting them into a single dask dataframe and then indexing.
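
A minimal sketch of that idea, assuming (hypothetically) that each csv already holds a self-contained slice of dates, so every file can be sorted independently with plain pandas; read_csv_kwargs, raw_data_paths and persist_path are the objects defined in the question:

import pandas as pd
from pathlib import Path
from dask import delayed, compute

@delayed
def csv_to_sorted_parquet(csv_path, out_dir):
    # Each csv fits in memory on its own, so sorting it is cheap and purely local.
    df = pd.read_csv(csv_path, **read_csv_kwargs)
    df = df.set_index('date').sort_index()
    df.to_parquet(out_dir / f"{Path(csv_path).stem}.parquet")

# One delayed task per file; dask runs them in parallel across the workers.
tasks = [csv_to_sorted_parquet(path, persist_path) for path in raw_data_paths]
compute(*tasks)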
