
Dask: setting index on a big dataframe results in high disk space usage during processing

I am working with a large dataset (220,000,000 rows, ~25 GB of csv) which is stored as a handful of csv files.

I have already managed to read these csv files with Dask and save the data as a parquet file with the following:

import pandas as pd
from dask.distributed import Client
import dask.dataframe as dd
client = Client()

init_fields = {
# definition of csv fields
}

raw_data_paths = [
# filenames with their path
]

read_csv_kwargs = dict(
    sep=";",
    header=None,
    names=list(init_fields.keys()),      
    dtype=init_fields, 
    parse_dates=['date'],    
)

ddf = dd.read_csv(
    raw_data_paths,
    **read_csv_kwargs,
)
ddf.to_parquet(persist_path / 'raw_data.parquet')  # persist_path is a pathlib.Path to the output directory

It works like a charm and completes within minutes. I get a parquet file holding a Dask DataFrame with 455 partitions which I can totally use.

However, this dataframe consists of a huge list of client orders, which I would like to index by date for further processing.

When I try to run the code with the adjustment below:

ddf = dd.read_csv(
    raw_data_paths,
    **read_csv_kwargs,
).set_index('date')
ddf.to_parquet(persist_path / 'raw_data.parquet')

the processing gets really long, with 26,000+ tasks (I can understand that, it's a lot of data to sort), but workers start dying after a while from using too much memory.

[Screenshot: Dask dashboard]

With each worker death, some progress is lost and it seems that the processing will never complete.

I have noticed that the worker deaths are related to the disk of my machine reaching its limit, and whenever a worker dies some space is freed. At the beginning of the processing, I have about 37 GB of free disk space.

I am quite new to Dask, so I have a few questions about that:

  • Is setting the index before dumping to a parquet file a good idea? I have several groupbys on date coming in the next steps, and as per the Dask documentation, using this field as the index seemed to me to be a good idea.
  • If I manage to set the index before dumping to a parquet file, will the parquet file be sorted, so that my further processing requires no more shuffling?
  • Does the behaviour described above (high disk usage leading to memory errors) seem normal, or is something odd in my setup or use of Dask? Are there some parameters that I could tweak?
  • Or do I really need more disk space, because sorting that much data requires it? What would be an estimate of the total disk space required?

Thanks in advance!

EDIT: I finally managed to set the index by:

  • adding disk space to my machine
  • tweaking the client parameters to give each worker more memory

The parameters I used were:

client = Client(
    n_workers=1,
    threads_per_worker=8,
    processes=True,
    memory_limit='31GB'
)

I am less convinced now that disk usage was the root cause of my workers dying from lack of memory, because increasing disk space alone did not enable the processing to complete. The memory per worker also had to be extended, which I achieved by creating a single worker with the whole memory of my machine.

However, I am quite surprised that so much memory was needed. I thought that one of the aims of Dask (and other big data tools) was to enable "out of core" processing. Am I doing something wrong here, or does setting an index require a large amount of memory no matter what?

Regards,

Here's how I understand things, but I might be missing some important points.

Let's start with a nice indexed dataset to have a reproducible example.

import dask
import dask.dataframe as dd

df = dask.datasets.timeseries(start='2000-01-01', end='2000-01-2', freq='2h', partition_freq='12h')

print(len(df), df.npartitions)
# 12 2

So we are dealing with a tiny dataset, just 12 rows, split across 2 partitions. Since this dataframe is indexed, merges on it will be very fast, because dask knows which partitions contain which (index) values.

%%time
_ = df.merge(df, how='outer', left_index=True, right_index=True).compute()
#CPU times: user 25.7 ms, sys: 4.23 ms, total: 29.9 ms
#Wall time: 27.7 ms

Now, if we try to merge on a non-index column, dask will not know which partition contains which values, so it will have to exchange information between workers and shuffle bits of data among them.

%%time
_ = df.merge(df, how='outer', on=['name']).compute()
#CPU times: user 82.3 ms, sys: 8.19 ms, total: 90.4 ms
#Wall time: 85.4 ms

This might not seem like much on this small dataset, but compare it to the time pandas would take:

%%time
_ = df.compute().merge(df.compute(), how='outer', on=['name'])
#CPU times: user 18.9 ms, sys: 3.39 ms, total: 22.3 ms
#Wall time: 19.7 ms

Another way to see this is with the DAGs: compare the DAG for the merge on the indexed column to the DAG for the merge on the non-indexed column. The first one is nicely parallel:

[DAG: merge on indexed column]

The second one (using the non-indexed column) is a lot more complex:

[DAG: merge on non-indexed column]
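For reference, both of these DAG images can be regenerated with .visualize(). A minimal sketch, assuming the graphviz package is installed; the output filenames are arbitrary:

import dask

# Same toy dataframe as above.
df = dask.datasets.timeseries(start='2000-01-01', end='2000-01-02', freq='2h', partition_freq='12h')

# Build both merge graphs lazily (no computation happens here).
merge_indexed = df.merge(df, how='outer', left_index=True, right_index=True)
merge_non_indexed = df.merge(df, how='outer', on=['name'])

# Render the task graphs to image files.
merge_indexed.visualize(filename='merge_indexed.png')
merge_non_indexed.visualize(filename='merge_non_indexed.png')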

So what happens as the size of the data grows is that it becomes more and more expensive to perform operations with non-indexed columns. This is especially true for columns that contain many unique values (e.g. strings). You can experiment with increasing the number of partitions in the dataframe df constructed above, and you will observe how the non-indexed case becomes more and more complex, while the DAG for the indexed data remains scalable.
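One crude way to quantify this, without rendering anything, is to compare the raw task counts of the two merge graphs as the number of partitions grows. A minimal sketch; the date range and frequencies are arbitrary choices, and the exact counts will depend on your dask version and shuffle method:

import dask

# A larger version of the toy dataframe: one month of hourly data in 3-hour partitions.
df_big = dask.datasets.timeseries(start='2000-01-01', end='2000-02-01',
                                  freq='1h', partition_freq='3h')
print(df_big.npartitions)

merge_indexed = df_big.merge(df_big, how='outer', left_index=True, right_index=True)
merge_non_indexed = df_big.merge(df_big, how='outer', on=['name'])

# Number of tasks in each graph (still no computation).
print(len(merge_indexed.__dask_graph__()))      # grows roughly linearly with the partition count
print(len(merge_non_indexed.__dask_graph__()))  # grows much faster because of the shuffle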

Going back to your specific case, you are starting with a non-indexed dataframe, which after indexing is going to be a pretty complex entity. You can see the DAG for the indexed dataframe with .visualize(), and from experience I can guess it will not look pretty.

So when you are saving to parquet (or initiating any other computation on the dataframe), workers begin to shuffle data around, which will eat up memory quickly (especially if there are many columns, and/or many partitions, and/or columns have a lot of unique values). Once the worker memory limit is approached, workers will start spilling data to disk (if they are allowed to), which is why you were able to complete your task by increasing both memory and available disk space.
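How early workers start spilling, pause, or get killed is controlled by the worker memory thresholds in dask's configuration, which you can tune before creating the client. A minimal sketch; the fractions and the cluster sizing below are illustrative placeholders, not recommendations:

import dask
from dask.distributed import Client

# Thresholds are fractions of each worker's memory limit.
dask.config.set({
    'distributed.worker.memory.target': 0.50,     # start spilling data to disk
    'distributed.worker.memory.spill': 0.65,      # spill based on overall process memory
    'distributed.worker.memory.pause': 0.80,      # pause accepting new tasks
    'distributed.worker.memory.terminate': 0.95,  # restart the worker as a last resort
})

# Hypothetical local cluster; worker count and memory limit are placeholders.
client = Client(n_workers=4, threads_per_worker=2, memory_limit='8GB')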

In a situation where neither of those options is possible, you might need to implement a custom workflow that uses the delayed API (or futures for dynamic graphs), such that this workflow makes use of some information that is not explicitly available to dask. For example, if the original csv files were partitioned by a column of interest, you might want to process these csv files in separate batches, rather than ingesting them into a single dask dataframe and then indexing.
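For illustration, here is a rough sketch of that idea with the delayed API, reusing read_csv_kwargs and raw_data_paths from the question. It assumes, hypothetically, that each csv file covers its own date range and fits comfortably in a worker's memory; the output directory and the helper name process_batch are made up:

from pathlib import Path

import dask
import pandas as pd

@dask.delayed
def process_batch(csv_path, out_path):
    # Each batch is assumed small enough for plain pandas, so it can be
    # sorted and written on its own, with no global shuffle.
    df = pd.read_csv(csv_path, **read_csv_kwargs)   # kwargs from the question
    df = df.set_index('date').sort_index()
    df.to_parquet(out_path)
    return out_path

out_dir = Path('processed')   # made-up output directory
out_dir.mkdir(exist_ok=True)

# One delayed task per csv file; dask runs a few of them at a time,
# so only a handful of batches are in memory at any moment.
tasks = [process_batch(path, out_dir / f'batch_{i}.parquet')
         for i, path in enumerate(raw_data_paths)]
dask.compute(*tasks)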
