
Using dask.DataFrame.to_parquet() to write a large file

I have a .pq file (about 2 GB) in which I want to change a column name using dask.

I have no problem reading the file into a dask DataFrame, and I'm also able to rename the columns. But when it comes to writing the .pq file back to disk with ddf.to_parquet(), the job fails because dask appears to try to fit everything in memory (and it doesn't fit).

Why does this happen? I expected dask to do this iteratively. How can I write the target file in chunks?

Below is the code I'm using.

import dask.dataframe as dd

ddf = dd.read_parquet(
    '/path/to/file/file.pq',
    engine='pyarrow'
)

ddf = ddf.rename(columns={'old_column_name': 'new_column_name'})

# the step which fails
ddf.to_parquet(
    '/path/to/file/edited/',
    engine='pyarrow',
    write_index=False
)

Thanks in advance!

Dask does indeed load your data in chunks and write them to the output in chunks. The total memory usage will depend on:

  • the size of each chunk, known as a "row group" in parquet; row groups are not divisible, and what matters is their in-memory size after decompression and decoding (see the sketch after this list)
  • the number of chunks you process at once, which defaults to the number of cores in your CPU if you don't configure otherwise
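
As a rough check of how big those chunks are, you can inspect the row-group metadata of the source file with pyarrow (the path below is the one from the question; note that total_byte_size reports the size of the encoded column data in the footer, not the decoded in-memory footprint, which is typically larger):

import pyarrow.parquet as pq

pf = pq.ParquetFile('/path/to/file/file.pq')
meta = pf.metadata

print('row groups:', meta.num_row_groups)
for i in range(meta.num_row_groups):
    rg = meta.row_group(i)
    # num_rows and total_byte_size come from the parquet footer;
    # the decoded in-memory size of the chunk will usually be larger.
    print(f'row group {i}: {rg.num_rows} rows, {rg.total_byte_size / 1e6:.1f} MB')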

Note that some intermediate values are also needed during processing, so you generally want each thread of each worker to have room for a good deal more than just one chunk's worth of data.
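
If the chunks are large, one way to keep peak memory down is to limit how many of them are processed at once. A minimal sketch using the question's code and dask's standard compute=False, scheduler and num_workers options (the value 2 is just an example; scheduler='synchronous' would process one chunk at a time):

import dask.dataframe as dd

ddf = dd.read_parquet('/path/to/file/file.pq', engine='pyarrow')
ddf = ddf.rename(columns={'old_column_name': 'new_column_name'})

# Build the write task graph without running it yet.
write_task = ddf.to_parquet(
    '/path/to/file/edited/',
    engine='pyarrow',
    write_index=False,
    compute=False,
)

# Run it with only two worker threads, so at most two chunks
# (plus intermediates) are held in memory at the same time.
write_task.compute(scheduler='threads', num_workers=2)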
