
Using dask.DataFrame.to_parquet() to write a large file

I have a .pq file (about 2 GB) in which I want to change a column name using dask.

I have no problem reading the file into a dask DataFrame, and I'm also able to rename the columns. But when it comes to writing the .pq file back to disk with ddf.to_parquet(), the job fails because dask appears to try to fit everything in memory (and it doesn't fit).

Why does this happen? I expected dask to do this iteratively. How can I write the target file in chunks?

Below is the code I'm using.

import dask.dataframe as dd

ddf = dd.read_parquet(
    '/path/to/file/file.pq',
    engine='pyarrow'
)

ddf = ddf.rename(columns={'old_column_name': 'new_column_name'})

# the step which fails
ddf.to_parquet(
    '/path/to/file/edited/',
    engine='pyarrow',
    write_index=False
)

Thanks in advance!

Dask does indeed load your data in chunks and write them to the output in chunks. The total memory usage will depend on:

  • the size of each chunk, known as a "row group" in parquet; row groups are not divisible, and what matters is their in-memory size after decompression and decoding (see the sketch after this list)
  • the number of chunks you process at once, which defaults to the number of cores in your CPU if you don't configure otherwise
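
As a rough check of how big those chunks are, you can inspect the row-group metadata of the source file with pyarrow (the path below is the one from the question; note that total_byte_size reports the size of the encoded column data in the footer, not the decoded in-memory footprint, which is typically larger):

import pyarrow.parquet as pq

pf = pq.ParquetFile('/path/to/file/file.pq')
meta = pf.metadata

print('row groups:', meta.num_row_groups)
for i in range(meta.num_row_groups):
    rg = meta.row_group(i)
    # num_rows and total_byte_size come from the parquet footer;
    # the decoded in-memory size of the chunk will usually be larger.
    print(f'row group {i}: {rg.num_rows} rows, {rg.total_byte_size / 1e6:.1f} MB')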

Note that some intermediate values are also needed during processing, so you generally want each thread of each worker to have room for a good deal more than just one chunk's worth of data.
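
If the chunks are large, one way to keep peak memory down is to limit how many of them are processed at once. A minimal sketch using the question's code and dask's standard compute=False, scheduler and num_workers options (the value 2 is just an example; scheduler='synchronous' would process one chunk at a time):

import dask.dataframe as dd

ddf = dd.read_parquet('/path/to/file/file.pq', engine='pyarrow')
ddf = ddf.rename(columns={'old_column_name': 'new_column_name'})

# Build the write task graph without running it yet.
write_task = ddf.to_parquet(
    '/path/to/file/edited/',
    engine='pyarrow',
    write_index=False,
    compute=False,
)

# Run it with only two worker threads, so at most two chunks
# (plus intermediates) are held in memory at the same time.
write_task.compute(scheduler='threads', num_workers=2)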
