Using dask.DataFrame.to_parquet() to write large file
I have a .pq file (about 2 GB) in which I want to change a column name using dask.

I have no problem reading the file into a dask DataFrame, and I'm also able to rename the columns. But when it comes to writing the .pq file back to disk with ddf.to_parquet(), the job fails: dask seems to try to fit the whole thing in memory (and it doesn't fit).

Why does this happen? I expected dask to do this iteratively. How can I write the target file in chunks?

Below is the code I'm using.
import dask.dataframe as dd

ddf = dd.read_parquet(
    '/path/to/file/file.pq',
    engine='pyarrow'
)

ddf = ddf.rename(columns={'old_column_name': 'new_column_name'})

# the step which fails
ddf.to_parquet(
    '/path/to/file/edited/',
    engine='pyarrow',
    write_index=False
)
Thanks in advance!
Dask does indeed load your data in chunks, and writes it to the output in chunks. The total memory usage will depend on the size of each chunk and on how many chunks are being processed at once (typically one per worker thread).

Note that some intermediate values are also needed during processing, so you generally want each thread of each worker to be able to fit a good deal more than just one chunk's worth of data in memory.