Pandas dataframes too large to append to dask dataframe?

I'm not sure what I'm missing here; I thought dask would resolve my memory issues. I have 100+ pandas dataframes saved in .pickle format. I would like to get them all into the same dataframe but keep running into memory issues. I've already increased the memory buffer in jupyter. It seems I may be missing something when creating the dask dataframe, as it appears to crash my notebook after completely filling my RAM (maybe). Any pointers?

Below is the basic process I used:

import pandas as pd
import dask.dataframe as dd

ddf = dd.from_pandas(pd.read_pickle('first.pickle'),npartitions = 8)
for pickle_file in all_pickle_files:
    ddf = ddf.append(pd.read_pickle(pickle_file))
ddf.to_parquet('alldata.parquet', engine='pyarrow')
  • I've tried a variety of npartitions, but no value has allowed the code to finish running.
  • All in all there is about 30 GB of pickled dataframes I'd like to combine.
  • Perhaps this is not the right library, but the docs suggest dask should be able to handle this (see the sketch below).
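
For reference, a lazier pattern from the dask docs is to wrap each pickle load in dask.delayed and build the dataframe with dd.from_delayed instead of appending in a loop. This is only a sketch under a couple of assumptions: every pickle holds a pandas dataframe with the same columns, and all_pickle_files is the same list used above.

import pandas as pd
import dask.dataframe as dd
from dask import delayed

# wrap each load lazily; nothing is read into memory yet
parts = [delayed(pd.read_pickle)(fn) for fn in all_pickle_files]

# build one dask dataframe from the delayed parts
# (assumes every pickle has the same columns and dtypes)
ddf = dd.from_delayed(parts)

# partitions are then loaded and written one at a time
ddf.to_parquet('alldata.parquet', engine='pyarrow')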

Have you considered first converting the pickle files to parquet and then loading them with dask? I assume that all your data is in a folder called raw and you want to move it to processed.

import pandas as pd
import dask.dataframe as dd
import os

def convert_to_parquet(fn, fldr_in, fldr_out):
    # map raw/foo.pickle -> processed/foo.parquet
    fn_out = fn.replace(fldr_in, fldr_out)\
               .replace(".pickle", ".parquet")
    df = pd.read_pickle(fn)
    # optionally adjust dtypes here before writing
    df.to_parquet(fn_out, index=False)

fldr_in = 'raw'
fldr_out = 'processed'
os.makedirs(fldr_out, exist_ok=True)

# you could use glob if you prefer
fns = os.listdir(fldr_in)
fns = [os.path.join(fldr_in, fn) for fn in fns]

If you know that no more than one file fits in memory at a time, you should use a loop:

for fn in fns:
    convert_to_parquet(fn, fldr_in, fldr_out)

If you know that several files fit in memory at once, you can use delayed:

from dask import delayed, compute

# this is lazy: nothing is converted yet
out = [delayed(convert_to_parquet)(fn, fldr_in, fldr_out) for fn in fns]
# now the conversions actually run (in parallel)
out = compute(out)

Now you can use dask to do your analysis.
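
For example, once every file has been converted, the whole processed folder can be read back as a single lazy dataframe; the paths and the pyarrow engine below simply mirror the ones used earlier:

import dask.dataframe as dd

# one lazy dataframe over every converted file
ddf = dd.read_parquet('processed/*.parquet', engine='pyarrow')

# normal pandas-style operations, evaluated lazily and out of core
print(len(ddf))  # total number of rows across all files

# or write everything back out as one partitioned dataset
ddf.to_parquet('alldata.parquet', engine='pyarrow')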
