Pandas dataframes too large to append to dask dataframe?
I'm not sure what I'm missing here; I thought dask would resolve my memory issues. I have 100+ pandas dataframes saved in .pickle format. I would like to get them all into the same dataframe, but I keep running into memory issues. I've already increased the memory buffer in jupyter. It seems I may be missing something in creating the dask dataframe, as it appears to crash my notebook after completely filling my RAM (maybe). Any pointers?

Below is the basic process I used:
import pandas as pd
import dask.dataframe as dd
ddf = dd.from_pandas(pd.read_pickle('first.pickle'), npartitions=8)
for pickle_file in all_pickle_files:
    ddf = ddf.append(pd.read_pickle(pickle_file))
ddf.to_parquet('alldata.parquet', engine='pyarrow')
I've tried various values for `npartitions`, but no number has allowed the code to finish running.

Have you considered first converting the `pickle` files to `parquet` and then loading them into dask?

I assume that all your data is in a folder called `raw` and you want to move it to `processed`.
import pandas as pd
import dask.dataframe as dd
import os
def convert_to_parquet(fn, fldr_in, fldr_out):
    fn_out = fn.replace(fldr_in, fldr_out)\
               .replace(".pickle", ".parquet")
    df = pd.read_pickle(fn)
    # eventually change dtypes here
    df.to_parquet(fn_out, index=False)
fldr_in = 'raw'
fldr_out = 'processed'

os.makedirs(fldr_out, exist_ok=True)

# you could use glob if you prefer
fns = os.listdir(fldr_in)
fns = [os.path.join(fldr_in, fn) for fn in fns]
If you know that no more than one file fits in memory at a time, you should use a loop:
for fn in fns:
    convert_to_parquet(fn, fldr_in, fldr_out)
If you know that several files fit in memory at once, you can use `delayed`:
from dask import delayed, compute

# this is lazy
out = [delayed(convert_to_parquet)(fn, fldr_in, fldr_out) for fn in fns]
# now you are actually converting
out = compute(out)
Now you can use dask to do your analysis.