
How to concatenate multiple pandas.DataFrames without running into MemoryError

I have three DataFrames that I'm trying to concatenate.

concat_df = pd.concat([df1, df2, df3])

This results in a MemoryError. How can I resolve this?

Note that most of the existing similar questions are about MemoryErrors occurring when reading large files. I don't have that problem. I have already read my files into DataFrames. I just can't concatenate that data.

The problem, as noted in the other answers, is one of memory. One solution is to store the data on disk, then build a single dataframe from it.

With such huge data, performance is an issue.

csv solutions are very slow, since conversion to text mode occurs. HDF5 solutions are shorter, more elegant and faster, since they use binary mode. I propose a third binary-mode approach, with pickle, which seems to be even faster, but is more technical and needs a bit more room. And a fourth, by hand.

Here is the code:

import os
import pickle

import numpy as np
import pandas as pd

# a DataFrame factory:
dfs=[]
for i in range(10):
    dfs.append(pd.DataFrame(np.empty((10**5,4)),columns=range(4)))

# a csv solution
def bycsv(dfs):
    md,hd='w',True
    for df in dfs:
        df.to_csv('df_all.csv',mode=md,header=hd,index=None)
        md,hd='a',False
    #del dfs
    df_all=pd.read_csv('df_all.csv',index_col=None)
    os.remove('df_all.csv') 
    return df_all    

Better solutions:

def byHDF(dfs):
    store=pd.HDFStore('df_all.h5')
    for df in dfs:
        store.append('df',df,data_columns=list(range(4)))
    #del dfs
    df=store.select('df')
    store.close()
    os.remove('df_all.h5')
    return df

def bypickle(dfs):
    c=[]
    with open('df_all.pkl','ab') as f:
        for df in dfs:
            pickle.dump(df,f)
            c.append(len(df))    
    #del dfs
    with open('df_all.pkl','rb') as f:
        df_all=pickle.load(f)
        offset=len(df_all)
        # pre-allocate room for the remaining rows, to be filled in place below
        df_all=pd.concat([df_all,pd.DataFrame(np.empty((sum(c[1:]),4)))],ignore_index=True)

        for size in c[1:]:
            df=pickle.load(f)
            df_all.iloc[offset:offset+size]=df.values 
            offset+=size
    os.remove('df_all.pkl')
    return df_all

For homogeneous dataframes, we can do even better:

def byhand(dfs):
    mtot=0
    with open('df_all.bin','wb') as f:
        for df in dfs:
            m,n =df.shape
            mtot += m
            f.write(df.values.tobytes())
            typ=df.values.dtype                
    #del dfs
    with open('df_all.bin','rb') as f:
        buffer=f.read()
        data=np.frombuffer(buffer,dtype=typ).reshape(mtot,n)
        df_all=pd.DataFrame(data=data,columns=list(range(n))) 
    os.remove('df_all.bin')
    return df_all

And some tests on small (32 MB) data to compare performance. Multiply the times by about 128 for 4 GB of data.

In [92]: %time w=bycsv(dfs)
Wall time: 8.06 s

In [93]: %time x=byHDF(dfs)
Wall time: 547 ms

In [94]: %time v=bypickle(dfs)
Wall time: 219 ms

In [95]: %time y=byhand(dfs)
Wall time: 109 ms

A check:

In [195]: (x.values==w.values).all()
Out[195]: True

In [196]: (x.values==v.values).all()
Out[196]: True

In [197]: (x.values==y.values).all()
Out[197]: True

Of course, all of that must be improved and tuned to fit your problem.

For example, df3 can be split into chunks of size 'total_memory_size - df_total_size' to be able to run bypickle.
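A rough sketch of that splitting step, assuming df1, df2 and df3 are in memory and bypickle is defined as above; n_chunks is a hypothetical value you would derive from the available memory:

n_chunks = 8  # hypothetical: choose it so each piece fits in the remaining memory
step = -(-len(df3) // n_chunks)  # ceiling division: rows per chunk
df3_parts = [df3.iloc[i:i + step] for i in range(0, len(df3), step)]
df_all = bypickle([df1, df2] + df3_parts)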

If you want, I can edit this if you give more information on your data structure and size. Beautiful question!

I advise you to put your dataframes into a single csv file by concatenation, and then to read that csv file back.

Execute this:

# write df1 content in file.csv
df1.to_csv('file.csv', index=False)
# append df2 content to file.csv
df2.to_csv('file.csv', mode='a', columns=False, index=False)
# append df3 content to file.csv
df3.to_csv('file.csv', mode='a', columns=False, index=False)

# free memory
del df1, df2, df3

# read all df1, df2, df3 contents
df = pd.read_csv('file.csv')

If this solution isn't performant enough, because you need to concatenate files larger than usual, do:

df1.to_csv('file.csv', index=False)
df2.to_csv('file1.csv', index=False)
df3.to_csv('file2.csv', index=False)

del df1, df2, df3

Then run the bash commands:

cat file1.csv >> file.csv
cat file2.csv >> file.csv

Or concatenate the csv files in Python:

def concat(file1, file2):
    # append the raw contents of file2 to file1 (same effect as `cat file2 >> file1`)
    with open(file2, 'r') as f2:
        data = f2.read()
    with open(file1, 'a') as f1:
        f1.write(data)

concat('file.csv', 'file1.csv')
concat('file.csv', 'file2.csv')

Then read the result:

df = pd.read_csv('file.csv')

Kinda taking a guess here, but maybe:

df1 = pd.concat([df1,df2])
del df2
df1 = pd.concat([df1,df3])
del df3

Obviously, you could do that as a loop, but the key is that you want to delete df2, df3, etc. as you go. As you are doing it in the question, you never clear out the old dataframes, so you are using about twice as much memory as you need to.
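A sketch of that loop, assuming df1, df2 and df3 from the question are already in memory:

import pandas as pd

frames = [df1, df2, df3]   # the list now holds the only remaining references
del df1, df2, df3          # drop the standalone names

concat_df = frames.pop(0)
while frames:
    # merge one frame at a time; the popped frame can be freed right after the concat
    concat_df = pd.concat([concat_df, frames.pop(0)])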

More generally, if you are reading and concatenating, I'd do it something like this (if you had 3 CSVs: foo0, foo1, foo2):

concat_df = pd.DataFrame()
for i in range(3):
    temp_df = pd.read_csv('foo'+str(i)+'.csv')
    concat_df = pd.concat( [concat_df, temp_df] )

In other words, as you are reading in files, you only keep the small dataframes in memory temporarily, until you concatenate them into the combined df, concat_df. As you currently do it, you are keeping around all the smaller dataframes, even after concatenating them.

Similar to what @glegoux suggests, pd.DataFrame.to_csv can also write in append mode, so you can do something like:

df1.to_csv(filename)
df2.to_csv(filename, mode='a', columns=False)
df3.to_csv(filename, mode='a', columns=False)

del df1, df2, df3
df_concat = pd.read_csv(filename)

Dask might be a good option to try for handling large dataframes - have a look at the Dask Docs.
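For instance, a minimal sketch with dask.dataframe, assuming the frames live in CSV files named foo0.csv, foo1.csv and foo2.csv (hypothetical names):

import dask.dataframe as dd

# lazily read all matching CSV files; nothing is pulled into RAM yet
ddf = dd.read_csv('foo*.csv')

# keep working on the Dask dataframe out of core, or, if the combined result
# actually fits in memory, materialize it as a single pandas DataFrame
concat_df = ddf.compute()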

You can store your individual dataframes in an HDF Store, and then call the store just like one big dataframe.

import os
import pandas as pd

# name of store
fname = 'my_store'

with pd.HDFStore(fname) as store:

    # save individual dfs to store
    for df in [df1, df2, df3, df_foo]:
        store.append('df', df, data_columns=['FOO','BAR','ETC']) # data_columns = identify the columns in the dfs you are appending

    # access the store as a single df
    df = store.select('df', where = ['FOO>2'])  # change the where condition as required (see the documentation for examples)
    # Do other stuff with df #

# the store is closed automatically at the end of the with block;
# remove the store file when you're done
os.remove(fname)
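If the combined result is itself too large to load at once, the store can also be read back in manageable chunks (run this before the os.remove above; a sketch using the same fname and key):

# iterate over the stored rows in pieces instead of materializing one big DataFrame
with pd.HDFStore(fname) as store:
    for chunk in store.select('df', chunksize=500000):
        print(len(chunk))  # placeholder: process each chunk here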

I'm grateful to the community for their answers. However, in my case, I found out that the problem was actually due to the fact that I was using 32-bit Python.

There are memory limits defined for Windows 32- and 64-bit OSes. For a 32-bit process, it is only 2 GB. So even if your RAM has more than 2 GB, and even if you're running a 64-bit OS, if you are running a 32-bit process, then that process will be limited to just 2 GB of RAM - in my case that process was Python.
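A quick way to check which flavor of interpreter you are running, using only the standard library:

import struct
import sys

print(struct.calcsize("P") * 8)  # 32 on a 32-bit interpreter, 64 on a 64-bit one
print(sys.maxsize > 2**32)       # True only on a 64-bit interpreter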

I upgraded to 64-bit Python, and haven't had a memory error since then!

Other relevant questions are: Python 32-bit memory limits on 64bit windows, Should I use Python 32bit or Python 64bit, Why is this numpy array too big to load?

Another option:

1) Write df1 to a .csv file: df1.to_csv('Big File.csv')

2) Open the .csv file, then append df2:

with open('Big File.csv','a') as f:
    df2.to_csv(f, header=False)

3) Repeat step 2 with df3:

with open('Big File.csv','a') as f:
    df3.to_csv(f, header=False)

I've had similar performance issues while trying to concatenate a large number of DataFrames to a 'growing' DataFrame.

My workaround was appending all sub DataFrames to a list, and then concatenating the list of DataFrames once processing of the sub DataFrames was complete. This brought the runtime down to almost half.
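A sketch of that pattern; the ten random frames here are just stand-ins for the real sub DataFrames:

import numpy as np
import pandas as pd

# stand-ins for the real sub DataFrames produced during processing
sub_frames = [pd.DataFrame(np.random.rand(10**5, 4)) for _ in range(10)]

# collect the pieces in a plain Python list as they are produced...
parts = []
for df in sub_frames:
    parts.append(df)

# ...and concatenate once at the end, instead of growing a DataFrame inside the loop
result = pd.concat(parts, ignore_index=True)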

While writing to hard disk, df.to_csv throws an error for columns=False.

The solution below works fine:

# write train1 to hard disk as file.csv
train1.to_csv('file.csv', index=False)
# append train2 to file.csv
train2.to_csv('file.csv', mode='a', header=False, index=False)
# read the appended csv back as a single df
train = pd.read_csv('file.csv')
