How can I process a very large file (13 GB) in Python without crashing?

Question

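I have to process this very large file on a server (not my own computer). It runs 64-bit Python and has 24 GB of RAM. The file itself is about 13 GB and contains 27 million rows of data. Since the server has generous specs, I tried loading the whole file into pandas, but it crashed. I tried dask, but it was still slow. So I split the file into chunks, as shown below.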

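My code is similar to the following. I load the file in chunks of 100,000 rows each. Each chunk is processed and then appended to an existing file. I assumed that processing in chunks would keep the data out of RAM, but it seems it is still being held there. The first few hundred iterations run fine, but at some point after about 8 GB of data has been processed, it crashes.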

chunksize= 100000
c = 0
for chunk in pd.read_csv(fname, chunksize=chunksize,sep='|',error_bad_lines=False):

    chunk['col1'] = chunk['col1'].apply(process1)
    chunk['col2'] = chunk['col2'].apply(process2)

    if c == 0:
        chunk.to_csv("result/result.csv", index=False)
    else:
        chunk.to_csv('result/result.csv', mode='a', header=False, index=False)

    if c%10==0:
        print(c)
        
    c+=1

Usually, after about 160 iterations, with result.csv at roughly 8 GB, the program just stops with a MemoryError.

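To be honest, I don't have access to many things on this server, so if you suggest changing a setting, I may not be able to do it. But let's see what I can do. Thanks in advance.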

Edit: here is the kind of thing process1 and process2 do.

def process1(name):
    if type(name)==str:
        new_name = name[:3]+'*' * len(name[:-3])
    else:
        return name
    
    return new_name

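def process2(number):
    # NaN never compares equal to anything, so `number != np.nan` is always True;
    # pd.isna() is the reliable missing-value check.
    if not pd.isna(number):
        new_number = str(number)
        new_number = '*'*len(new_number)
        return new_number
    else:
        return number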

Answer 1

The general syntax of a for loop is

for target in expression:
    do all the things

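Python evaluates the expression to an object, and only once that is complete does it assign the object to the target variable. This means that whatever object is already bound to target is not deleted until its replacement has been built.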
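The rebinding order can be observed with a small experiment. This is only a sketch: `Chunk`, `make_chunks`, and `run` are hypothetical stand-ins (not pandas), and the counts rely on CPython's reference counting. Before each new object is built, the generator records how many earlier ones are still alive; the `delete_early` flag unbinds the loop variable before the iterator is asked for the next item.

```python
import weakref


class Chunk:
    """Hypothetical stand-in for a large DataFrame chunk."""

    def __init__(self, idx):
        self.idx = idx


def _build(idx, refs):
    # Create a chunk and remember it only through a weak reference,
    # so this bookkeeping never keeps a chunk alive by itself.
    obj = Chunk(idx)
    refs.append(weakref.ref(obj))
    return obj


def make_chunks(n, live_counts):
    """Yield n chunks; before building each, record how many older ones are still alive."""
    refs = []
    for idx in range(n):
        live_counts.append(sum(r() is not None for r in refs))
        yield _build(idx, refs)


def run(delete_early):
    live_counts = []
    for chunk in make_chunks(4, live_counts):
        _ = chunk.idx  # pretend to do some work with the chunk
        if delete_early:
            del chunk  # unbind before the iterator is asked for the next item
    return live_counts


print(run(delete_early=False))
print(run(delete_early=True))
```

On CPython the first call should report one surviving older object from the second iteration onward, and the second call should report none.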

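Unless the objects being created are large, this is no big deal. Here, they are. While a new chunk is being created, the chunk that is about to be discarded is still in memory, effectively doubling the memory footprint. The fix is to manually delete the target inside the loop before going back for the next one.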

chunksize = 100000
c = 0
# Note: error_bad_lines was removed in pandas 2.0; on_bad_lines='skip' replaces it there.
for chunk in pd.read_csv(fname, chunksize=chunksize,sep='|',error_bad_lines=False):

    chunk['col1'] = chunk['col1'].apply(process1)
    chunk['col2'] = chunk['col2'].apply(process2)

    if c == 0:
        chunk.to_csv("result/result.csv", index=False)
    else:
        chunk.to_csv('result/result.csv', mode='a', header=False, index=False)
    del chunk # destroy dataframe before next loop to conserve memory.
    if c%10==0:
        print(c)
    c+=1
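
For anyone on a newer pandas, here is a self-contained sketch of the same loop. It assumes pandas 1.3 or later, where `on_bad_lines="skip"` takes the place of `error_bad_lines=False`. The function name `mask_file` and its parameters are my own naming, not from the question, and `process2` uses `pd.isna` because a comparison against `np.nan` is always true.

```python
import pandas as pd


def process1(name):
    # Keep the first three characters of a string and mask the rest.
    if isinstance(name, str):
        return name[:3] + "*" * len(name[3:])
    return name


def process2(number):
    # Leave missing values alone; mask every character of anything else.
    if pd.isna(number):
        return number
    return "*" * len(str(number))


def mask_file(src, dst, chunksize=100_000):
    """Hypothetical helper: stream src in chunks and write masked rows to dst."""
    reader = pd.read_csv(src, chunksize=chunksize, sep="|", on_bad_lines="skip")
    first = True
    for chunk in reader:
        chunk["col1"] = chunk["col1"].apply(process1)
        chunk["col2"] = chunk["col2"].apply(process2)
        # Write the header once, then append the remaining chunks.
        chunk.to_csv(dst, mode="w" if first else "a", header=first, index=False)
        first = False
        del chunk  # drop the reference before the next chunk is read
```

A call would look like `mask_file(fname, "result/result.csv")`, with `fname` being the pipe-separated input file from the question.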

Solution 1
3, accepted, 2020-10-22 04:33:35
