Memory error when processing files in Python

I intend to read a file, which is about 500MB in total, into a dict according to the key in each line. The code snippet is as follows:

f2 = open("ENST-NM-chr-name.txt", "r")   # small amount
lines = [l.strip() for l in f2.readlines() if l.strip()]
sample = dict([(l.split("\t")[2].strip("\""), l) for l in lines])    ## convert [(1,2), (3,4)] to {1:2, 3:4}

When running on a machine with 4GB of memory, Python complains with a MemoryError. If I change the expression assigned to the sample variable to [l for l in lines], it works fine.
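One way to see where the memory goes (a quick sketch, not in the original post) is to measure peak allocation with the standard-library tracemalloc module around the snippet above:

import tracemalloc

tracemalloc.start()
lines = [l.strip() for l in open("ENST-NM-chr-name.txt") if l.strip()]
sample = dict([(l.split("\t")[2].strip("\""), l) for l in lines])
current, peak = tracemalloc.get_traced_memory()   # bytes allocated now / peak bytes
print("peak: %.1f MiB" % (peak / 1024 ** 2))
tracemalloc.stop()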

At first, I thought it was the split method that was consuming lots of memory, so I adjusted my code to this:

def find_nth(haystack, needle, n):
    # return the index of the n-th occurrence of needle in haystack, or -1
    start = haystack.find(needle)
    while start >= 0 and n > 1:
        start = haystack.find(needle, start + len(needle))
        n -= 1
    return start

...

sample = dict([(l[find_nth(l, "\t", 4):].strip(), l) for l in lines])
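For illustration (the line format here is my assumption, not from the post), find_nth returns the index of the n-th occurrence, so the slice keeps everything from the fourth tab onward; note that this makes the key the fifth field, whereas split("\t")[2] above used the third:

row = "geneA\tENST0001\tNM_0001\tchr1\tname1"   # hypothetical line format
find_nth(row, "\t", 4)                 # -> 27, index of the fourth tab
row[find_nth(row, "\t", 4):].strip()   # -> "name1", the fifth field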

But the result turned out the same.

A new discovery is that it runs normally without OOM if I remove the dict() conversion, regardless of the rest of the code logic.

Could anyone give me some ideas about this problem?

You're creating a list containing every line, which will continue to exist until lines goes out of scope; then you create another big list of entirely different strings based on it, and then a dict on top of that, before any of it can be freed. Just build the dict in one step.

with open("ENST-NM-chr-name.txt") as f:
    sample = {}

    for l in f:
        l = l.strip()

        if l:
            sample[l.split("\t")[2].strip('"')] = l

You can achieve about the same effect by using a generator expression instead of a list comprehension, but it feels nicer (to me) not to strip twice.
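For reference, a minimal sketch (my wording, not the answerer's code) of the generator-expression variant alluded to here; notice that l.strip() ends up being evaluated several times per line, which is exactly the duplication being avoided:

with open("ENST-NM-chr-name.txt") as f:
    sample = dict(
        (l.strip().split("\t")[2].strip('"'), l.strip())   # strip runs for key and value
        for l in f
        if l.strip()                                       # and once more for the filter
    )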

What if you turn your list into a generator, and your dict into a lovely dictionary comprehension:

f2 = open("ENST-NM-chr-name.txt", "r")   # small amount
lines = (l.strip() for l in f2 if l.strip())
sample = {line.split('\t')[2].strip('\"'): line for line in lines}

(Line 2 above was mistakenly written as lines = (l.strip() for l in f2.readlines() if l.strip()).)
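That correction matters: f2.readlines() materializes the whole file as a list before the generator even starts, while iterating the file object yields one line at a time. A minimal sketch of the difference (same file name as above):

with open("ENST-NM-chr-name.txt") as f:
    lazy = (l.strip() for l in f)                 # streams: one line in memory at a time
    # eager = (l.strip() for l in f.readlines())  # would load every line up front
    sample = {l.split("\t")[2].strip('"'): l for l in lazy if l}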

Do a generator and a dict comprehension perhaps (somehow) alleviate the memory requirements?
