Memory error when processing a file in Python

Question

I intend to read a file of about 500MB in total into a dict, keyed on a field taken from each line. The code snippet is as follows:

f2 = open("ENST-NM-chr-name.txt", "r")   # small amount
lines = [l.strip() for l in f2.readlines() if l.strip()]
sample = dict([(l.split("\t")[2].strip("\""), l) for l in lines])    ## convert [(1,2), (3,4)] to {1:2, 3:4}

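When I run it on a machine with 4GB of memory, Python raises a MemoryError. If I change the expression assigned to sample to [l for l in lines], it works fine.

At first I thought this was because the split method was consuming a lot of memory, so I changed the code to: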

def find_nth(haystack, needle, n):
    start = haystack.find(needle)
    while start >= 0 and n > 1:
        start = haystack.find(needle, start+len(needle))
        n -= 1
    return start

...

sample = dict([(l[find_nth(l, "\t", 4):].strip(), l) for l in lines])

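But the result turned out to be the same.

A new discovery is that if I remove the dict() conversion, the code runs without running out of memory, regardless of the rest of the logic.

Could anyone give me some ideas about this problem?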

Answer 1

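You are creating a list containing every line, which stays alive until lines goes out of scope. You then create another big list of entirely different strings based on it, and then a dict from that, before any of it can be released from memory. Just build the dict in one step.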

with open("ENST-NM-chr-name.txt") as f:
    sample = {}

    for l in f:
        l = l.strip()

        if l:
            sample[l.split("\t")[2].strip('"')] = l

You can achieve roughly the same effect by using a generator expression instead of a list comprehension, but (to me) it feels nicer not to strip twice.
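For comparison, here is a minimal sketch of that generator-expression variant, written so that each line is still stripped only once. It assumes the same tab-separated layout with the quoted key in the third column. The function name build_index and the tiny sample file are made up for illustration.

```python
import os
import tempfile


def build_index(path):
    """Map the third tab-separated field (quotes removed) to its whole line."""
    with open(path) as f:
        # Lazy: each line is stripped exactly once, as it is read.
        stripped = (line.strip() for line in f)
        # Blank lines are skipped; no intermediate list is ever built.
        return {s.split("\t")[2].strip('"'): s for s in stripped if s}


# Tiny made-up file to exercise the function (not the real data).
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write('a\tb\t"k1"\td\n\n  \ne\tf\t"k2"\th\n')

index = build_index(tmp.name)
os.remove(tmp.name)
print(index)
```

The dict comprehension consumes the generator while the file is still open, so only the resulting dict outlives the with block.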

Answer 2

What if you turn your list into a generator, and your dict into a lovely dict comprehension:

f2 = open("ENST-NM-chr-name.txt", "r")   # small amount
lines = (l.strip() for l in f2 if l.strip())
sample = {line.split('\t')[2].strip('\"'): line for line in lines}

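Line 2 above was originally, by mistake, lines = (l.strip() for l in f2.readlines() if l.strip()); it now iterates over f2 directly, since readlines() would still read the whole file into a list.

Perhaps a generator and a dict comprehension will (somehow) reduce the memory requirements?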
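One way to check this on a small scale is to measure peak allocations with the standard-library tracemalloc module. The sketch below is only an illustration under assumptions: the helper names (load_with_lists, load_streaming, peak_bytes) and the 20,000-row sample file are invented here, and the numbers it prints will differ from what the real 500MB file would show.

```python
import os
import tempfile
import tracemalloc


def load_with_lists(path):
    # Same shape as the code in the question: full lists first, then the dict.
    with open(path) as f:
        lines = [l.strip() for l in f.readlines() if l.strip()]
    return dict([(l.split("\t")[2].strip('"'), l) for l in lines])


def load_streaming(path):
    # Generator plus dict comprehension: only the dict itself accumulates.
    with open(path) as f:
        lines = (l.strip() for l in f if l.strip())
        return {l.split("\t")[2].strip('"'): l for l in lines}


def peak_bytes(loader, path):
    """Return (result, peak traced allocation in bytes) for one loader call."""
    tracemalloc.start()
    result = loader(path)
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return result, peak


# Made-up sample data: 20,000 rows with a quoted key in the third column.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    for i in range(20000):
        tmp.write(f'ENST{i}\tNM_{i}\t"key{i}"\tchr1\tname{i}\n')

by_lists, peak_lists = peak_bytes(load_with_lists, tmp.name)
by_stream, peak_stream = peak_bytes(load_streaming, tmp.name)
os.remove(tmp.name)

print(f"lists: {peak_lists} bytes, streaming: {peak_stream} bytes")
```

Both loaders return the same dict. The streaming version only avoids the intermediate lists and tuples, so the final dict still has to hold every line of the file.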
