MemoryError in Python
I have a text file whose size is 300 MB. I want to read it and then print the 50 most frequently used words. When I run the program it gives me a MemoryError. My code is as under:
import sys, string
import codecs
import re
from collections import Counter
import collections
import itertools
import csv
import re
import unicodedata

words_1800 = []
with open('E:\\Book\\1800.txt', "r", encoding='ISO-8859-1') as File_1800:
    for line in File_1800:
        sepFile_1800 = line.lower()
        words_1800.extend(re.findall('\w+', sepFile_1800))

for wrd_1800 in [words_1800]:
    long_1800 = [w for w in words_1800 if len(w) > 3]
    common_words_1800 = dict(Counter(long_1800).most_common(50))

print(common_words_1800)
It gives me the following error:
Traceback (most recent call last):
  File "C:\Python34\CommonWords.py", line 17, in <module>
    words_1800.extend(re.findall('\w+', sepFile_1800))
MemoryError
You can use a generator instead of a list to store the result of re.findall, which is much more memory-efficient; you can also use re.finditer instead of findall, which returns an iterator.
with open('E:\\Book\\1800.txt', "r", encoding='ISO-8859-1') as File_1800:
    words_1800 = (re.findall('\w+', line.lower()) for line in File_1800)
Then words_1800 will be an iterator yielding lists of the found words, or use
with open('E:\\Book\\1800.txt', "r", encoding='ISO-8859-1') as File_1800:
    words_1800 = (re.finditer('\w+', line.lower()) for line in File_1800)
to get an iterator of iterators.
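Note that either generator has to be consumed while the file is still open. As a minimal sketch (combining the generator approach with the Counter already used in the question), the counting could be done inside the with block like this:

import re
from collections import Counter

common_words_1800 = Counter()
with open('E:\\Book\\1800.txt', "r", encoding='ISO-8859-1') as File_1800:
    # Lazily yield the words of each line instead of building one huge list.
    words_1800 = (w for line in File_1800 for w in re.findall('\w+', line.lower()))
    # Consume the generator while the file is still open.
    for w in words_1800:
        if len(w) > 3:
            common_words_1800[w] += 1
print(common_words_1800.most_common(50))

This way only one line of the file and the counter itself are held in memory at any time.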
You can use the Counter up front, saving you the memory used by the intermediate lists (especially words_1800, which is as big as the file you're reading):
common_words_1800 = Counter()
with open('E:\\Book\\1800.txt', "r", encoding='ISO-8859-1') as File_1800:
    for line in File_1800:
        for match in re.finditer(r'\w+', line.lower()):
            word = match.group()
            if len(word) > 3:
                common_words_1800[word] += 1
print(common_words_1800.most_common(50))
If your file contains ASCII you don't need a regex; you can split the words and rstrip the punctuation, creating your Counter with a generator expression:
from string import punctuation
from collections import Counter

with open('E:\\Book\\1800.txt') as f:
    cn = Counter(wrd for line in f for wrd in (w.rstrip(punctuation)
                 for w in line.lower().split()) if len(wrd) > 3)
    print(cn.most_common(50))
If you are using a regex, you should compile it first, and you can use it with a generator expression as well:
from collections import Counter
import re

with open('E:\\Book\\1800.txt') as f:
    r = re.compile("\w+")
    cn = Counter(wrd for line in f
                 for wrd in r.findall(line) if len(wrd) > 3)
    print(cn.most_common(50))
Your code works, but it looks a little memory-inefficient. If your file is 300 MB, there can be a lot of words to process. Try the suggestions given by @Kasramvd; it seems to be a good idea to use iterators instead of full lists.
In addition, here is a fine blog post about checking memory usage and profiling applications in Python: Python - memory usage.
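If you want to measure how much memory the counting actually uses, a minimal sketch with the standard-library tracemalloc module (available since Python 3.4, so it matches the setup in the question) might look like this; the file path is the one from the question:

import re
import tracemalloc
from collections import Counter

tracemalloc.start()  # begin tracking memory allocations

common_words_1800 = Counter()
with open('E:\\Book\\1800.txt', "r", encoding='ISO-8859-1') as File_1800:
    for line in File_1800:
        for word in re.findall(r'\w+', line.lower()):
            if len(word) > 3:
                common_words_1800[word] += 1

current, peak = tracemalloc.get_traced_memory()
print("current: %.1f MB, peak: %.1f MB" % (current / 1e6, peak / 1e6))
tracemalloc.stop()

Comparing the peak figure between the list-based version and the iterator-based versions should make the difference visible.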