MemoryError in Python
I have a text file whose size is 300 MB. I want to read it and then print the 50 most frequently used words. When I run the program it gives me a MemoryError. My code is as under:
import sys, string
import codecs
import re
from collections import Counter
import collections
import itertools
import csv
import re
import unicodedata

words_1800 = []
with open('E:\\Book\\1800.txt', "r", encoding='ISO-8859-1') as File_1800:
    for line in File_1800:
        sepFile_1800 = line.lower()
        words_1800.extend(re.findall('\w+', sepFile_1800))

for wrd_1800 in [words_1800]:
    long_1800 = [w for w in words_1800 if len(w) > 3]
    common_words_1800 = dict(Counter(long_1800).most_common(50))

print(common_words_1800)
It gives me the following error:
Traceback (most recent call last):
  File "C:\Python34\CommonWords.py", line 17, in <module>
    words_1800.extend(re.findall('\w+', sepFile_1800))
MemoryError
You can use a generator instead of a list to store the result of re.findall, which is much more memory-efficient; you can also use re.finditer instead of findall, which returns an iterator.
with open('E:\\Book\\1800.txt', "r", encoding='ISO-8859-1') as File_1800:
    words_1800 = (re.findall('\w+', line.lower()) for line in File_1800)
Then words_1800 will be an iterator yielding lists of the found words, or use
with open('E:\\Book\\1800.txt', "r", encoding='ISO-8859-1') as File_1800:
    words_1800 = (re.finditer('\w+', line.lower()) for line in File_1800)
to get an iterator of iterators.
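Note that either generator has to be consumed while the file is still open. As a minimal sketch (combining the generator approach with the Counter already used in the question), the counting could be done inside the with block like this:

import re
from collections import Counter

common_words_1800 = Counter()
with open('E:\\Book\\1800.txt', "r", encoding='ISO-8859-1') as File_1800:
    # Lazily yield the words of each line instead of building one huge list.
    words_1800 = (w for line in File_1800 for w in re.findall('\w+', line.lower()))
    # Consume the generator while the file is still open.
    for w in words_1800:
        if len(w) > 3:
            common_words_1800[w] += 1
print(common_words_1800.most_common(50))

This way only one line of the file and the counter itself are held in memory at any time.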
You can use the Counter up front, saving you the memory used by the intermediate lists (especially words_1800, which is as big as the file you're reading):
common_words_1800 = Counter()
with open('E:\\Book\\1800.txt', "r", encoding='ISO-8859-1') as File_1800:
    for line in File_1800:
        for match in re.finditer(r'\w+', line.lower()):
            word = match.group()
            if len(word) > 3:
                common_words_1800[word] += 1
print(common_words_1800.most_common(50))
If your file contains ASCII you don't need a regex; you can split the words and rstrip the punctuation, creating your Counter with a generator expression:
from string import punctuation
from collections import Counter

with open('E:\\Book\\1800.txt') as f:
    cn = Counter(wrd for line in f for wrd in (w.rstrip(punctuation)
                 for w in line.lower().split()) if len(wrd) > 3)
    print(cn.most_common(50))
If you are using a regex, you should compile it first, and you can use it with a generator expression as well:
from collections import Counter
import re

with open('E:\\Book\\1800.txt') as f:
    r = re.compile("\w+")
    cn = Counter(wrd for line in f
                 for wrd in r.findall(line) if len(wrd) > 3)
    print(cn.most_common(50))
Your code works, but it looks a little memory-inefficient. If your file is 300 MB, there can be a lot of words to process. Try the suggestions given by @Kasramvd; it seems to be a good idea to use iterators instead of full lists.
In addition, here is a fine blog post about checking memory usage and profiling applications in Python: Python - memory usage.
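If you want to measure how much memory the counting actually uses, a minimal sketch with the standard-library tracemalloc module (available since Python 3.4, so it matches the setup in the question) might look like this; the file path is the one from the question:

import re
import tracemalloc
from collections import Counter

tracemalloc.start()  # begin tracking memory allocations

common_words_1800 = Counter()
with open('E:\\Book\\1800.txt', "r", encoding='ISO-8859-1') as File_1800:
    for line in File_1800:
        for word in re.findall(r'\w+', line.lower()):
            if len(word) > 3:
                common_words_1800[word] += 1

current, peak = tracemalloc.get_traced_memory()
print("current: %.1f MB, peak: %.1f MB" % (current / 1e6, peak / 1e6))
tracemalloc.stop()

Comparing the peak figure between the list-based version and the iterator-based versions should make the difference visible.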