UnicodeDecodeError in Python
I have a text file that is larger than 200 MB. I want to read it and then select the 30 most frequently used words. When I run my script, it raises an error. The code is as follows:
    import sys, string
    import re
    import codecs
    from collections import Counter
    import collections
    import unicodedata

    with open('E:\\Book\\1800.txt', "r", encoding='utf-8') as File_1800:
        for line in File_1800:
            sepFile_1800 = line.lower()
            words_1800 = re.findall('\w+', sepFile_1800)

    for wrd_1800 in [words_1800]:
        long_1800=[w for w in wrd_1800 if len(w)>3]
        common_words_1800 = dict(Counter(long_1800).most_common(30))
    print(common_words_1800)

    Traceback (most recent call last):
      File "C:\Python34\CommonWords.py", line 14, in <module>
        for line in File_1800:
      File "C:\Python34\lib\codecs.py", line 313, in decode
        (result, consumed) = self._buffer_decode(data, self.errors, final)
    UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa3 in position 3784: invalid start byte

The file does not contain UTF-8 encoded data. Find the correct encoding and update this line, replacing the 'correct_encoding' placeholder:

    with open('E:\\Book\\1800.txt', "r", encoding='correct_encoding')
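One way to narrow down the encoding is to attempt a strict decode of the whole file with a few candidates and keep the first one that succeeds. The sketch below is only illustrative: the helper name find_working_encoding and the candidate list are assumptions, not part of the original answer. A successful decode does not prove the encoding is right. In particular, latin1 accepts every byte, so it belongs at the end of the list.

```python
import os
import tempfile


def find_working_encoding(path, candidates=("utf-8", "cp1252", "latin1")):
    """Return the first candidate that decodes the whole file, or None.

    Hypothetical helper for illustration; the candidate list is a guess.
    """
    for enc in candidates:
        try:
            with open(path, "r", encoding=enc) as fh:
                # Read in chunks so a 200 MB file is never held in memory at once.
                while fh.read(1024 * 1024):
                    pass
            return enc
        except UnicodeDecodeError:
            continue
    return None


# Small demo with a throwaway file containing the byte from the traceback.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"price: \xa3 5\n")
print(find_working_encoding(tmp.name))  # prints: cp1252
os.remove(tmp.name)
```

For the file in the question, the call would look like find_working_encoding('E:\\Book\\1800.txt'). It is still worth checking a few decoded lines by eye.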
Try encoding='latin1' instead of 'utf-8'.
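To see why this suggestion makes the error go away, here is a minimal sketch using the byte from the traceback (0xa3). The sample bytes are made up for illustration. Latin-1 maps every byte to a character, so decoding can never fail, but the text will come out wrong if the file actually uses some other encoding. Passing errors="replace" is another option if losing the undecodable bytes is acceptable.

```python
raw = b"\xa3 100"  # made-up sample containing the byte from the traceback

try:
    raw.decode("utf-8")
except UnicodeDecodeError as exc:
    print(exc.reason)  # invalid start byte

# Latin-1 decodes byte 0xa3 as the pound sign.
print(raw.decode("latin1"))

# Alternative: keep UTF-8 but substitute U+FFFD for undecodable bytes.
print(raw.decode("utf-8", errors="replace"))
```

The same errors="replace" argument can be passed to open() alongside encoding.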
Also, in these lines:

    for line in File_1800:
        sepFile_1800 = line.lower()
        words_1800 = re.findall('\w+', sepFile_1800)
    for wrd_1800 in [words_1800]:
        ...

The script reassigns the matches of re.findall to the words_1800 variable on every line. So by the time you reach for wrd_1800 in [words_1800], the words_1800 variable only holds the matches from the very last line.
If you want to make minimal changes, initialize an empty list before iterating through the file:

    words_1800 = []

Then add the matches for each line to the list, rather than replacing the list:

    words_1800.extend(re.findall(r'\w+', sepFile_1800))

Then you can do the following (without the second for loop):

    long_1800 = [w for w in words_1800 if len(w) > 3]
    common_words_1800 = dict(Counter(long_1800).most_common(30))
    print(common_words_1800)
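For a file of more than 200 MB, collecting every word in one list can use a lot of memory. As an alternative to the minimal change above (not what the answer itself proposes), the counts can be updated line by line with Counter.update. The function name most_common_long_words and its parameters are assumptions made for this sketch.

```python
import re
from collections import Counter

WORD_RE = re.compile(r"\w+")


def most_common_long_words(lines, top_n=30, min_len=4):
    """Count words of at least min_len characters across an iterable of lines."""
    counts = Counter()
    for line in lines:
        # Update the running counts per line instead of storing every word.
        counts.update(
            w for w in WORD_RE.findall(line.lower()) if len(w) >= min_len
        )
    return counts.most_common(top_n)


# In-memory demo; a file object opened with the right encoding works the same way.
sample = ["The quick quick fox", "Quick brown brown"]
print(dict(most_common_long_words(sample, top_n=2)))
```

With the file from the question, this would be called as most_common_long_words(File_1800) inside the with block. Here min_len=4 matches the original len(w) > 3 filter.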