UnicodeDecodeError in Python

Question

I have a text file larger than 200 MB. I want to read it and select the 30 most frequently used words. When I run my script, it raises an error. The code is as follows:

    import sys, string
    import codecs
    from collections import Counter
    import collections
    import unicodedata
    with open('E:\\Book\\1800.txt', "r", encoding='utf-8') as File_1800:
        for line in File_1800:
            sepFile_1800 = line.lower()
            words_1800 = re.findall('\w+', sepFile_1800)
        for wrd_1800 in [words_1800]:
            long_1800=[w for w in wrd_1800 if len(w)>3]
            common_words_1800 = dict(Counter(long_1800).most_common(30))
        print(common_words_1800)


    Traceback (most recent call last):
      File "C:\Python34\CommonWords.py", line 14, in <module>
        for line in File_1800:
      File "C:\Python34\lib\codecs.py", line 313, in decode
        (result, consumed) = self._buffer_decode(data, self.errors, final)
    UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa3 in position 3784: invalid start byte

Answer 1

The file does not contain UTF-8 encoded data. Find the correct encoding and update the line: with open('E:\\Book\\1800.txt', "r", encoding='correct_encoding')
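
One way to narrow down the encoding is to try a short list of candidates on the raw bytes. The sketch below is only illustrative: the helper name first_working_encoding and the candidate list are assumptions, not part of the original answer. A successful decode does not prove the guess is right, since latin-1 accepts any byte sequence, which is why it comes last.

```python
def first_working_encoding(raw_bytes, candidates=("utf-8", "cp1252", "latin-1")):
    """Return the first candidate that decodes raw_bytes without error.

    Hypothetical helper: the name and the candidate list are assumptions.
    """
    for name in candidates:
        try:
            raw_bytes.decode(name)
        except UnicodeDecodeError:
            continue  # this candidate cannot represent the bytes, try the next
        return name
    return None


# 0xa3 is the byte reported in the traceback; it is not a valid UTF-8 start byte
sample = b"price: \xa3 5"
print(first_working_encoding(sample))  # cp1252
```

For the real file, the bytes could come from opening it in binary mode ("rb") and reading a chunk. A chunk boundary may split a multi-byte character, so a failure at the very end of the chunk is not conclusive.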

Answer 2

Try encoding='latin1' instead of utf-8.
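
A small sketch of why this avoids the exception, using an in-memory byte string rather than the real file (the sample bytes are made up for illustration). The errors="replace" variant is an alternative not mentioned in the answer; it keeps utf-8 but substitutes U+FFFD for bytes that cannot be decoded.

```python
# Made-up sample containing the byte 0xa3 reported in the traceback
bad_bytes = b"cost \xa3 10"

# latin1 maps every byte 0x00-0xff to a character, so decoding cannot fail
as_latin1 = bad_bytes.decode("latin1")

# Alternative (an assumption, not from the answer): stay with utf-8 and
# replace each undecodable byte with the replacement character U+FFFD
as_replaced = bad_bytes.decode("utf-8", errors="replace")

# ascii() escapes non-ASCII characters so the output is safe on any console
print(ascii(as_latin1))    # 'cost \xa3 10'
print(ascii(as_replaced))  # 'cost \ufffd 10'
```

Decoding without an error is not the same as decoding correctly: if the file actually uses some other single-byte encoding, latin1 will silently produce the wrong characters.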

Also, in these lines:

    for line in File_1800:
        sepFile_1800 = line.lower()
        words_1800 = re.findall('\w+', sepFile_1800)
    for wrd_1800 in [words_1800]:
        ...

The script reassigns the matches of re.findall to the words_1800 variable for every line. So when you get to for wrd_1800 in [words_1800], the words_1800 variable only holds matches from the very last line.
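
The effect can be reproduced with a two-line stand-in for the file (the sample lines and the short variable name are made up for illustration):

```python
import re

# Stand-in for the lines of the file
lines = ["alpha beta\n", "gamma delta\n"]

for line in lines:
    # Each pass rebinds words, discarding the matches from the previous line
    words = re.findall(r"\w+", line.lower())

print(words)  # ['gamma', 'delta']
```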

If you want to make minimal changes, initialize an empty list before iterating through the file:

    words_1800 = []

And then add the matches for each line to the list, rather than replacing the list:

    words_1800.extend(re.findall(r'\w+', sepFile_1800))

Then you can do (without the second for loop):

    long_1800 = [w for w in words_1800 if len(w) > 3]
    common_words_1800 = dict(Counter(long_1800).most_common(30))
    print(common_words_1800)
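
Putting the pieces together, a variant that updates a Counter line by line may suit a 200 MB file better than building one large list first. This is a sketch under assumptions: the function name most_common_long_words and its parameters are hypothetical, and latin1 is only the encoding suggested above, not a confirmed fact about the file.

```python
import re
from collections import Counter


def most_common_long_words(path, encoding="latin1", n=30, min_len=4):
    """Count words of at least min_len characters and return the n most common.

    Hypothetical helper; the default encoding is an assumption about the file.
    """
    counts = Counter()
    with open(path, "r", encoding=encoding) as handle:
        for line in handle:
            words = re.findall(r"\w+", line.lower())
            # min_len=4 matches the original len(w) > 3 filter
            counts.update(w for w in words if len(w) >= min_len)
    return dict(counts.most_common(n))
```

It would be called as most_common_long_words('E:\\Book\\1800.txt'), and the import of re also covers the re.findall call that the original script uses without importing the module.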

Answer 1
1 2015-09-17 06:14:16

Answer 2
0 Accepted 2015-09-17 06:16:31
