UnicodeDecodeError in Python 3.5.2

Question

def getWordFreqs(textPath, stopWordsPath):
    wordFreqs = dict()
    #open the file in read mode and open stop words
    file = open(textPath, 'r')
    stopWords = set(line.strip() for line in open(stopWordsPath))
    #read the text
    text = file.read()
    #exclude punctuation and convert to lower case; exclude numbers as well
    punctuation = set('!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~')
    text = ''.join(ch.lower() for ch in text if ch not in punctuation)
    text = ''.join(ch for ch in text if not ch.isdigit())
    #read through the words and add to frequency dictionary
    #if it is not a stop word
    for word in text.split():
        if word not in stopWords:
            if word in wordFreqs:
                wordFreqs[word] += 1
            else:
                wordFreqs[word] = 1
    return wordFreqs

Every time I try to run this function under Python 3.5.2 I get the following error, but it works fine under 3.4.3, and I can't figure out what is causing it.

line 9, in getWordFreqs
    text = file.read()
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/encodings/ascii.py", line 26, in decode
    return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0x97 in position 520: ordinal not in range(128)

Answer 1

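In Python 3, open defaults to the encoding returned by locale.getpreferredencoding(False). That is normally not ascii, but it can be when running under some kind of framework, as the error message indicates.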
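To see which encoding a given environment falls back to, the value can be inspected directly. This is a minimal sketch; the helper name default_text_encoding is hypothetical and only wraps the call named above:

```python
import locale

def default_text_encoding():
    """Return the locale's preferred encoding, the fallback named in the answer."""
    # Hypothetical helper name; the call itself is the one quoted in the answer.
    return locale.getpreferredencoding(False)

print(default_text_encoding())
```

If this prints an ASCII-family name in the environment where the error occurs, that would be consistent with the traceback going through encodings/ascii.py.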

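Instead, specify the encoding of the file you are trying to read. If the file was created under Windows, the encoding is likely cp1252, especially since byte \x97 is an EM DASH under that encoding.

Try: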

file = open(textPath, 'r', encoding='cp1252')
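The byte from the traceback can also be checked in isolation, and the read can be wrapped in a with block so the file is closed afterwards. This is a sketch that assumes the file really is cp1252; read_text is a hypothetical helper, and its default encoding is only a guess suited to Windows-created files:

```python
def read_text(path, encoding='cp1252'):
    """Read a whole text file with an explicit encoding (hypothetical helper)."""
    # The with block closes the file even if decoding raises.
    with open(path, 'r', encoding=encoding) as f:
        return f.read()

# Byte 0x97 is outside the ASCII range, which is what the traceback reports.
try:
    b'\x97'.decode('ascii')
except UnicodeDecodeError as exc:
    print('ascii failed:', exc.reason)

# Under cp1252 the same byte decodes to U+2014 EM DASH.
print(b'\x97'.decode('cp1252') == '\u2014')
```

If the file turns out to use a different encoding, passing that name as the second argument is the only change needed.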

Answer 2

I believe one way to solve your problem is to put this code at the top of your file.

import sys
reload(sys)
sys.setdefaultencoding("UTF8")

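This sets the default encoding to UTF-8. It is a Python 2 idiom, though: in Python 3, reload is no longer a builtin and sys.setdefaultencoding does not exist, so it will not run on 3.5.2.

Another (better) solution is a library called codecs, which is very easy to use.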

import codecs
fileObj = codecs.open( "someFile", "r", "utf-8" )

fileObj is then a normal file object that can be read from and written to.
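In Python 3 the built-in open also accepts an encoding argument, so for a simple read the two approaches should give the same result. Here is a small sketch comparing them on a temporary file; the helper name roundtrip and the sample text are made up for illustration:

```python
import codecs
import os
import tempfile

def roundtrip(text, encoding='utf-8'):
    """Write text to a temp file, then read it back via codecs.open and open."""
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        with codecs.open(path, 'w', encoding) as f:
            f.write(text)
        with codecs.open(path, 'r', encoding) as f:
            via_codecs = f.read()
        with open(path, 'r', encoding=encoding) as f:
            via_open = f.read()
    finally:
        # Clean up the temporary file whether or not the reads succeeded.
        os.remove(path)
    return via_codecs, via_open

a, b = roundtrip('em dash: \u2014')
print(a == b)
```

The sample text has no newlines on purpose: codecs.open works on the underlying bytes in binary mode, while open in text mode may translate line endings.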

Note on method 1
This can be very dangerous when using third-party applications that use ASCII as their encoding. Use with caution.

Answer 1
1 2016-10-30 01:48:01

Answer 2
-2 2016-10-30 00:18:20