UnicodeDecodeError: 'utf-8' codec can't decode byte 0x92 in position 2893: invalid start byte

Question

I'm trying to open a series of HTML files in order to get the text from the body of those files using BeautifulSoup. I have about 435 files that I want to run through, but I keep getting this error.

I've tried converting the HTML files to text and opening the text files, but I get the same error.

import os
path = "./Bitcoin"
for file in os.listdir(path):
    with open(os.path.join(path, file), "r") as fname:
        txt = fname.read()

I want to get the source code of each HTML file so I can parse it with BeautifulSoup, but I get this error:

---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-133-f32d00599677> in <module>
      3 for file in os.listdir(path):
      4     with open(os.path.join(path, file), "r") as fname:
----> 5         txt = fname.read()

~/anaconda3/lib/python3.7/codecs.py in decode(self, input, final)
    320         # decode input (taking the buffer into account)
    321         data = self.buffer + input
--> 322         (result, consumed) = self._buffer_decode(data, self.errors, final)
    323         # keep undecoded input until the next call
    324         self.buffer = data[consumed:]

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x92 in position 2893: invalid start byte

Answer 1

There are various approaches to dealing with text data with unknown encodings. However, in this case, as you intend to pass the data to Beautiful Soup, the solution is simple: don't try to decode the file yourself; let Beautiful Soup do it. Beautiful Soup automatically decodes bytes to Unicode.

In your current code, you read the file in text mode, so Python decodes it with the platform's default encoding (UTF-8 on your system, as the traceback shows) unless you pass an encoding argument to the open function. This raises an error if the file's contents are not valid UTF-8.

for file in os.listdir(path):
    with open(os.path.join(path, file), "r") as fname:
        txt = fname.read()

Instead, read the HTML files in binary mode and pass the resulting bytes instance to Beautiful Soup.

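import os
from bs4 import BeautifulSoup

for file in os.listdir(path):
    with open(os.path.join(path, file), "rb") as fname:
        bytes_ = fname.read()
    # Parse inside the loop so every file is handled, not only the last one
    soup = BeautifulSoup(bytes_, "html.parser")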
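If it helps to know which of the files are affected before parsing them, the same binary-mode read can be used to test each file against UTF-8. This is only a sketch using the standard library; the function name find_non_utf8_files is made up for illustration and is not part of Beautiful Soup.

```python
import os


def find_non_utf8_files(path):
    """Return (filename, byte offset) pairs for files that are not valid UTF-8.

    Hypothetical helper for illustration only.
    """
    bad = []
    for name in sorted(os.listdir(path)):
        full = os.path.join(path, name)
        if not os.path.isfile(full):
            continue  # skip subdirectories
        with open(full, "rb") as fh:
            data = fh.read()
        try:
            data.decode("utf-8")
        except UnicodeDecodeError as exc:
            # exc.start is the index of the first byte that could not be decoded
            bad.append((name, exc.start))
    return bad
```

Calling find_non_utf8_files("./Bitcoin") would then list each problem file together with the position of the first offending byte, which is the same position the traceback reports.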

FWIW, the file currently causing your problem is probably encoded with cp1252 or a similar Windows 8-bit encoding, in which byte 0x92 is the right single quotation mark:

>>> '’'.encode('cp1252')
b'\x92'
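The check also works in the other direction. The sketch below uses a made-up byte string (not taken from the actual files) to show that byte 0x92 is rejected by the UTF-8 codec but decodes cleanly as cp1252. Whether the real files are cp1252 is still a guess.

```python
# Made-up sample bytes containing 0x92; the real file contents are unknown.
raw = b"Bitcoin\x92s price"

try:
    raw.decode("utf-8")
    utf8_ok = True
    failed_at = None
except UnicodeDecodeError as exc:
    utf8_ok = False
    failed_at = exc.start  # index of the byte that UTF-8 could not decode

# cp1252 maps 0x92 to U+2019, the right single quotation mark.
decoded = raw.decode("cp1252")

print(utf8_ok, failed_at, decoded)
```

If you ever need the text as a str without going through Beautiful Soup, passing encoding="cp1252" to open is the equivalent fix, assuming the guess about the encoding is right.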

Answer 1
1 2019-04-26 18:07:27
