UnicodeDecodeError: 'utf-8' codec can't decode byte 0x92 in position 2893: invalid start byte

I'm trying to open a series of HTML files in order to get the text from the body of those files using BeautifulSoup. I have about 435 files that I want to run through, but I keep getting this error.

I've tried converting the HTML files to text and opening the text files, but I get the same error.

import os

path = "./Bitcoin"
for file in os.listdir(path):
    with open(os.path.join(path, file), "r") as fname:
        txt = fname.read()

I want to get the source code of each HTML file so I can parse it with BeautifulSoup, but I get this error:

---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-133-f32d00599677> in <module>
      3 for file in os.listdir(path):
      4     with open(os.path.join(path, file), "r") as fname:
----> 5         txt = fname.read()

~/anaconda3/lib/python3.7/codecs.py in decode(self, input, final)
    320         # decode input (taking the buffer into account)
    321         data = self.buffer + input
--> 322         (result, consumed) = self._buffer_decode(data, self.errors, final)
    323         # keep undecoded input until the next call
    324         self.buffer = data[consumed:]

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x92 in position 2893: invalid start byte

There are various approaches to dealing with text data in unknown encodings. However, in this case, as you intend to pass the data to Beautiful Soup, the solution is simple: don't bother trying to decode the file yourself, let Beautiful Soup do it. Beautiful Soup will automatically decode bytes to Unicode.

In your current code, you read the file in text mode, which means that Python decodes it with the platform's default encoding (UTF-8 on your system, as the traceback shows) unless you pass an encoding argument to the open function. This causes an error if the file's contents are not valid UTF-8.

for file in os.listdir(path):
    with open(os.path.join(path, file), "r") as fname:
        txt = fname.read()
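
The failure can be reproduced without any files. A minimal sketch, using a made-up byte string (`sample` is illustrative, not taken from the files in question): 0x92 has the bit pattern 10010010, which UTF-8 reserves for continuation bytes, so it can never start a character.

```python
# Illustrative bytes: "it" + 0x92 + "s", as a cp1252 editor would save "it’s".
sample = b"it\x92s"

try:
    sample.decode("utf-8")
except UnicodeDecodeError as exc:
    # exc.start is the offset of the offending byte, exc.reason the explanation
    error_info = (exc.start, exc.reason)
    print(error_info)

# 0x92 is 0b10010010: the leading bits "10" mark a UTF-8 continuation byte
is_continuation_byte = (0x92 & 0b11000000) == 0b10000000
print(is_continuation_byte)
```
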

Instead, read the HTML files in binary mode and pass the resulting bytes instance to Beautiful Soup.

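from bs4 import BeautifulSoup

for file in os.listdir(path):
    with open(os.path.join(path, file), "rb") as fname:
        bytes_ = fname.read()
    # Parse inside the loop so every file is processed, not only the last one
    soup = BeautifulSoup(bytes_, "html.parser")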

FWIW, the file currently causing your problem is probably encoded with cp1252 or a similar Windows 8-bit encoding.

>>> '’'.encode('cp1252')
b'\x92'
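
If you do need the decoded text yourself rather than handing bytes to Beautiful Soup, one of the approaches for unknown encodings is to try UTF-8 first and fall back to cp1252. A minimal sketch: `decode_with_fallback` and its `encodings` parameter are hypothetical names, and the choice of cp1252 is an assumption based on the 0x92 byte, not something verified against the actual files.

```python
def decode_with_fallback(data, encodings=("utf-8", "cp1252")):
    """Return (text, encoding) using the first encoding that decodes cleanly."""
    for encoding in encodings:
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            continue  # try the next candidate
    # Last resort: keep going, replacing undecodable bytes with U+FFFD
    return data.decode(encodings[0], errors="replace"), encodings[0]


# Made-up sample bytes containing the cp1252 right single quotation mark
text, used = decode_with_fallback(b"Bitcoin\x92s price")
print(used, text)
```

In the loop above, this could be applied to the bytes read in "rb" mode. Note that a file which happens to decode under cp1252 is not guaranteed to have been written in it, so the result is a best guess.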
