UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte

Question

I'm trying to build a crawler in Python by following a Udacity course. I have this method, get_page(), which returns the content of a page.

from urllib.request import urlopen

def get_page(url):
    '''
    Open the given url and return the content of the page.
    '''

    data = urlopen(url)
    html = data.read()
    return html.decode('utf8')

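The original method just returned data.read(), but then I couldn't use operations like str.find(). After a quick search, I found that I need to decode the data. But now I get this error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte

I found similar questions on SO, but none that deal with this specifically. Please help.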

Answer 1

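You are trying to decode a byte string that is not valid UTF-8.

The first byte of any UTF-8 sequence must either be in the range 0x00 to 0x7F (a single-byte character) or be the lead byte of a multi-byte sequence, in the range 0xC2 to 0xF4. 0x8B (binary 10001011) has the bit pattern of a continuation byte, so it can never start a sequence. From RFC 3629, section 3:

In UTF-8, characters from the U+0000..U+10FFFF range (the UTF-16 accessible range) are encoded using sequences of 1 to 4 octets. The only octet of a "sequence" of one has the higher-order bit set to 0, the remaining 7 bits being used to encode the character number.

You should post the bytes you are trying to decode.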
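
To see why 0x8b is rejected, the sketch below classifies a byte by the role it can play in UTF-8 and reproduces the error from the question. The helper names classify_utf8_byte and describe_decode_failure, and the two-byte sample, are illustrative assumptions, not the asker's actual data.

```python
def classify_utf8_byte(b: int) -> str:
    """Return the role a single byte value can play in UTF-8 (hypothetical helper)."""
    if b <= 0x7F:
        return "single-byte (ASCII)"
    if b <= 0xBF:
        return "continuation byte"      # 0x80..0xBF, bit pattern 10xxxxxx
    if 0xC2 <= b <= 0xF4:
        return "lead byte of a multi-byte sequence"
    return "never valid in UTF-8"       # 0xC0, 0xC1, 0xF5..0xFF


def describe_decode_failure(raw: bytes):
    """Try UTF-8; return (position, byte, reason) on failure, else None."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError as err:
        return err.start, raw[err.start], err.reason
    return None


# Assumed sample: a valid ASCII byte followed by 0x8b, mirroring the traceback.
sample = b"\x1f\x8b"
failure = describe_decode_failure(sample)
print(classify_utf8_byte(0x8B))
print(failure)
```

Printing the first few raw bytes, as the answer asks for, is often the quickest diagnostic. One possibility the answers here do not raise: the byte pair 0x1f 0x8b is the gzip magic number, so a response that starts with it may be compressed data rather than text in another encoding.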

Answer 2

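Perhaps the page is encoded in a character encoding other than 'utf-8', which is why the start byte is invalid. You could do something like this: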

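import urllib.request

def get_page(self, url):
    if url is None:
        return None
    response = urllib.request.urlopen(url)
    if response.getcode() != 200:
        print("Http code:", response.getcode())
        return None
    # Read once: a second read() on the same response returns b''.
    raw = response.read()
    try:
        return raw.decode('utf-8')
    except UnicodeDecodeError:
        # Not valid UTF-8: hand back the undecoded bytes.
        return raw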
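
If the caller always needs a str (for example, to call str.find()), one alternative to returning raw bytes is to decode leniently. This is a minimal sketch, assuming a hypothetical helper named decode_lenient and made-up sample bytes; it is not part of the answer above. Undecodable bytes are replaced with U+FFFD instead of raising.

```python
def decode_lenient(raw: bytes, encoding: str = "utf-8") -> str:
    """Decode raw bytes, substituting U+FFFD for anything undecodable."""
    return raw.decode(encoding, errors="replace")


# Assumed sample: valid UTF-8 text with one stray 0x8b byte in the middle.
text = decode_lenient(b"caf\xc3\xa9 \x8b end")
print(repr(text))
print(text.find("end"))
```

The trade-off is that replacement hides the problem: if most of the output turns into U+FFFD, the bytes were probably not UTF-8 text in the first place.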

Answer 3

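Web servers usually serve HTML pages with a Content-Type header that includes the encoding used for the page. The header might look like this:

Content-Type: text/html; charset=UTF-8

We can inspect this header to find the encoding to use when decoding the page: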

from urllib.request import urlopen

def get_page(url):
    """ Open the given url and return the content of the page."""

    data = urlopen(url)
    content_type = data.headers.get('content-type', '')
    print(f'{content_type=}')
    encoding = 'latin-1'
    if 'charset' in content_type:
        _, _, encoding = content_type.rpartition('=')
        print(f'{encoding=}')
    html = data.read()
    return html.decode(encoding)
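
Splitting on '=' works for the header shown above but can misread values that are quoted or followed by further parameters. As a sketch of an alternative: the headers object returned by urlopen() is a subclass of email.message.Message, whose get_content_charset() method parses the charset parameter. The function name charset_from_headers and the Latin-1 default are assumptions for illustration, and the headers are built by hand so the example runs without a network request.

```python
from email.message import Message


def charset_from_headers(headers: Message, default: str = "latin-1") -> str:
    """Return the charset declared in Content-Type, or a default (hypothetical helper)."""
    # get_content_charset() returns the charset lowercased, or None if absent.
    return headers.get_content_charset() or default


# Hand-built headers standing in for response.headers.
quoted = Message()
quoted["Content-Type"] = 'text/html; charset="UTF-8"'

trailing = Message()
trailing["Content-Type"] = "text/html; charset=ISO-8859-1; foo=bar"

missing = Message()  # no Content-Type header at all

print(charset_from_headers(quoted))
print(charset_from_headers(trailing))
print(charset_from_headers(missing))
```

With a real response this would be called as charset_from_headers(data.headers) before decoding data.read().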

Similarly, with requests:

import requests

response = requests.get(url)
content_type = response.headers.get('content-type', '')

Latin-1 (or ISO-8859-1) is a safe default: it will always decode any bytes (though the result may not be useful).
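
This is easy to check: Latin-1 maps each of the 256 byte values to the code point with the same number, so decoding cannot fail and re-encoding gives back the original bytes. A small sketch (the variable names are illustrative):

```python
# Every possible byte value, 0x00 through 0xFF.
every_byte = bytes(range(256))

# Decoding never raises, and the round trip is lossless.
text = every_byte.decode("latin-1")
print(len(text))
print(text.encode("latin-1") == every_byte)

# "May not be useful": UTF-8 bytes read as Latin-1 come out as mojibake.
garbled = "é".encode("utf-8").decode("latin-1")
print(garbled)
```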

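If the server does not provide a Content-Type header, you can try looking in the HTML for a <meta> tag that specifies the encoding. Or pass the response bytes to Beautiful Soup and let it try to guess the encoding.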
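
A rough sketch of the <meta> approach using only the standard library. The regular expression, the 2048-byte search window, and the names sniff_meta_charset and decode_html are assumptions chosen for illustration; a real HTML parser or Beautiful Soup handles far more cases.

```python
import codecs
import re

# Matches both <meta charset="..."> and the older
# <meta http-equiv="Content-Type" content="text/html; charset=..."> form.
META_CHARSET_RE = re.compile(rb'<meta[^>]+charset=["\']?([\w-]+)', re.IGNORECASE)


def sniff_meta_charset(raw: bytes, default: str = "latin-1") -> str:
    """Look for a charset declared in a <meta> tag near the start of the page."""
    match = META_CHARSET_RE.search(raw[:2048])
    if match is None:
        return default
    name = match.group(1).decode("ascii")
    try:
        codecs.lookup(name)  # reject names Python has no codec for
    except LookupError:
        return default
    return name


def decode_html(raw: bytes) -> str:
    """Decode page bytes with the sniffed charset, falling back to Latin-1."""
    return raw.decode(sniff_meta_charset(raw), errors="replace")


page = b'<html><head><meta charset="utf-8"></head><body>caf\xc3\xa9</body></html>'
print(sniff_meta_charset(page))
print(decode_html(page).find("café") != -1)
```

The declared charset is only a claim made by the page, so decode_html() uses errors="replace" to avoid raising if the bytes do not match it.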


Solution 1
0 2016-12-18 07:46:06

Solution 2
0 2017-12-22 02:00:53

Solution 3
0 2021-09-17 08:02:58
