UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte

Question

I am trying to build a web crawler in Python by following a Udacity course. I have this method, get_page(), which returns the content of a page.

from urllib.request import urlopen

def get_page(url):
    '''
    Open the given url and return the content of the page.
    '''

    data = urlopen(url)
    html = data.read()
    return html.decode('utf8')

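The original method simply returned data.read(), but then I could not use operations such as str.find(). After a quick search, I found that I need to decode the data. But now I get this error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte

I found similar questions on SO, but none that addresses this specific case. Please help.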

Answer 1

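You are trying to decode a byte string that is not valid UTF-8.

The first byte of any UTF-8 sequence must be either in the range 0x00 to 0x7F (a single-byte character) or in the range 0xC2 to 0xF4 (the lead byte of a multi-byte sequence). 0x8B has the bit pattern 10001011, which marks a continuation byte, so it can never be a valid start byte. From RFC 3629, section 3: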

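In UTF-8, characters from the U+0000..U+10FFFF range (the UTF-16 accessible range) are encoded using sequences of 1 to 4 octets. The only octet of a "sequence" of one has the higher-order bit set to 0, the remaining 7 bits being used to encode the character number.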
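To make the byte ranges concrete, here is a minimal sketch that classifies a single byte by the role UTF-8 allows it to play. The function name classify_utf8_byte is made up for this illustration; the ranges follow the UTF-8 definition in RFC 3629.

```python
def classify_utf8_byte(byte):
    """Classify one byte value (0-255) by the role UTF-8 allows it to play."""
    if byte <= 0x7F:
        return "single-byte character (ASCII)"
    if 0x80 <= byte <= 0xBF:
        return "continuation byte (cannot start a sequence)"
    if 0xC2 <= byte <= 0xDF:
        return "lead byte of a 2-byte sequence"
    if 0xE0 <= byte <= 0xEF:
        return "lead byte of a 3-byte sequence"
    if 0xF0 <= byte <= 0xF4:
        return "lead byte of a 4-byte sequence"
    return "never valid in UTF-8"  # 0xC0, 0xC1 and 0xF5..0xFF


# The byte from the error message, shown in binary next to its classification.
print(f"{0x8B:08b}", classify_utf8_byte(0x8B))
# → 10001011 continuation byte (cannot start a sequence)
```

A decoder that meets 0x8B where it expects the start of a character therefore has no valid interpretation for it, which is what "invalid start byte" reports.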

You should post the byte string you are trying to decode.
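When posting the bytes, it helps to look at the first few of them. One possibility worth checking, which nothing in this thread confirms for the page in question, is that the server sent a gzip-compressed body: gzip streams begin with the bytes 0x1f 0x8b, which would put 0x8b at position 1, exactly as in the error. The sketch below is an assumption-laden illustration; describe_payload and maybe_gunzip are hypothetical helper names, and the sample data is generated locally rather than fetched.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # the first two bytes of every gzip stream


def describe_payload(raw):
    """Return a short description of the leading bytes of a response body."""
    head = raw[:8].hex(" ")
    if raw.startswith(GZIP_MAGIC):
        return f"starts with {head}: looks like gzip-compressed data"
    return f"starts with {head}"


def maybe_gunzip(raw):
    """Decompress the body if it carries the gzip magic number, else return it unchanged."""
    return gzip.decompress(raw) if raw.startswith(GZIP_MAGIC) else raw


# Local stand-in for data.read(); no network access is needed.
sample = gzip.compress("<html>héllo</html>".encode("utf-8"))
print(describe_payload(sample))
print(maybe_gunzip(sample).decode("utf-8"))
```

If the description reports gzip data, decoding the decompressed bytes instead of the raw ones may resolve the error; if not, the encoding-based suggestions in the other answers are the next thing to try.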

Answer 2

Perhaps the page is encoded in a character encoding other than 'utf-8', which would make the start byte invalid. You could try this:

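import urllib.request

def get_page(self, url):
    # Written as a method; drop `self` to use it as a plain function.
    if url is None:
        return None
    response = urllib.request.urlopen(url)
    if response.getcode() != 200:
        print("Http code:", response.getcode())
        return None
    raw = response.read()  # read once; a second read() would return b''
    try:
        return raw.decode('utf-8')
    except UnicodeDecodeError:
        return raw  # not valid UTF-8: fall back to the undecoded bytes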

Answer 3

Web servers often serve HTML pages with a Content-Type header that includes the encoding used to encode the page. The header might look like this:

Content-Type: text/html; charset=UTF-8

We can inspect the content of this header to find the encoding to use for decoding the page:

from urllib.request import urlopen

def get_page(url):
    """ Open the given url and return the content of the page."""

    data = urlopen(url)
    content_type = data.headers.get('content-type', '')
    print(f'{content_type=}')
    encoding = 'latin-1'
    if 'charset' in content_type:
        _, _, encoding = content_type.rpartition('=')
        print(f'{encoding=}')
    html = data.read()
    return html.decode(encoding)
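Splitting the header on '=' works for the simple case but returns the wrong value when the charset is quoted or is not the last parameter. As a possible alternative, the headers object returned by urlopen is an http.client.HTTPMessage, a subclass of email.message.Message, which provides get_content_charset(). The sketch below builds the headers locally so that it runs without a network connection; charset_from_headers is a hypothetical helper name.

```python
from email.message import Message


def charset_from_headers(headers, default="latin-1"):
    """Return the charset named in a Content-Type header, or a default."""
    # get_content_charset() parses the charset parameter, lower-cases it
    # and strips quotes; it returns None when no charset is declared.
    return headers.get_content_charset() or default


# Stand-in for `data.headers` from urlopen(url).
headers = Message()
headers["Content-Type"] = 'text/html; charset="UTF-8"'
print(charset_from_headers(headers))    # → utf-8

# No Content-Type header at all: fall back to the default.
print(charset_from_headers(Message()))  # → latin-1
```

In get_page() this would replace the rpartition step, for example encoding = charset_from_headers(data.headers).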

With requests it is similar:

import requests

response = requests.get(url)
content_type = response.headers.get('content-type', '')

Latin-1 (or ISO-8859-1) is a safe default: it will always decode any bytes (though the result may not be useful).
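A short demonstration of both halves of that claim: every one of the 256 byte values decodes under Latin-1 and round-trips, but UTF-8 data decoded this way comes out as mojibake.

```python
every_byte = bytes(range(256))

# Latin-1 maps byte value n to code point U+00nn, so decoding never fails
# and encoding the result gives back the original bytes.
text = every_byte.decode("latin-1")
print(len(text), text.encode("latin-1") == every_byte)  # → 256 True

# The catch: UTF-8 data decoded as Latin-1 is readable only by accident.
print("é".encode("utf-8").decode("latin-1"))  # → Ã©
```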

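If the server does not provide a Content-Type header, you can try looking in the HTML for a <meta> tag that specifies the encoding. Or pass the response bytes to Beautiful Soup and let it try to guess the encoding.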
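The <meta> lookup can be sketched with the standard library alone. This is a rough heuristic based on a regular expression, not a real HTML parser, and the names META_CHARSET and sniff_meta_charset are invented for this example; a library such as Beautiful Soup handles many more cases.

```python
import re

# Matches both <meta charset="utf-8"> and the older
# <meta http-equiv="Content-Type" content="text/html; charset=utf-8"> form.
META_CHARSET = re.compile(
    rb'<meta[^>]*?charset\s*=\s*["\']?\s*([A-Za-z0-9_.:-]+)',
    re.IGNORECASE,
)


def sniff_meta_charset(raw, default="latin-1", scan=2048):
    """Look for a declared charset in the first `scan` bytes of an HTML document."""
    match = META_CHARSET.search(raw[:scan])
    return match.group(1).decode("ascii").lower() if match else default


page = b'<html><head><meta charset="Shift_JIS"></head><body></body></html>'
print(sniff_meta_charset(page))  # → shift_jis
```

Searching the raw bytes works because the declaration itself is plain ASCII in the encodings commonly used for web pages; the result can then be passed to bytes.decode().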

Solution 1
0 2016-12-18 07:46:06

Solution 2
0 2017-12-22 02:00:53

Solution 3
0 2021-09-17 08:02:58
