Wrong accented characters when using Beautiful Soup in Python on a local HTML file

Question

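I am quite familiar with Beautiful Soup in Python, and until now I have only used it to scrape live sites.

Now I am scraping a local HTML file (link, in case you want to test the code). The only problem is that accented characters are not displayed correctly (this never happened when I scraped live sites).

This is a simplified version of the code: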

import requests, urllib.request, time, unicodedata, csv
from bs4 import BeautifulSoup

soup = BeautifulSoup(open('AH.html'), "html.parser")
tables = soup.find_all('table')
titles = tables[0].find_all('tr')
print(titles[55].text)

It prints the following output:

2:22 - Il Destino Ãˆ GiÃ Scritto (2017 ITA/ENG) [1080p] [BLUWORLD]

while the correct output should be:

2:22 - Il Destino È Già Scritto (2017 ITA/ENG) [1080p] [BLUWORLD]

I looked for a solution, read many questions and answers, and found this answer, which I implemented as follows:

import requests, urllib.request, time, unicodedata, csv
from bs4 import BeautifulSoup
import codecs

response = open('AH.html')
content = response.read()
html = codecs.decode(content, 'utf-8')
soup = BeautifulSoup(html, "html.parser")

However, it raises the following error:

Traceback (most recent call last):
  File "C:\Users\user\AppData\Local\Programs\Python\Python37-32\lib\encodings\utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
TypeError: a bytes-like object is required, not 'str'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "C:\Users\user\Desktop\score.py", line 8, in <module>
    html = codecs.decode(content, 'utf-8')
TypeError: decoding with 'utf-8' codec failed (TypeError: a bytes-like object is required, not 'str')

I guess this is easy to solve, but how?

Answer 1

from bs4 import BeautifulSoup


with open("AH.html") as f:
    soup = BeautifulSoup(f, 'html.parser')
    tb = soup.find("table")
    for item in tb.find_all("tr")[55]:
        print(item.text)

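I have to say that your first code is actually fine and should work.

As for the second code, you are trying to decode a str, which is the problem: decode works on bytes objects. Since open('AH.html') opens the file in text mode, read() already returns a str.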
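To make the str versus bytes point concrete, here is a minimal sketch. It assumes a small temporary file standing in for AH.html (the sample text and the temp file are placeholders, not the asker's data). Reading in binary mode gives bytes that codecs.decode accepts, while passing an already decoded str raises the TypeError from the question.

```python
import codecs
import os
import tempfile

sample = "Il Destino È Già Scritto"

# Write a small UTF-8 file that stands in for AH.html (hypothetical content).
with tempfile.NamedTemporaryFile("wb", suffix=".html", delete=False) as tmp:
    tmp.write(sample.encode("utf-8"))
    path = tmp.name

try:
    # Binary mode returns bytes, which is what codecs.decode expects.
    with open(path, "rb") as f:
        raw = f.read()
    decoded = codecs.decode(raw, "utf-8")

    # Decoding a str again fails, as in the traceback in the question.
    try:
        codecs.decode(decoded, "utf-8")
        str_decode_failed = False
    except TypeError:
        str_decode_failed = True
finally:
    os.remove(path)

print(type(raw).__name__, type(decoded).__name__, str_decode_failed)
```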

I believe you are on Windows, where the default encoding is cp1252 rather than UTF-8.
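If that is the cause, the garbled output can be reproduced without any file at all. The sketch below uses a short sample string (an assumption, taken from the title in the question): it encodes the text as UTF-8 and then decodes the bytes as cp1252, which gives the same kind of garbling as in the question.

```python
original = "È Già"

# UTF-8 stores each of these accented letters as two bytes.
utf8_bytes = original.encode("utf-8")

# Reading those bytes as cp1252 turns every byte into its own character.
garbled = utf8_bytes.decode("cp1252")

# The damage is reversible here because every byte maps to a cp1252 character.
repaired = garbled.encode("cp1252").decode("utf-8")

print(len(original), len(garbled), repaired == original)
```

This round trip is only meant as a diagnostic. Opening the file with the right encoding in the first place is the actual fix.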

Could you please run the following code:

import sys

print(sys.getdefaultencoding())
print(sys.stdin.encoding)
print(sys.stdout.encoding)
print(sys.stderr.encoding)

and check whether your output is UTF-8 or cp1252.
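One caveat: on Python 3, sys.getdefaultencoding() reports 'utf-8' regardless of platform, so it does not tell you what open() will use. A more direct check, sketched below with a throwaway temporary file (an assumption for illustration), is to open a file in text mode without an encoding argument and read its encoding attribute. The locale value is printed alongside because that is normally where the default comes from.

```python
import locale
import os
import sys
import tempfile

# Create an empty throwaway file so open() has something to open.
fd, path = tempfile.mkstemp(suffix=".html")
os.close(fd)

try:
    # No encoding argument: this is the situation from the question.
    with open(path) as f:
        open_default = f.encoding
finally:
    os.remove(path)

print("sys.getdefaultencoding():", sys.getdefaultencoding())
print("locale.getpreferredencoding(False):", locale.getpreferredencoding(False))
print("encoding used by open():", open_default)
```

If the last line shows cp1252, a UTF-8 file opened without an explicit encoding will come out garbled.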

Note that if you are using VSCode with Code-Runner, you should run your code in the terminal as py code.py.

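Solution (from chat)

(1) If you are on Windows 10

Open Control Panel and change the view to Small icons
Click Region
Click the Administrative tab
Click Change system locale...
Tick the box "Beta: Use Unicode UTF-8..."
Click OK and restart your PC

(2) If you are not on Windows 10, or simply do not want to change that setting, then in the first code change open("AH.html") to open("AH.html", encoding="UTF-8"), that is, write: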

from bs4 import BeautifulSoup

with open("AH.html", encoding="UTF-8") as f:
    soup = BeautifulSoup(f, 'html.parser')
    tb = soup.find("table")
    for item in tb.find_all("tr")[55]:
        print(item.text)

Answer 2

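Using open('AH.html') decodes the file with a default encoding that may not be the encoding of the file. BeautifulSoup understands the HTML header, and the following line in particular indicates that the file is UTF-8 encoded: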

<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">

Open the file in binary mode and let BeautifulSoup figure it out:

with open("AH.html","rb") as f:
    soup = BeautifulSoup(f, 'html.parser')
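To see what BeautifulSoup settled on, you can inspect soup.original_encoding after parsing bytes. The sketch below is an illustration only: it builds a small in-memory document (a hypothetical stand-in for AH.html) carrying the same meta tag, rather than reading the real file.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the raw bytes of AH.html, encoded as UTF-8.
html_bytes = (
    '<html><head>'
    '<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">'
    '</head><body><p>Il Destino È Già Scritto</p></body></html>'
).encode("utf-8")

# Passing bytes lets BeautifulSoup do the decoding itself.
soup = BeautifulSoup(html_bytes, "html.parser")

detected = soup.original_encoding  # encoding BeautifulSoup used to decode
text = soup.p.text

print("detected encoding:", detected)
```

A file object opened with "rb" behaves the same way as the bytes object used here.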

Sometimes a website declares its encoding incorrectly. In that case, if you know what the encoding should be, you can specify it yourself.

with open("AH.html",encoding='utf8') as f:
    soup = BeautifulSoup(f, 'html.parser')
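If you prefer to keep reading bytes, BeautifulSoup also accepts a from_encoding argument that overrides whatever the document declares. The sketch below assumes a made-up document whose meta tag claims windows-1252 although its bytes are really UTF-8.

```python
from bs4 import BeautifulSoup

# Hypothetical document: the bytes are UTF-8, but the meta tag says otherwise.
html_bytes = (
    '<html><head><meta charset="windows-1252"></head>'
    '<body><p>È Già</p></body></html>'
).encode("utf-8")

# from_encoding tells BeautifulSoup which encoding to use for the bytes.
soup = BeautifulSoup(html_bytes, "html.parser", from_encoding="utf-8")
text = soup.p.text

print(text == "È Già")
```

from_encoding only applies to bytes input. If you pass a str or a file opened in text mode, the decoding has already happened before BeautifulSoup sees the data.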

Answer 1: 0, accepted, 2020-03-18 15:01:02
Answer 2: 0, 2020-03-18 18:48:30