How do I encode/decode this BeautifulSoup string in Python so that non-standard Latin characters are output?

Question

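I am scraping a page with Beautiful Soup, and the output shows non-standard Latin characters as hexadecimal escapes.

I am scraping https://www.archchinese.com. It contains pinyin words, which use non-standard Latin characters (for example ǎ, ā). I have been trying to loop over a series of links that contain pinyin, using the BeautifulSoup .string attribute and UTF-8 encoding to output these words. The words come out with hexadecimal escapes where the non-standard characters should be: the word "hǎo" appears as "h\xc7\x8eo". I am sure I am doing something wrong with the encoding, but I do not know how to fix it. I first tried decoding with UTF-8, but I got an error saying the element has no decode method. Trying to print the string without encoding gives me an error about undefined characters, which I assumed was because they need to be encoded to something first.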

from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from bs4 import BeautifulSoup
import re

url = "https://www.archchinese.com/"

driver = webdriver.Chrome() #Set selenium up for opening page with Chrome.
driver.implicitly_wait(30)
driver.get(url)

driver.find_element_by_id('dictSearch').send_keys('好') # This character is hǎo.

python_button = driver.find_element_by_id('dictSearchBtn')
python_button.click() # Look for submit button and click it.

soup=BeautifulSoup(driver.page_source, 'lxml')

div = soup.find(id='charDef') # Find div with the target links.

for a in div.find_all('a', attrs={'class': 'arch-pinyin-font'}):
    print (a.string.encode('utf-8')) # Loop through all links with pinyin and attempt to encode.

Actual result: b'h\xc7\x8eo' b'h\xc3\xa0o'

Expected result: hǎo hào

Edit: The problem seems to be related to UnicodeEncodeError on Windows. I tried installing win-unicode-console, but had no luck. Thanks to snakecharmerb for the information.
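The UnicodeEncodeError mentioned in the edit can be reproduced without a Windows console. The sketch below is only an illustration: it assumes the console uses a legacy code page such as cp1252 (the asker's actual code page is not stated), and it uses an in-memory stream in place of the real stdout.

```python
import io

word = "h\u01ceo"  # "hǎo", with U+01CE written as an escape

# Assumption: the console code page is cp1252, which has no mapping for U+01CE.
failure = None
try:
    word.encode("cp1252")
except UnicodeEncodeError as exc:
    failure = str(exc)

# A text stream configured for UTF-8 accepts the same string without error.
buffer = io.BytesIO()
stream = io.TextIOWrapper(buffer, encoding="utf-8", newline="\n")
print(word, file=stream)
stream.flush()
written = buffer.getvalue()

print(failure is not None)  # the legacy code page rejected the character
print(repr(written))        # the UTF-8 stream received the encoded bytes
```

On Python 3.7 or later, one possible workaround along the same lines is to call sys.stdout.reconfigure(encoding="utf-8") before printing, or to set the PYTHONIOENCODING environment variable to utf-8. Whether the characters then display correctly still depends on the console font and settings.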

Answer 1

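There is no need to encode the value when printing; the print function handles that automatically. Right now you are printing the representation of the bytes that make up the encoded value, rather than the string itself.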

>>> s = 'hǎo'
>>> print(s)
hǎo

>>> print(s.encode('utf-8'))
b'h\xc7\x8eo'
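To make the str/bytes distinction concrete, here is a minimal sketch of the round trip. It also shows why the decode attempt described in the question failed: in Python 3 a str has no decode method, only bytes do. The variable names are illustrative.

```python
word = "h\u01ceo"  # "hǎo", with U+01CE written as an escape

# encode() converts the text into UTF-8 bytes; U+01CE becomes two bytes.
encoded = word.encode("utf-8")
print(repr(encoded))
print(len(word), len(encoded))

# A str cannot be decoded, because it is already text.
print(hasattr(word, "decode"))

# decode() on the bytes recovers the original text.
decoded = encoded.decode("utf-8")
print(decoded == word)
```

The value returned by .string in Beautiful Soup is a str subclass, so the same reasoning applies to it.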

Answer 2

Call encode when constructing BeautifulSoup, not afterwards.

soup=BeautifulSoup(driver.page_source.encode('utf-8'), 'lxml')

div = soup.find(id='charDef') # Find div with the target links.

for a in div.find_all('a', attrs={'class': 'arch-pinyin-font'}):
    print (a.string)

Answer 1: score 2, accepted, 2018-12-22 18:53:34
Answer 2: score 1, 2018-12-22 19:48:41