I am scraping a page with Beautiful Soup, and the output contains non-standard Latin characters that are showing up as hex.
I am scraping https://www.archchinese.com. It contains pinyin words, which use non-standard Latin characters (ǎ and ā, for example). I've been trying to loop through a series of links that contain pinyin, using the BeautifulSoup .string attribute along with UTF-8 encoding to output these words. Each word comes out with hex escapes in place of the non-standard characters: the word "hǎo" comes out as "h\xc7\x8eo". I'm sure I'm doing something wrong with the encoding, but I don't know enough to know what to fix. I tried decoding with UTF-8 first, but I get an error that the element has no decode function. Printing the string without encoding gives me an error about the characters being undefined, which I assume is because they need to be encoded to something first.
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from bs4 import BeautifulSoup
import re
url = "https://www.archchinese.com/"
driver = webdriver.Chrome() #Set selenium up for opening page with Chrome.
driver.implicitly_wait(30)
driver.get(url)
driver.find_element_by_id('dictSearch').send_keys('好') # This character is hǎo.
python_button = driver.find_element_by_id('dictSearchBtn')
python_button.click() # Look for submit button and click it.
soup=BeautifulSoup(driver.page_source, 'lxml')
div = soup.find(id='charDef') # Find div with the target links.
for a in div.find_all('a', attrs={'class': 'arch-pinyin-font'}):
    print(a.string.encode('utf-8')) # Loop through all links with pinyin and attempt to encode.
Actual results: b'h\xc7\x8eo' b'h\xc3\xa0o'
Expected results: hǎo hào
EDIT: The problem seems to be related to the UnicodeEncodeError in Windows. I've tried to install win-unicode-console, but no luck. Thanks to snakecharmerb for the info.
You don't need to encode the values when printing - the print function will take care of this automatically. Right now, you're printing the representation of the bytes that make up the encoded value rather than just the string itself.
>>> s = 'hǎo'
>>> print(s)
hǎo
>>> print(s.encode('utf-8'))
b'h\xc7\x8eo'
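If printing the plain string raises UnicodeEncodeError (as the edit to the question suggests happens on Windows), the console's encoding is probably the limiting factor rather than Beautiful Soup. Below is a minimal sketch of one way to cope with that. The helper name safe_print is made up for illustration, and the assumption is Python 3.7+ for the optional sys.stdout.reconfigure call mentioned in the comments. It also shows that the bytes you were seeing decode back to the expected word.

```python
import io
import sys

# The bytes from the question decode back to the expected string.
raw = b'h\xc7\x8eo'
word = raw.decode('utf-8')  # 'hǎo'


def safe_print(text, stream=None):
    """Print text, escaping characters the stream's encoding cannot represent.

    safe_print is a hypothetical helper, not part of any library.
    """
    stream = stream or sys.stdout
    encoding = getattr(stream, "encoding", None) or "utf-8"
    # Round-trip through the stream's own encoding so that unencodable
    # characters become backslash escapes instead of raising UnicodeEncodeError.
    printable = text.encode(encoding, errors="backslashreplace").decode(encoding)
    print(printable, file=stream)


# Alternative (Python 3.7+): ask stdout to emit UTF-8 directly. Whether the
# console then renders the characters depends on the terminal and its font.
# if hasattr(sys.stdout, "reconfigure"):
#     sys.stdout.reconfigure(encoding="utf-8")
```

Calling safe_print(a.string) in the loop would then print the pinyin unchanged on a UTF-8 console and fall back to escapes on one that cannot encode it.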
Encode the page source when you pass it to BeautifulSoup, rather than encoding each string afterwards.
soup=BeautifulSoup(driver.page_source.encode('utf-8'), 'lxml')
div = soup.find(id='charDef') # Find div with the target links.
for a in div.find_all('a', attrs={'class': 'arch-pinyin-font'}):
    print(a.string)