I am extracting HTML containing Unicode characters from a webpage as follows:
import urllib.request

def extract(url):
    """ Adapted from Python3_Google_Search.py """
    user_agent = ("Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US) "
                  "AppleWebKit/525.13 (KHTML, like Gecko) "
                  "Chrome/0.2.149.29 Safari/525.13")
    request = urllib.request.Request(url)
    request.add_header("User-Agent", user_agent)
    response = urllib.request.urlopen(request)
    html = response.read().decode("utf8")
    return html
I am decoding properly, as you can see, so html is now a Unicode string, and when I print html I can see the Unicode characters.
I am using html.parser to parse the HTML and have subclassed it:
from html.parser import HTMLParser

class Parser(HTMLParser):
    def __init__(self):
        super().__init__()
        ## some init stuff

    #### rest of class
When I parse the HTML using the class's handle_data method, the Unicode characters are removed/suddenly disappear. The docs do not mention anything about encodings. Why does HTMLParser remove non-ASCII characters, and how can I fix this?
Apparently, html.parser calls handle_entityref whenever it encounters a named character reference such as &eacute; (not a raw non-ASCII character — those pass through handle_data untouched). It passes the entity name, and to convert that name to the corresponding Unicode character I used:
html.entities.html5[name]
Python's documentation does not mention that. I've never seen worse documentation than Python's.
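Putting that together, a sketch of a parser that reassembles the text by handling both named and numeric character references (the `Parser` name and the sample input are illustrative; the `name + ";"` lookup is used because `html.entities.html5` keys newer entities only with the trailing semicolon):

```python
import html.entities
from html.parser import HTMLParser

class Parser(HTMLParser):
    def __init__(self):
        # convert_charrefs=False so the entity callbacks actually fire
        super().__init__(convert_charrefs=False)
        self.text = []

    def handle_data(self, data):
        self.text.append(data)

    def handle_entityref(self, name):
        # name arrives without "&" and ";", e.g. "eacute";
        # html.entities.html5 maps "eacute;" -> "é"
        self.text.append(html.entities.html5[name + ";"])

    def handle_charref(self, name):
        # numeric references such as &#8212; (decimal) or &#x2044; (hex)
        if name.startswith(("x", "X")):
            self.text.append(chr(int(name[1:], 16)))
        else:
            self.text.append(chr(int(name)))

p = Parser()
p.feed("<p>caf&eacute; &#8212; 3&#x2044;4</p>")
print("".join(p.text))  # -> café — 3⁄4
```

On Python 3.5+ the simpler fix is to not pass `convert_charrefs=False` at all: with the default `convert_charrefs=True`, all character references are converted to Unicode automatically and delivered straight to `handle_data`.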