Unicode Disappearing in html.parser
I am extracting HTML from a webpage that contains Unicode characters, as follows:
import urllib.request

def extract(url):
    """Adapted from Python3_Google_Search.py"""
    # Adjacent string literals are concatenated, so each piece needs its own trailing space.
    user_agent = ("Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US) "
                  "AppleWebKit/525.13 (KHTML, like Gecko) "
                  "Chrome/0.2.149.29 Safari/525.13")
    request = urllib.request.Request(url)
    request.add_header("User-Agent", user_agent)
    response = urllib.request.urlopen(request)
    # Decode the raw bytes into a str.
    html = response.read().decode("utf8")
    return html
I am decoding properly, as you can see, so html is now a Unicode string. When printing html, I can see the Unicode characters.
I am using html.parser to parse the HTML and subclassed it:
from html.parser import HTMLParser

class Parser(HTMLParser):
    def __init__(self):
        # The base class must be initialised before the parser is used.
        super().__init__()
        ## some init stuff

    #### rest of class
When parsing the HTML using the class's handle_data, it appears that the Unicode characters are removed or suddenly disappear. The docs do not mention anything about encodings. Why does HTMLParser remove non-ASCII characters, and how can I fix this?
Apparently, html.parser calls handle_entityref, not handle_data, whenever it encounters a named character reference such as &eacute;, so non-ASCII characters written that way never reach handle_data. It passes the name of the reference, and to convert that to the Unicode character, I used:
html.entities.html5[name]
Python's documentation does not mention that. I've never seen worse documentation than Python's.
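For anyone hitting the same thing, here is a minimal sketch of a subclass that keeps the referenced characters in the parsed text. The names TextCollector and extract_text are made up for illustration and are not part of html.parser. The sketch assumes the parser is created with convert_charrefs=False, so that references are reported through handle_entityref and handle_charref. It also tries the name with a trailing semicolon first, because most keys in html.entities.html5 include it (for example "hearts;"), and a plain html5[name] lookup can raise KeyError for those.

```python
from html import unescape
from html.entities import html5
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Hypothetical subclass that keeps character references as Unicode text."""

    def __init__(self):
        # With convert_charrefs=False the parser reports references through
        # handle_entityref/handle_charref instead of converting them itself.
        super().__init__(convert_charrefs=False)
        self.chunks = []

    def handle_data(self, data):
        # Ordinary text, including literal non-ASCII characters, arrives here.
        self.chunks.append(data)

    def handle_entityref(self, name):
        # name arrives without "&" and ";" (e.g. "eacute"). Most html5 keys
        # carry the trailing semicolon, so try that form first.
        char = html5.get(name + ";") or html5.get(name)
        # Unknown names are kept in "&name;" form rather than dropped.
        self.chunks.append(char if char is not None else "&" + name + ";")

    def handle_charref(self, name):
        # Numeric references like "&#233;" or "&#xe9;" arrive as "233" / "xe9".
        self.chunks.append(unescape("&#" + name + ";"))

    def text(self):
        return "".join(self.chunks)


def extract_text(markup):
    """Feed markup to a fresh TextCollector and return the collected text."""
    parser = TextCollector()
    parser.feed(markup)
    parser.close()
    return parser.text()
```

Calling extract_text("<p>caf&eacute; &hearts;</p>") should give "café ♥". On Python 3.5 and later, HTMLParser defaults to convert_charrefs=True, in which case the references are converted before handle_data is called and neither of the two extra handlers is needed.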