Unicode disappears in html.parser

Question

I am extracting HTML from a webpage that contains Unicode characters, as follows:

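import urllib.request

def extract(url):
    """ Adapted from Python3_Google_Search.py """
    # Adjacent string literals are concatenated, so each piece needs its own separating space.
    user_agent = ("Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US) "
                  "AppleWebKit/525.13 (KHTML, like Gecko) "
                  "Chrome/0.2.149.29 Safari/525.13")
    request = urllib.request.Request(url)
    request.add_header("User-Agent", user_agent)
    response = urllib.request.urlopen(request)
    # Decode the raw bytes into a str; this assumes the page is served as UTF-8.
    html = response.read().decode("utf8")
    return html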

I am decoding properly, as you can see, so html is now a Unicode string. When I print html, I can see the Unicode characters.
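
As a side note on the decoding step: the snippet above hard-codes UTF-8, which only holds if the server really sends UTF-8. A minimal sketch of picking the charset from the Content-Type header instead is shown here. The helper name decode_body and its parameters are hypothetical, introduced only for illustration; email.message.Message.get_content_charset is the same method that response.headers exposes in urllib.

```python
from email.message import Message


def decode_body(raw, content_type, default="utf-8"):
    """Decode raw bytes using the charset named in a Content-Type header value.

    decode_body is a hypothetical helper, not part of urllib.
    """
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() returns None when the header names no charset.
    charset = msg.get_content_charset() or default
    try:
        return raw.decode(charset, errors="replace")
    except LookupError:
        # The server named a codec Python does not know; fall back to the default.
        return raw.decode(default, errors="replace")
```

Inside extract, this would be called as decode_body(response.read(), response.headers.get("Content-Type", "text/html")).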

I am using html.parser to parse the HTML, and I subclassed HTMLParser:

from html.parser import HTMLParser

class Parser(HTMLParser):
  def __init__(self):
    super().__init__()  # set up the base parser's state before anything else
    ## some init stuff
  #### rest of class

When I parse the HTML using the class's handle_data, the Unicode characters appear to be removed: they suddenly disappear. The docs do not mention anything about encodings. Why does HTMLParser remove non-ASCII characters, and how can I fix this?

Answer 1

Apparently, html.parser calls handle_entityref whenever it encounters a non-ASCII character written as a named character reference (for example &eacute;) instead of passing it to handle_data. It passes the name of the reference, and to convert that to the Unicode character, I used:

html.entities.html5[name]
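
A minimal sketch of how that lookup could sit inside a parser subclass follows. The class name TextCollector and its text() method are hypothetical, made up for this illustration. Two assumptions are worth stating: most keys in html.entities.html5 end with a semicolon while handle_entityref receives the name without one, so the sketch appends ";" before the lookup; and convert_charrefs exists only on Python 3.4 and later (since Python 3.5 it defaults to True, in which case the parser converts references itself and passes the result to handle_data).

```python
from html import unescape
from html.entities import html5
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Hypothetical example class: gathers text, resolving character references."""

    def __init__(self):
        # convert_charrefs=False routes references to the two handlers below
        # instead of converting them before handle_data is called.
        super().__init__(convert_charrefs=False)
        self.chunks = []

    def handle_data(self, data):
        # Plain text, including any literal non-ASCII characters.
        self.chunks.append(data)

    def handle_entityref(self, name):
        # name arrives without '&' and ';', e.g. 'eacute' for '&eacute;'.
        # Unknown names are kept as written rather than raising KeyError.
        self.chunks.append(html5.get(name + ";", "&" + name + ";"))

    def handle_charref(self, name):
        # name is the numeric part, e.g. '233' or 'xE9'; let unescape decode it.
        self.chunks.append(unescape("&#" + name + ";"))

    def text(self):
        return "".join(self.chunks)
```

Feeding this parser the string "<p>caf&eacute; &#233;</p>", calling close(), and then calling text() should return "café é".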

Python's documentation does not mention that. I've never seen worse documentation than Python's.

Solution 1
0 2013-05-03 17:23:51