
Unicode Disappearing in html.parser

I am fetching the HTML of a webpage that contains non-ASCII (Unicode) characters as follows:

import urllib.request

def extract(url):
    """Adapted from Python3_Google_Search.py."""
    # Adjacent string literals are concatenated, so each piece ends with a space.
    user_agent = ("Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US) "
                  "AppleWebKit/525.13 (KHTML, like Gecko) "
                  "Chrome/0.2.149.29 Safari/525.13")
    request = urllib.request.Request(url)
    request.add_header("User-Agent", user_agent)
    response = urllib.request.urlopen(request)
    # read() returns bytes; decode() turns them into a str.
    html = response.read().decode("utf8")
    return html

As you can see, I decode the response body, so html is a Unicode string (str). When I print html, the Unicode characters are there.
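One thing worth ruling out at this step: the hard-coded "utf8" is only right if the page really is UTF-8. Here is a minimal sketch, assuming a hypothetical helper named decode_body (it is not part of urllib), that prefers the charset the server declared and falls back to UTF-8:

```python
def decode_body(raw, declared_charset=None):
    """Decode response bytes, preferring the charset the server declared.

    decode_body is a hypothetical helper for illustration only.
    """
    charset = declared_charset or "utf-8"
    try:
        return raw.decode(charset)
    except (LookupError, UnicodeDecodeError):
        # Unknown or wrong charset label: fall back to UTF-8 and mark
        # undecodable bytes with U+FFFD instead of raising.
        return raw.decode("utf-8", errors="replace")
```

Inside extract this could be called as decode_body(response.read(), response.headers.get_content_charset()); get_content_charset() returns None when the server sends no charset, which is why the helper treats None as UTF-8.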

I parse the HTML with html.parser, using a subclass of HTMLParser:

from html.parser import HTMLParser

class Parser(HTMLParser):
    def __init__(self):
        super().__init__()  # let HTMLParser set up its own state first
        ## some init stuff
    #### rest of class

When I parse the HTML and collect the text in the class's handle_data, the Unicode characters seem to disappear. The docs do not mention anything about encodings. Why does HTMLParser remove non-ASCII characters, and how can I fix this?
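A quick way to see where the characters go is to log which handler receives what. This is only a sketch: EventLogger and log_events are hypothetical names, and the convert_charrefs argument assumes Python 3.4 or later.

```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Hypothetical helper: records which handler sees each piece of text."""

    def __init__(self, convert_charrefs):
        super().__init__(convert_charrefs=convert_charrefs)
        self.events = []

    def handle_data(self, data):
        self.events.append(("data", data))

    def handle_entityref(self, name):
        # Called for named references such as &eacute; (name == "eacute").
        self.events.append(("entityref", name))

    def handle_charref(self, name):
        # Called for numeric references such as &#233; (name == "233").
        self.events.append(("charref", name))


def log_events(markup, convert_charrefs):
    """Feed markup to an EventLogger and return the recorded events."""
    parser = EventLogger(convert_charrefs)
    parser.feed(markup)
    parser.close()
    return parser.events
```

log_events("é&eacute;", False) reports the literal é through handle_data and the reference through handle_entityref as the bare name eacute, so a subclass that only implements handle_data never sees the second character. With convert_charrefs=True both arrive in handle_data as text.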

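Apparently, html.parser calls handle_entityref whenever it encounters a named character reference such as &eacute; (at least when convert_charrefs is False). Literal non-ASCII characters still go to handle_data unchanged; it is only the references that are routed elsewhere. The method receives the name of the reference without the leading & and trailing ;, and to convert that to the Unicode character, I used: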

html.entities.html5[name + ";"]  # every html5 key has a form ending in ";"
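Put together, the lookup could sit in a subclass roughly like this. It is a sketch, not a definitive implementation: TextCollector and extract_text are hypothetical names, convert_charrefs is set to False on purpose so the reference handlers are actually called, and the lookup appends ";" because every key in html.entities.html5 has a form with the trailing semicolon while only some also exist without it.

```python
from html.entities import html5
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Hypothetical parser that gathers text, resolving references by hand."""

    def __init__(self):
        # convert_charrefs=False makes the parser report references through
        # handle_entityref/handle_charref instead of decoding them itself.
        super().__init__(convert_charrefs=False)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

    def handle_entityref(self, name):
        # name arrives without "&" or ";". Unknown names are kept verbatim.
        self.parts.append(html5.get(name + ";", "&" + name + ";"))

    def handle_charref(self, name):
        # Numeric references: "233" (decimal) or "xE9" (hexadecimal).
        if name[0] in "xX":
            code = int(name[1:], 16)
        else:
            code = int(name)
        self.parts.append(chr(code))


def extract_text(markup):
    """Return the text content of markup with references resolved."""
    parser = TextCollector()
    parser.feed(markup)
    parser.close()
    return "".join(parser.parts)
```

On Python 3.5 and later convert_charrefs defaults to True, so a subclass that calls super().__init__() without arguments already gets the decoded characters in handle_data and does not need either reference handler.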

Python's documentation does not mention that. I've never seen worse documentation than Python's.


 