Python - Convert unicode hex to string

I am using the Readability Parser API to extract content from a web page. It works fine when the page uses the Latin character set, but when I extract an article in Cyrillic, I end up with the following:

<div>&#x412;&#x432;&#x43E;&#x441;&#x43A;&#x440;&#x435;&#x441;&#x435;&#x43D;&#x44C;</div>...etc

Interestingly, the title of the page is extracted correctly in Cyrillic, but the content is not. I tried the following, as suggested in this SO answer:

content = unicodedata.normalize('NFKD', content).encode('ascii','ignore')

but it did not work. Is there a way to convert this string before saving it to the database?

Please let me know if the title of my question correctly explains what I need. Thank you.

One way (Python 3.3; note that HTMLParser.unescape was deprecated in 3.4 and removed in Python 3.9):

>>> s='<div>&#x412;&#x432;&#x43E;&#x441;&#x43A;&#x440;&#x435;&#x441;&#x435;&#x43D;&#x44C;</div>'
>>> import html.parser
>>> h=html.parser.HTMLParser()
>>> h.unescape(s)
'<div>Ввоскресень</div>'

Python 2.7:

>>> s='<div>&#x412;&#x432;&#x43E;&#x441;&#x43A;&#x440;&#x435;&#x441;&#x435;&#x43D;&#x44C;</div>'
>>> import HTMLParser
>>> h=HTMLParser.HTMLParser()
>>> print(h.unescape(s))
<div>Ввоскресень</div>

P.S. I went to look for the documentation link, and it looks like unescape isn't documented. Here's a way that avoids the undocumented API (Python 3, where chr accepts any Unicode code point):

>>> import re
>>> re.sub(r'&#x([0-9A-Fa-f]+);', lambda m: chr(int(m.group(1), 16)), s)
'<div>Ввоскресень</div>'
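The regex approach above only covers hexadecimal references. A slightly more defensive sketch, assuming Python 3, might also accept decimal references such as &#1074; and leave anything it cannot decode untouched. The helper name decode_numeric_refs is made up for illustration and is not part of any library:

```python
import re

# Matches hexadecimal (&#x412;) and decimal (&#1042;) numeric character references.
_NUMERIC_REF = re.compile(r'&#(?:[xX]([0-9A-Fa-f]+)|([0-9]+));')


def decode_numeric_refs(text):
    """Replace numeric character references with the characters they encode.

    Hypothetical helper for illustration only. Named entities such as &amp;
    are deliberately left as they are.
    """
    def _replace(match):
        hex_digits, dec_digits = match.groups()
        code_point = int(hex_digits, 16) if hex_digits else int(dec_digits)
        try:
            return chr(code_point)
        except (ValueError, OverflowError):
            # Code point is outside the Unicode range: keep the original text.
            return match.group(0)

    return _NUMERIC_REF.sub(_replace, text)


s = '<div>&#x412;&#x432;&#x43E;&#x441;&#x43A;&#x440;&#x435;&#x441;&#x435;&#x43D;&#x44C;</div>'
decoded = decode_numeric_refs(s)
print(decoded)
```

Unlike a full HTML unescape, this does not touch named entities, which may or may not be what you want before saving to the database.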

Per a comment, it looks like this is finally documented (and has moved) as of Python 3.4, where it is available as html.unescape.
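A minimal sketch of the Python 3.4+ route, assuming the same sample string as above, uses the documented html.unescape function from the standard library:

```python
import html

s = '<div>&#x412;&#x432;&#x43E;&#x441;&#x43A;&#x440;&#x435;&#x441;&#x435;&#x43D;&#x44C;</div>'

# html.unescape converts numeric and named character references to text.
decoded = html.unescape(s)
print(decoded)
```

This also converts named entities such as &amp;amp;, so it is probably the simplest choice when the whole fragment should be unescaped.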
