How do you convert HTML entities to Unicode and vice versa in Python?
As for the "vice versa" (which I needed myself, and which led me first to this question, which didn't help, and then to another site that had the answer):
u'some string'.encode('ascii', 'xmlcharrefreplace')
returns a plain (byte) string in which every non-ASCII character has been turned into an XML (HTML) numeric character reference such as &#233;.
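On Python 3 the same call yields a bytes object, so a decode step is needed to get text back. A minimal sketch, where the helper name to_char_refs is made up for illustration:

```python
def to_char_refs(text):
    """Replace every non-ASCII character with a decimal character reference."""
    # encode() with 'xmlcharrefreplace' returns bytes on Python 3;
    # the result is pure ASCII, so decoding it back is always safe.
    return text.encode('ascii', 'xmlcharrefreplace').decode('ascii')

print(to_char_refs('Curé costs 5 €'))
# Cur&#233; costs 5 &#8364;
```

Note that this step alone leaves ASCII markup characters such as & and < untouched; those need a separate escaping pass.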
You need to have BeautifulSoup installed (the example below uses BeautifulSoup 3 on Python 2).
# Python 2 with BeautifulSoup 3
from BeautifulSoup import BeautifulStoneSoup
import cgi

def HTMLEntitiesToUnicode(text):
    """Converts HTML entities to unicode. For example '&amp;' becomes '&'."""
    text = unicode(BeautifulStoneSoup(text, convertEntities=BeautifulStoneSoup.ALL_ENTITIES))
    return text

def unicodeToHTMLEntities(text):
    """Converts unicode to HTML entities. For example '&' becomes '&amp;'."""
    text = cgi.escape(text).encode('ascii', 'xmlcharrefreplace')
    return text

text = "&amp;, &reg;, &lt;, &gt;, &cent;, &pound;, &yen;, &euro;, &sect;, &copy;"

uni = HTMLEntitiesToUnicode(text)
htmlent = unicodeToHTMLEntities(uni)

print uni
print htmlent
# &, ®, <, >, ¢, £, ¥, €, §, ©
# &amp;, &#174;, &lt;, &gt;, &#162;, &#163;, &#165;, &#8364;, &#167;, &#169;
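For readers on Python 3, where BeautifulSoup 3 and cgi.escape are not available, roughly the same round trip can be sketched with the standard library alone. The function names below are illustrative, and quote=False is an assumption chosen to mirror the default behaviour of cgi.escape:

```python
import html

def html_entities_to_unicode(text):
    """Turn references such as '&amp;' or '&reg;' into the characters they name."""
    return html.unescape(text)

def unicode_to_html_entities(text):
    """Escape markup characters, then turn non-ASCII into numeric references."""
    escaped = html.escape(text, quote=False)  # handles &, < and > only
    return escaped.encode('ascii', 'xmlcharrefreplace').decode('ascii')

text = "&amp;, &reg;, &lt;, &gt;, &cent;, &pound;, &yen;, &euro;, &sect;, &copy;"
uni = html_entities_to_unicode(text)
htmlent = unicode_to_html_entities(uni)

print(uni)
print(htmlent)
# &, ®, <, >, ¢, £, ¥, €, §, ©
# &amp;, &#174;, &lt;, &gt;, &#162;, &#163;, &#165;, &#8364;, &#167;, &#169;
```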
Update for Python 2.7 and BeautifulSoup4
Unescape -- Unicode HTML to unicode with htmlparser
(Python 2.7 standard lib):
>>> escaped = u'Monsieur le Cur&eacute; of the &laquo;Notre-Dame-de-Gr&acirc;ce&raquo; neighborhood'
>>> from HTMLParser import HTMLParser
>>> htmlparser = HTMLParser()
>>> unescaped = htmlparser.unescape(escaped)
>>> unescaped
u'Monsieur le Cur\xe9 of the \xabNotre-Dame-de-Gr\xe2ce\xbb neighborhood'
>>> print unescaped
Monsieur le Curé of the «Notre-Dame-de-Grâce» neighborhood
Unescape -- Unicode HTML to unicode with bs4
(BeautifulSoup4):
>>> html = '''<p>Monsieur le Cur&eacute; of the &laquo;Notre-Dame-de-Gr&acirc;ce&raquo; neighborhood</p>'''
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(html)
>>> soup.text
u'Monsieur le Cur\xe9 of the \xabNotre-Dame-de-Gr\xe2ce\xbb neighborhood'
>>> print soup.text
Monsieur le Curé of the «Notre-Dame-de-Grâce» neighborhood
Escape -- Unicode to unicode HTML with bs4
(BeautifulSoup4):
>>> unescaped = u'Monsieur le Curé of the «Notre-Dame-de-Grâce» neighborhood'
>>> from bs4.dammit import EntitySubstitution
>>> escaper = EntitySubstitution()
>>> escaped = escaper.substitute_html(unescaped)
>>> escaped
u'Monsieur le Cur&eacute; of the &laquo;Notre-Dame-de-Gr&acirc;ce&raquo; neighborhood'
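Without BeautifulSoup, a similar named-entity escape can be approximated on Python 3 with the standard library's html.entities.codepoint2name table. The helper below is a hypothetical sketch, not a description of what EntitySubstitution does internally; it falls back to numeric references for non-ASCII characters that have no HTML 4 name.

```python
from html.entities import codepoint2name

def escape_with_named_entities(text):
    """Replace characters with named entities where an HTML 4 name exists."""
    pieces = []
    for ch in text:
        name = codepoint2name.get(ord(ch))
        if name is not None:
            # Covers &, <, > and " as well as accented letters and symbols.
            pieces.append('&' + name + ';')
        elif ord(ch) > 127:
            # No named entity: fall back to a decimal character reference.
            pieces.append('&#%d;' % ord(ch))
        else:
            pieces.append(ch)
    return ''.join(pieces)

print(escape_with_named_entities('Monsieur le Curé of the «Notre-Dame-de-Grâce» neighborhood'))
# Monsieur le Cur&eacute; of the &laquo;Notre-Dame-de-Gr&acirc;ce&raquo; neighborhood
```

Because the text is processed one character at a time, an ampersand is escaped exactly once and never double-escaped.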
As hekevintran's answer suggests, you may use cgi.escape(s) for encoding strings, but note that quote escaping is off by default in that function (quote=False), so it may be a good idea to pass the quote=True keyword argument alongside your string. Even with quote=True, the function won't escape single quotes ("'"). Because of these issues the function has been deprecated since Python 3.2 (and it was removed in Python 3.8).
It's been suggested to use html.escape(s) instead of cgi.escape(s) (new in version 3.2).
Also, html.unescape(s) was introduced in version 3.4.
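The difference in quote handling can be checked with a short experiment on Python 3 (the sample string is made up for illustration):

```python
import html

sample = 'Tom & "Jerry" <it\'s>'

# Default (quote=True): both double and single quotes are escaped too.
full = html.escape(sample)
# quote=False: only &, < and > are replaced.
minimal = html.escape(sample, quote=False)

print(full)
print(minimal)
# Tom &amp; &quot;Jerry&quot; &lt;it&#x27;s&gt;
# Tom &amp; "Jerry" &lt;it's&gt;
```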
So in Python 3.4 and later you can use
html.escape(text).encode('ascii', 'xmlcharrefreplace').decode()
to convert special characters to HTML entities, and
html.unescape(text)
to convert HTML entities back to plain-text representations.
$ python3 -c "
> import html
> print(
> html.unescape('&amp;&copy;&mdash;')
> )"
&©—
$ python3 -c "
> import html
> print(
> html.escape('&©—')
> )"
&amp;©—
$ python2 -c "
> from HTMLParser import HTMLParser
> print(
> HTMLParser().unescape('&amp;&copy;&mdash;')
> )"
&©—
$ python2 -c "
> import cgi
> print(
> cgi.escape('&©—')
> )"
&amp;©—
HTML only strictly requires & (ampersand) and < (left angle bracket / less-than sign) to be escaped in text content: https://html.spec.whatwg.org/multipage/parsing.html#data-state
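As a small illustration of that minimum, the sketch below escapes only those two characters. The function name is hypothetical, and it is meant for text between tags only; inside attribute values the quote character needs escaping as well, so html.escape remains the safer default.

```python
def escape_text_minimal(text):
    """Escape only '&' and '<', the minimum for text between tags."""
    # '&' must go first, otherwise the '&' in '&lt;' would be escaped again.
    return text.replace('&', '&amp;').replace('<', '&lt;')

print(escape_text_minimal('a < b && b > c'))
# a &lt; b &amp;&amp; b > c
```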
For Python 3, use html.unescape():
import html
s = "&amp;"
decoded = html.unescape(s)
# &
If someone like me is out there wondering why some entity numbers (codes) like &#153; (for the trademark symbol) or &#128; (for the euro symbol) are not converted properly, the reason is that those numbers are Windows-1252 byte values rather than Unicode code points: in ISO-8859-1 (and in Unicode) the range 128-159 holds control characters, so those symbols are not defined there.
Also note that the default character set as of HTML5 is UTF-8; it was ISO-8859-1 for HTML4.
So we will have to work around this somehow (find and replace those references first).
Reference (starting point) from Mozilla's documentation
https://developer.mozilla.org/en-US/docs/Web/Guide/Localizations_and_character_encodings
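On Python 3.4 and later, html.unescape already applies the HTML5 rule that maps these Windows-1252 numbers to the intended characters. Where that is not available, the find-and-replace step could look roughly like the sketch below; the function name and the regular expression are assumptions for illustration, not taken from the referenced documentation.

```python
import html
import re

_DECIMAL_REF = re.compile(r'&#([0-9]+);')

def fix_cp1252_refs(text):
    """Rewrite decimal references 128-159 as the Windows-1252 character."""
    def repl(match):
        number = int(match.group(1))
        if 128 <= number <= 159:
            try:
                return bytes([number]).decode('cp1252')
            except UnicodeDecodeError:
                pass  # a few bytes (e.g. 129) are undefined in cp1252
        return match.group(0)  # leave every other reference untouched
    return _DECIMAL_REF.sub(repl, text)

print(fix_cp1252_refs('&#128;100 &#153; &#169;'))
# €100 ™ &#169;
print(html.unescape('&#128;100 &#153;'))
# €100 ™
```

References outside the 128-159 range pass through unchanged, so the result can still be handed to a regular unescape step afterwards.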
I used the following function to convert unicode text ripped from an xls file into an html file while preserving the special characters found in the xls file:
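import html

def html_wr(f, dat):
    ''' write dat to file f as html
        . file is assumed to be opened in binary format
        . if dat is nul it is replaced with non breakable space
        . non-ascii characters are translated to xml
    '''
    if not dat:
        dat = '&nbsp;'
    try:
        # pure-ASCII data is written unchanged (it is not HTML-escaped)
        f.write(dat.encode('ascii'))
    except UnicodeEncodeError:
        # escape markup characters, then turn non-ASCII into &#NNN; references
        f.write(html.escape(dat).encode('ascii', 'xmlcharrefreplace'))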
hope this is useful to somebody
#!/usr/bin/env python3
import fileinput
import html
for line in fileinput.input():
    print(html.unescape(line.rstrip('\n')))