Convert HTML entities to Unicode and vice versa

How do you convert HTML entities to Unicode and vice versa in Python?

As to the "vice versa" (which I needed myself, leading me to find this question, which didn't help, and subsequently another site which had the answer):

u'some string'.encode('ascii', 'xmlcharrefreplace')

will return a plain string with any non-ASCII characters turned into XML (HTML) numeric character references.
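
In Python 3 the same call returns a bytes object rather than a str, so it usually needs a decode step afterwards. A minimal sketch, with a sample string of my own choosing rather than one from the answer:

```python
# Python 3: encode() returns bytes; decode back to str if text is needed.
sample = "café costs 5 €"  # illustrative input, not from the original answer
as_bytes = sample.encode("ascii", "xmlcharrefreplace")
as_text = as_bytes.decode("ascii")
print(as_text)  # caf&#233; costs 5 &#8364;
```

Note that this only replaces non-ASCII characters; markup characters such as & and < pass through unchanged.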

You need to have BeautifulSoup.

from BeautifulSoup import BeautifulStoneSoup
import cgi

def HTMLEntitiesToUnicode(text):
    """Converts HTML entities to unicode.  For example '&amp;' becomes '&'."""
    text = unicode(BeautifulStoneSoup(text, convertEntities=BeautifulStoneSoup.ALL_ENTITIES))
    return text

def unicodeToHTMLEntities(text):
    """Converts unicode to HTML entities.  For example '&' becomes '&amp;'."""
    text = cgi.escape(text).encode('ascii', 'xmlcharrefreplace')
    return text

text = "&amp;, &reg;, &lt;, &gt;, &cent;, &pound;, &yen;, &euro;, &sect;, &copy;"

uni = HTMLEntitiesToUnicode(text)
htmlent = unicodeToHTMLEntities(uni)

print uni
print htmlent
# &, ®, <, >, ¢, £, ¥, €, §, ©
# &amp;, &#174;, &lt;, &gt;, &#162;, &#163;, &#165;, &#8364;, &#167;, &#169;

Update for Python 2.7 and BeautifulSoup4

Unescape -- Unicode HTML to unicode with htmlparser (Python 2.7 standard lib):

>>> escaped = u'Monsieur le Cur&eacute; of the &laquo;Notre-Dame-de-Gr&acirc;ce&raquo; neighborhood'
>>> from HTMLParser import HTMLParser
>>> htmlparser = HTMLParser()
>>> unescaped = htmlparser.unescape(escaped)
>>> unescaped
u'Monsieur le Cur\xe9 of the \xabNotre-Dame-de-Gr\xe2ce\xbb neighborhood'
>>> print unescaped
Monsieur le Curé of the «Notre-Dame-de-Grâce» neighborhood

Unescape -- Unicode HTML to unicode with bs4 (BeautifulSoup4):

>>> html = '''<p>Monsieur le Cur&eacute; of the &laquo;Notre-Dame-de-Gr&acirc;ce&raquo; neighborhood</p>'''
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(html)
>>> soup.text
u'Monsieur le Cur\xe9 of the \xabNotre-Dame-de-Gr\xe2ce\xbb neighborhood'
>>> print soup.text
Monsieur le Curé of the «Notre-Dame-de-Grâce» neighborhood

Escape -- Unicode to unicode HTML with bs4 (BeautifulSoup4):

>>> unescaped = u'Monsieur le Curé of the «Notre-Dame-de-Grâce» neighborhood'
>>> from bs4.dammit import EntitySubstitution
>>> escaper = EntitySubstitution()
>>> escaped = escaper.substitute_html(unescaped)
>>> escaped
u'Monsieur le Cur&eacute; of the &laquo;Notre-Dame-de-Gr&acirc;ce&raquo; neighborhood'

As hekevintran's answer suggests, you may use cgi.escape(s) for encoding strings, but note that quote escaping is off by default in that function, so it may be a good idea to pass the quote=True keyword argument along with your string. Even with quote=True, the function won't escape single quotes ("'"). Because of these issues the function has been deprecated since version 3.2 (and it was removed in Python 3.8).

It's been suggested to use html.escape(s) instead of cgi.escape(s). (New in version 3.2.)
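
To make the difference in quote handling concrete, here is a small sketch of html.escape with and without its quote argument. The sample string is my own and only there for illustration:

```python
import html

# Illustrative input containing both kinds of quotes plus markup characters.
s = "<a href='x'>Tom & \"Jerry\"</a>"

# Default (quote=True): &, <, >, " and ' are all escaped.
full = html.escape(s)

# quote=False: only &, < and > are escaped; quotes are left alone.
no_quotes = html.escape(s, quote=False)

print(full)       # &lt;a href=&#x27;x&#x27;&gt;Tom &amp; &quot;Jerry&quot;&lt;/a&gt;
print(no_quotes)  # &lt;a href='x'&gt;Tom &amp; "Jerry"&lt;/a&gt;
```

Unlike cgi.escape, html.escape escapes quotes by default, and that includes the single quote.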

Also, html.unescape(s) was introduced in version 3.4.

So in Python 3.4 and later you can:

  • Use html.escape(text).encode('ascii', 'xmlcharrefreplace').decode() to convert special characters to HTML entities.
  • Use html.unescape(text) to convert HTML entities back to plain-text representations.
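
The two bullets above can be combined into a round trip. This is only a sketch: the helper names to_entities and from_entities are hypothetical, and the sample text is made up for the example.

```python
import html

def to_entities(text):
    """Escape markup characters, then turn non-ASCII into numeric references.

    html.escape() must run first; otherwise the '&' of each generated
    reference would itself be escaped to '&amp;'.
    """
    return html.escape(text).encode("ascii", "xmlcharrefreplace").decode("ascii")

def from_entities(text):
    """Convert named and numeric references back to plain characters."""
    return html.unescape(text)

original = "Curé & «Grâce» < 5 €"  # illustrative input
encoded = to_entities(original)
print(encoded)  # Cur&#233; &amp; &#171;Gr&#226;ce&#187; &lt; 5 &#8364;
print(from_entities(encoded) == original)  # True
```

The result of to_entities is pure ASCII, so it can be written out in any ASCII-compatible encoding without further care.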
$ python3 -c "
> import html
> print(
>     html.unescape('&amp;&#169;&#x2014;')
> )"
&©—

$ python3 -c "
> import html
> print(
>     html.escape('&©—')
> )"
&amp;©—

$ python2 -c "
> from HTMLParser import HTMLParser
> print(
>     HTMLParser().unescape('&amp;&#169;&#x2014;')
> )"
&©—

$ python2 -c "
> import cgi
> print(
>     cgi.escape('&©—')
> )"
&amp;©—

HTML only strictly requires & (ampersand) and < (left angle bracket / less-than sign) to be escaped. https://html.spec.whatwg.org/multipage/parsing.html#data-state

For Python 3, use html.unescape():

import html
s = "&amp;"
decoded = html.unescape(s)
# &

If someone like me is out there wondering why some entity numbers (codes) like &#153; (for the trademark symbol) and &#128; (for the euro symbol) are not decoded properly: those code points are not printable characters in ISO-8859-1 (or in Unicode, where 128 to 159 are control characters). The symbols sit at those positions only in Windows-1252, which is often confused with ISO-8859-1 but is a different encoding.

Also note that the default character set as of HTML5 is UTF-8; it was ISO-8859-1 for HTML4.

So we will have to work around this somehow (find and replace those references first).

Reference (starting point) from Mozilla's documentation:

https://developer.mozilla.org/en-US/docs/Web/Guide/Localizations_and_character_encodings
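
One possible way to do that find-and-replace step, as a rough sketch: reinterpret decimal references in the range 128 to 159 as Windows-1252 bytes. The function name fix_cp1252_references is hypothetical, and the sketch only handles decimal references (not hex ones like &#x99;). As far as I can tell, Python 3's html.unescape already applies this remapping itself, so the manual step mainly matters for other decoders.

```python
import html
import re

def fix_cp1252_references(text):
    """Replace &#128;..&#159; with the character Windows-1252 has at that byte."""
    def repl(match):
        code = int(match.group(1))
        if not 128 <= code <= 159:
            return match.group(0)  # leave all other references untouched
        try:
            return bytes([code]).decode("cp1252")
        except UnicodeDecodeError:
            return match.group(0)  # byte is undefined in cp1252 (e.g. 129)
    return re.sub(r"&#(\d+);", repl, text)

print(fix_cp1252_references("&#153; &#128; &#169;"))  # ™ € &#169;
print(html.unescape("&#153; &#128;"))                 # ™ €
```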

I used the following function to write unicode text ripped from an xls file into an html file while preserving the special characters found in the xls file:

import html

def html_wr(f, dat):
    ''' write dat to file f as html
        . file is assumed to be opened in binary mode
        . if dat is empty it is replaced with a non-breaking space
        . non-ascii characters are translated to xml character references
    '''
    if not dat:
        dat = '&nbsp;'
    try:
        f.write(dat.encode('ascii'))
    except UnicodeEncodeError:
        # dat contains non-ASCII characters: escape markup, then emit &#NNN; references
        f.write(html.escape(dat).encode('ascii', 'xmlcharrefreplace'))

Hope this is useful to somebody.

#!/usr/bin/env python3
import fileinput
import html

for line in fileinput.input():
    print(html.unescape(line.rstrip('\n')))
