Convert HTML entities to Unicode and vice versa

Question

How do you convert HTML entities to Unicode, and vice versa, in Python?

Answer 1

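As for the "vice versa" (which I needed myself, leading me to find this question, which didn't help, and then another site that had the answer):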

u'some string'.encode('ascii', 'xmlcharrefreplace')

will return a plain string with every non-ASCII character converted to an XML (HTML) numeric character reference.
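On Python 3 the same call works on any str, but it returns bytes rather than a str, so a decode step is needed to get text back. A minimal sketch (the sample string is made up for illustration); note that markup characters such as & and < are left alone by this step:

```python
# Python 3: str.encode() returns bytes, so decode to get a str again.
text = "café costs 5 €"  # hypothetical sample input

# Non-ASCII characters become decimal numeric character references.
as_bytes = text.encode("ascii", "xmlcharrefreplace")
as_text = as_bytes.decode("ascii")

print(as_bytes)
print(as_text)
```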

Answer 2

You need to have BeautifulSoup. (The code below is Python 2 and uses the BeautifulSoup 3 import path.)

from BeautifulSoup import BeautifulStoneSoup
import cgi

def HTMLEntitiesToUnicode(text):
    """Converts HTML entities to unicode.  For example '&amp;' becomes '&'."""
    text = unicode(BeautifulStoneSoup(text, convertEntities=BeautifulStoneSoup.ALL_ENTITIES))
    return text

def unicodeToHTMLEntities(text):
    """Converts unicode to HTML entities.  For example '&' becomes '&amp;'."""
    text = cgi.escape(text).encode('ascii', 'xmlcharrefreplace')
    return text

text = "&amp;, &reg;, &lt;, &gt;, &cent;, &pound;, &yen;, &euro;, &sect;, &copy;"

uni = HTMLEntitiesToUnicode(text)
htmlent = unicodeToHTMLEntities(uni)

print uni
print htmlent
# &, ®, <, >, ¢, £, ¥, €, §, ©
# &amp;, &#174;, &lt;, &gt;, &#162;, &#163;, &#165;, &#8364;, &#167;, &#169;

Answer 3

Update for Python 2.7 and BeautifulSoup4

Unescape -- Unicode HTML to unicode with htmlparser (Python 2.7 standard library):

>>> escaped = u'Monsieur le Cur&eacute; of the &laquo;Notre-Dame-de-Gr&acirc;ce&raquo; neighborhood'
>>> from HTMLParser import HTMLParser
>>> htmlparser = HTMLParser()
>>> unescaped = htmlparser.unescape(escaped)
>>> unescaped
u'Monsieur le Cur\xe9 of the \xabNotre-Dame-de-Gr\xe2ce\xbb neighborhood'
>>> print unescaped
Monsieur le Curé of the «Notre-Dame-de-Grâce» neighborhood
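For readers on Python 3: HTMLParser.unescape was deprecated in 3.4 and removed in 3.9, and html.unescape is the standard-library replacement. A minimal sketch using the same sample string:

```python
import html

escaped = "Monsieur le Cur&eacute; of the &laquo;Notre-Dame-de-Gr&acirc;ce&raquo; neighborhood"

# html.unescape() resolves named and numeric character references.
unescaped = html.unescape(escaped)
print(unescaped)
```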

Unescape -- Unicode HTML to unicode with bs4 (BeautifulSoup4):

>>> html = '''<p>Monsieur le Cur&eacute; of the &laquo;Notre-Dame-de-Gr&acirc;ce&raquo; neighborhood</p>'''
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(html)
>>> soup.text
u'Monsieur le Cur\xe9 of the \xabNotre-Dame-de-Gr\xe2ce\xbb neighborhood'
>>> print soup.text
Monsieur le Curé of the «Notre-Dame-de-Grâce» neighborhood

Escape -- Unicode to unicode HTML with bs4 (BeautifulSoup4):

>>> unescaped = u'Monsieur le Curé of the «Notre-Dame-de-Grâce» neighborhood'
>>> from bs4.dammit import EntitySubstitution
>>> escaper = EntitySubstitution()
>>> escaped = escaper.substitute_html(unescaped)
>>> escaped
u'Monsieur le Cur&eacute; of the &laquo;Notre-Dame-de-Gr&acirc;ce&raquo; neighborhood'
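If bs4 is not available, a similar named-entity result can be approximated with the standard library alone. This is a sketch of a different technique, not bs4's implementation: it looks each character up in html.entities.codepoint2name and falls back to a numeric reference. The helper name escape_named is hypothetical.

```python
from html.entities import codepoint2name

def escape_named(text):
    """Replace characters with named entities where one exists (hypothetical helper)."""
    out = []
    for ch in text:
        name = codepoint2name.get(ord(ch))
        if name is not None:
            # Covers &, <, >, " as well as Latin-1 and other named characters.
            out.append("&" + name + ";")
        elif ord(ch) > 127:
            # No named entity: fall back to a decimal numeric reference.
            out.append("&#%d;" % ord(ch))
        else:
            out.append(ch)
    return "".join(out)

unescaped = "Monsieur le Curé of the «Notre-Dame-de-Grâce» neighborhood"
escaped = escape_named(unescaped)
print(escaped)
```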

Answer 4

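As hekevintran's answer suggests, you may use cgi.escape(s) to encode strings, but note that quote escaping is off by default in that function, so it may be a good idea to pass the quote=True keyword argument along with your string. Even with quote=True, however, the function won't escape single quotes ("'"). (Because of these issues the function has been deprecated since version 3.2.)

The suggested replacement for cgi.escape(s) is html.escape(s). (New in version 3.2.)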
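The quoting difference is easy to check with html.escape, which escapes both quote characters unless quote=False is passed. A small sketch (the sample string is made up):

```python
import html

sample = "it's <b>\"bold\"</b> & more"  # hypothetical sample input

# Default (quote=True): &, <, >, double and single quotes are all escaped.
with_quotes = html.escape(sample)

# quote=False: only &, < and > are escaped.
without_quotes = html.escape(sample, quote=False)

print(with_quotes)
print(without_quotes)
```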

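html.unescape(s) was introduced in version 3.4.

So, in Python 3.4 and later you can:

Use html.escape(text).encode('ascii', 'xmlcharrefreplace').decode() to convert special characters to HTML entities.
Use html.unescape(text) to convert HTML entities back to their plain-text representation.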
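Put together, the two calls above can be wrapped as a round trip. A minimal sketch; the function names are hypothetical and the sample text reuses the characters from Answer 2:

```python
import html

def unicode_to_html_entities(text):
    """Escape markup characters, then turn non-ASCII into numeric references."""
    return html.escape(text).encode("ascii", "xmlcharrefreplace").decode("ascii")

def html_entities_to_unicode(text):
    """Resolve named and numeric character references back to characters."""
    return html.unescape(text)

original = "&, ®, <, >, ¢, £, ¥, €, §, ©"
encoded = unicode_to_html_entities(original)
decoded = html_entities_to_unicode(encoded)

print(encoded)
print(decoded)
```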

Answer 5

$ python3 -c "
> import html
> print(
>     html.unescape('&amp;&#169;&#x2014;')
> )"
&©—

$ python3 -c "
> import html
> print(
>     html.escape('&©—')
> )"
&amp;©—

$ python2 -c "
> from HTMLParser import HTMLParser
> print(
>     HTMLParser().unescape('&amp;&#169;&#x2014;')
> )"
&©—

$ python2 -c "
> import cgi
> print(
>     cgi.escape('&©—')
> )"
&amp;©—

HTML only strictly requires & (ampersand) and < (left angle bracket / less-than sign) to be escaped. https://html.spec.whatwg.org/multipage/parsing.html#data-state
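To illustrate that point, here is a sketch of an escaper that touches only those two characters. The helper name escape_minimal is hypothetical, and it is only meant for text content; attribute values also need their quote character escaped, which html.escape handles.

```python
import html

def escape_minimal(text):
    """Escape only '&' and '<' (hypothetical helper, text content only)."""
    # '&' must be replaced first so the '&' in '&lt;' is not escaped again.
    return text.replace("&", "&amp;").replace("<", "&lt;")

sample = "a < b && c > d © 2019"
escaped = escape_minimal(sample)
print(escaped)

# The result still round-trips through the standard unescaper.
restored = html.unescape(escaped)
```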

Answer 6

For Python 3, use html.unescape():

import html
s = "&amp;"
decoded = html.unescape(s)
# &

Answer 7

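If someone like me is wondering why some entity numbers (codes), such as the ones used for the trademark symbol and the euro symbol, are not encoded properly: the reason is that those characters are not defined in ISO-8859-1. (ISO-8859-1 is often conflated with Windows-1252, which does define them.)

Also note that the default character set as of HTML5 is UTF-8; for HTML4 it was ISO-8859-1.

So we will have to work around this somehow (find and replace those references first).

Reference (a starting point) from the Mozilla documentation: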

https://developer.mozilla.org/en-US/docs/Web/Guide/Localizations_and_character_encodings
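As an illustration, assume the references in question are ones like &#153; and &#128; (an assumption; these are the Windows-1252 byte values for the trademark and euro signs, and the exact codes are not shown above). In Python 3, html.unescape already remaps such references the way HTML5 parsers do, and the same mapping can be reproduced by decoding the byte as cp1252. The helper name cp1252_char is hypothetical.

```python
import html

# Assumed examples: 153 and 128 are Windows-1252 byte values, not Unicode code points
# for these symbols (the real code points are U+2122 and U+20AC).
tm_ref = "&#153;"
euro_ref = "&#128;"

# html.unescape() follows the HTML5 rule that remaps these references.
tm = html.unescape(tm_ref)
euro = html.unescape(euro_ref)

def cp1252_char(code):
    """Interpret a single byte value as Windows-1252 (hypothetical helper)."""
    return bytes([code]).decode("cp1252")

print(tm, euro)
print(cp1252_char(153), cp1252_char(128))
```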

Answer 8

I use the following function to write unicode extracted from an xls file into an html file while preserving the special characters found in the xls file:

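import html

def html_wr(f, dat):
    ''' write dat to file f as html
        . file is assumed to be opened in binary format
        . if dat is empty or None it is replaced with a non-breaking space
        . non-ascii characters are translated to xml character references
    '''
    if not dat:
        dat = '&nbsp;'
    try:
        # pure-ASCII data is written unchanged (it is not HTML-escaped)
        f.write(dat.encode('ascii'))
    except UnicodeEncodeError:
        # otherwise escape markup characters, then emit numeric references
        f.write(html.escape(dat).encode('ascii', 'xmlcharrefreplace'))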
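One thing to keep in mind with the function above is that pure-ASCII input is written without HTML escaping, so a cell containing "<" goes out raw. A possible variant that always escapes is sketched below; the name html_wr_always_escape is hypothetical, and io.BytesIO stands in for a file opened in binary mode.

```python
import html
import io

def html_wr_always_escape(f, dat):
    """Write dat to binary file f, always HTML-escaping it (hypothetical variant)."""
    if not dat:
        # Empty cells become a non-breaking space entity, as in the original.
        f.write(b"&nbsp;")
        return
    f.write(html.escape(dat).encode("ascii", "xmlcharrefreplace"))

buf = io.BytesIO()  # stand-in for open(path, "wb")
html_wr_always_escape(buf, "a < b, 5 €")
html_wr_always_escape(buf, "")
result = buf.getvalue()
print(result)
```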

Hope this is useful to somebody.

Answer 9

#!/usr/bin/env python3
import fileinput
import html

for line in fileinput.input():
    print(html.unescape(line.rstrip('\n')))
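The script above unescapes every line read from the files named on the command line, or from standard input when none are given. For use inside a program, the same loop can be sketched as a generator over any iterable of lines; the name unescape_lines and the sample lines are hypothetical.

```python
import html

def unescape_lines(lines):
    """Yield each line with its trailing newline removed and entities resolved."""
    for line in lines:
        yield html.unescape(line.rstrip("\n"))

sample = ["Tom &amp; Jerry\n", "3 &lt; 5\n", "&copy; 2020"]
result = list(unescape_lines(sample))
print(result)
```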

9 answers

Answer 1: score 103, 2010-04-17 06:13:38
Answer 2: score 31, accepted, 2009-03-31 15:57:56
Answer 3: score 21, 2015-03-03 08:43:24
Answer 4: score 13, 2014-07-09 00:02:40
Answer 5: score 6, 2019-10-08 17:24:02
Answer 6: score 3, 2020-05-06 23:22:20
Answer 7: score 2, 2018-02-08 15:14:15
Answer 8: score 1, 2017-05-17 14:18:29
Answer 9: score 0, 2020-04-16 15:17:21
