How do I unescape apostrophes and the like in Python?

Question

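I have a string containing symbols like this:

&#39;

That is evidently an apostrophe.

I tried saxutils.unescape() without any luck, and I also tried urllib.unquote().

How can I decode this? Thanks!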

Answer 1

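Take a look at this question. What you are looking for is "HTML entity decoding". Typically you will find a function named something like "htmldecode" that does what you want. Django and Cheetah both provide such a function, as does BeautifulSoup.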

If you would rather not use a library and all of your entities are numeric, the other answer will work very well.
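
As a present-day illustration of "HTML entity decoding": Python 3.4 and later ship html.unescape in the standard library. The sketch below assumes a Python 3 interpreter, which the question (from 2009) does not. It also shows why the two calls tried in the question leave &#39; untouched: saxutils.unescape only handles &amp;amp;, &amp;lt; and &amp;gt; unless it is given an extra mapping, and unquote only decodes %xx URL escapes.

```python
import html
from urllib.parse import unquote          # Python 3 location of urllib.unquote
from xml.sax.saxutils import unescape

s = "l&#39;eau"

sax_result = unescape(s)      # no change: numeric entities are not handled
url_result = unquote(s)       # no change: there are no %xx escapes in s
decoded = html.unescape(s)    # handles named, decimal, and hex entities

print(sax_result, url_result, decoded)
```

If staying with saxutils is preferred, passing a mapping such as {"&#39;": "'"} as the second argument to unescape also works, but every entity has to be listed by hand.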

Answer 2

Try this (found here):

from htmlentitydefs import name2codepoint as n2cp
import re

def decode_htmlentities(string):
    """
    Decode HTML entities–hex, decimal, or named–in a string
    @see http://snippets.dzone.com/posts/show/4569

    >>> u = u'E tu vivrai nel terrore - L&#x27;aldil&#xE0; (1981)'
    >>> print decode_htmlentities(u).encode('UTF-8')
    E tu vivrai nel terrore - L'aldilà (1981)
    >>> print decode_htmlentities("l&#39;eau")
    l'eau
    >>> print decode_htmlentities("foo &lt; bar")                
    foo < bar
    """
    def substitute_entity(match):
        ent = match.group(3)
        if match.group(1) == "#":
            # decoding by number
            if match.group(2) == '':
                # number is in decimal
                return unichr(int(ent))
            elif match.group(2) == 'x':
                # number is in hex
                return unichr(int('0x'+ent, 16))
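        else:
            # they were using a name; group(2) may have swallowed a leading
            # "x" (as in "&xi;"), so put it back before the lookup
            cp = n2cp.get(match.group(2) + ent)
            if cp: return unichr(cp)
            else: return match.group()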

    entity_re = re.compile(r'&(#?)(x?)(\w+);')
    return entity_re.subn(substitute_entity, string)[0]
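
The snippet above targets Python 2 (htmlentitydefs, unichr, and the print statement). A rough Python 3 equivalent might look like the sketch below. The function name decode_htmlentities_py3 and the regular expression are illustrative choices and not part of the original snippet. Unknown names and out-of-range code points are left untouched rather than raising.

```python
import re
from html.entities import name2codepoint

# One alternative per entity form: hex (&#x27;), decimal (&#39;), named (&lt;)
_ENTITY_RE = re.compile(r"&(?:#[xX]([0-9a-fA-F]+)|#([0-9]+)|(\w+));")


def decode_htmlentities_py3(text):
    """Decode hex, decimal, and named HTML entities in a str."""

    def substitute_entity(match):
        hex_digits, dec_digits, name = match.groups()
        try:
            if hex_digits:
                return chr(int(hex_digits, 16))
            if dec_digits:
                return chr(int(dec_digits))
        except (ValueError, OverflowError):
            # code point outside the Unicode range: keep the original text
            return match.group(0)
        cp = name2codepoint.get(name)
        return chr(cp) if cp is not None else match.group(0)

    return _ENTITY_RE.sub(substitute_entity, text)


print(decode_htmlentities_py3("E tu vivrai nel terrore - L&#x27;aldil&#xE0; (1981)"))
```

Keeping the three entity forms in separate capture groups avoids the ambiguity between a hex prefix and a name that happens to start with "x".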

Answer 3

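The most robust solution seems to be this function by Python luminary Fredrik Lundh. It is not the shortest solution, but it handles named entities as well as hex and decimal codes.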


Solution 1
2, accepted, 2009-05-03 03:54:01

Solution 2
2, 2009-05-03 11:12:42

Solution 3
1, 2009-05-03 08:53:22
