How do I unescape special characters from BeautifulSoup output?

Question

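I am having trouble with special characters such as ° and ®, which represent the degree sign (as in degrees Fahrenheit) and the registered trademark sign.

When I print a string that contains these special characters, I get output like this: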

Preheat oven to 350&deg; F
Welcome to Lorem Ipsum Inc&reg;

Is there a way to output the actual characters instead of their entity codes? Please let me know.

Answer 1

$ python -c'from BeautifulSoup import BeautifulSoup
> print BeautifulSoup("""<html>Preheat oven to 350&deg; F
> Welcome to Lorem Ipsum Inc&reg;""",
> convertEntities=BeautifulSoup.HTML_ENTITIES).contents[0].string'
Preheat oven to 350° F
Welcome to Lorem Ipsum Inc®

Answer 2

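Here is a script that tolerantly unescapes HTML character references from web pages. It assumes each reference ends with a semicolon, as in &deg; (for example, Preheat oven to 350&deg; F):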

from htmlentitydefs import name2codepoint

# Get the whitespace characters
nums_dict = {0: ' ', 1: '\t', 2: '\r', 3: '\n'}
chars_dict = dict((x, y) for y, x in nums_dict.items())
nums_dict2XML = {0: '&#32;', 1: '&#09;', 2: '&#13;', 3: '&#10;'}
chars_dict2XML = dict((nums_dict[i], nums_dict2XML[i]) for i in nums_dict2XML)

s = '1234567890ABCDEF'
hex_dict = {}
for i in s:
    hex_dict[i.lower()] = None
    hex_dict[i.upper()] = None
del s

def is_hex(s):
    if not s:
        return False

    for i in s:
        if i not in hex_dict:
            return False
    return True

class Unescape:
    def __init__(self, s, ignore_whitespace=False):
        # Converts HTML character references into a unicode string to allow manipulation
        self.s = s
        self.ignore_whitespace = ignore_whitespace
        self.lst = self.process(ignore_whitespace)

    def process(self, ignore_whitespace):
        def get_char(c):
            if ignore_whitespace:
                return c
            else:
                if c in chars_dict:
                    return chars_dict[c]
                else: return c

        r = []
        lst = self.s.split('&')
        xx = 0
        yy = 0
        for item in lst:
            if xx:
                split = item.split(';')
                if split[0].lower() in name2codepoint:
                    # A character reference, e.g. '&amp;'
                    a = unichr(name2codepoint[split[0].lower()])
                    r.append(get_char(a)) # TOKEN CHECK?
                    r.append(';'.join(split[1:]))

                elif split[0] and split[0][0] == '#' and split[0][1:].isdigit():
                    # A character number e.g. '&#52;'
                    a = unichr(int(split[0][1:]))
                    r.append(get_char(a))
                    r.append(';'.join(split[1:]))

                elif split[0] and split[0][0] == '#' and split[0][1:2].lower() == 'x' and is_hex(split[0][2:]):
                    # A hexadecimal encoded character
                    a = unichr(int(split[0][2:].lower(), 16)) # Hex -> base 16
                    r.append(get_char(a))
                    r.append(';'.join(split[1:]))

                else:
                    r.append('&%s' % ';'.join(split))
            else:
                r.append(item)
            xx += 1
            yy += len(r[-1])
        return r

    def get_value(self):
        # Convert back into HTML, preserving
        # whitespace if self.ignore_whitespace is `False`
        r = []
        for i in self.lst:
            if type(i) == int:
                r.append(nums_dict2XML[i])
            else:
                r.append(i)
        return ''.join(r)

def unescape(s):
    # Get the string value from escaped HTML `s`, ignoring
    # explicit whitespace like tabs/spaces etc
    inst = Unescape(s, ignore_whitespace=True)
    return ''.join(inst.lst)

if __name__ == '__main__':
    print unescape('Preheat oven to 350&deg; F')
    print unescape('Welcome to Lorem Ipsum Inc&reg;')

EDIT: Here is a simpler solution that replaces only named character references with their characters, not numeric &#xx; references:

from htmlentitydefs import name2codepoint

def unescape(s):
    for name in name2codepoint:
        s = s.replace('&%s;' % name, unichr(name2codepoint[name]))
    return s

print unescape('Preheat oven to 350&deg; F')
print unescape('Welcome to Lorem Ipsum Inc&reg;')
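
The two scripts above target Python 2 (htmlentitydefs, unichr, the print statement). As a rough Python 3 sketch of the same idea, the version below handles named, decimal and hexadecimal references in a single left-to-right pass, so the result of one substitution is never rescanned. The function name unescape_refs and the regular expression are illustrative assumptions, not part of the original answer; like the script above, it only recognises references that end with a semicolon.

```python
import re
from html.entities import name2codepoint  # Python 3 location of htmlentitydefs

# Matches &name;, &#123; and &#x1F; style references (semicolon required).
_REF = re.compile(r'&(#[0-9]+|#[xX][0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*);')

def unescape_refs(s):
    """Replace character references in s with the characters they name."""
    def repl(m):
        ref = m.group(1)
        try:
            if ref[:2] in ('#x', '#X'):
                return chr(int(ref[2:], 16))   # hexadecimal reference
            if ref[0] == '#':
                return chr(int(ref[1:]))       # decimal reference
            return chr(name2codepoint[ref])    # named reference
        except (KeyError, ValueError, OverflowError):
            return m.group(0)  # unknown or out-of-range: leave untouched
    return _REF.sub(repl, s)

print(unescape_refs('Preheat oven to 350&deg; F'))
print(unescape_refs('Welcome to Lorem Ipsum Inc&reg;'))
```

Because the substitution is done in one pass, input such as '&amp;deg;' comes out as the literal text '&deg;' rather than being unescaped twice, which a chain of str.replace calls cannot guarantee.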

Answer 3

In BeautifulSoup 4:

from bs4 import BeautifulSoup

my_text = """Preheat oven to 350&deg; F
Welcome to Lorem Ipsum Inc&reg; """

soup = BeautifulSoup(my_text, 'html.parser')

print(soup)

Result:

Preheat oven to 350° F
Welcome to Lorem Ipsum Inc®

Answer 4

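I suspect that somewhere a program is referencing &deg and &reg without the trailing semicolon. Try using "&deg" + ";" and "&reg" + ";" in the HTML file, if it is in fact an HTML file. Please explain the context.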
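
If missing semicolons do turn out to be the cause, one possible workaround on Python 3.4 or later is the standard library's html.unescape, which follows the HTML5 rules and so also accepts a set of legacy named references without the semicolon. This is a sketch under that assumption (Python 3, input strings made up to match the question), not something the original answer proposed:

```python
import html

# html.unescape accepts legacy references such as &deg and &reg
# even when the terminating semicolon is missing.
print(html.unescape('Preheat oven to 350&deg F'))
print(html.unescape('Welcome to Lorem Ipsum Inc&reg'))

# Well-formed references are handled as well.
print(html.unescape('Preheat oven to 350&deg; F'))
```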

4 solutions

Solution 1
Score 8, accepted, 2010-05-20 04:55:58

Solution 2
Score 2, 2010-05-19 12:39:00

Solution 3
Score 1, 2016-07-02 07:11:18

Solution 4
Score 0, 2010-05-19 12:20:17
