Python: unescape special characters without splitting data

Question

I have made a simple HTML parser which is basically a direct copy from the docs. I am having trouble unescaping special characters without also splitting up data into multiple chunks.

Here is my code with a simple example:

from HTMLParser import HTMLParser

class MyHTMLParser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.data = []

    def handle_starttag(self, tag, attrs):
        #print (tag,attrs)
        pass

    def handle_endtag(self, tag):
        #print (tag)
        pass

    def handle_data(self, data):
        self.data.append(data)

    def handle_charref(self, ref):
        self.handle_entityref("#" + ref)

    def handle_entityref(self, ref):
        self.handle_data(self.unescape("&%s;" % ref))



n = "<strong>I &lt;3s U &amp; you luvz me</strong>"


parser = MyHTMLParser()
parser.feed(n)
parser.close()
data = parser.data
print(data)

The issue is that this returns 5 separate bits of data:

['I ', u'<', '3s U ', u'&', ' you luvz me']

What I want is the single string:

['I <3s U & you luvz me']

Thanks, JP

Answer 1

Join the list of strings using str.join:

>>> ''.join(['I ', u'<', '3s U ', u'&', ' you luvz me'])
u'I <3s U & you luvz me'
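On Python 3 the joining step can live inside the parser itself. The following is a minimal sketch, assuming Python 3.5 or later, where `html.parser.HTMLParser` accepts `convert_charrefs=True` and unescapes references before calling `handle_data`. The names `TextCollector` and `extract_text` are hypothetical and not part of the question's code.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collects text chunks and joins them on request (hypothetical helper)."""

    def __init__(self):
        # convert_charrefs=True (the default since Python 3.5) makes the
        # parser unescape references such as &lt; before calling handle_data.
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

    def get_text(self):
        # Joining makes the result independent of how the text was chunked.
        return ''.join(self.chunks)


def extract_text(markup):
    parser = TextCollector()
    parser.feed(markup)
    parser.close()  # flush any text the parser is still buffering
    return parser.get_text()


print(extract_text("<strong>I &lt;3s U &amp; you luvz me</strong>"))
```

Because the text is joined at the end, the caller gets one string no matter how many times `handle_data` was called.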

Alternatively, you can use an external library such as lxml:

>>> import lxml.html
>>> n = "<strong>I &lt;3s U &amp; you luvz me</strong>"
>>> root = lxml.html.fromstring(n)
>>> root.text_content()
'I <3s U & you luvz me'

Answer 2

Remember that the purpose of HTMLParser is to let you build a document tree from an input. If you don't care at all about the document's structure, then the str.join solution that @falsetru gives will be fine. You can be certain that all element tags and comments will be filtered out.

However, if you do need the structure for more complex scenarios, then you have to build a document tree. The handle_starttag and handle_endtag methods exist for this purpose.

First we need a basic tree that can hold some information.

class Element:
    def __init__(self, parent, tag, attrs=None):
        self.parent = parent
        self.tag = tag
        self.children = []
        self.attrs = attrs or []
        self.data = ''

Now you need to make the HTMLParser create a new node on every handle_starttag and move up the tree on every handle_endtag. We also pass the parsed data to the current node instead of holding it in the parser.

class MyHTMLParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.root = Element(None, '__DOCROOT__') # Special root node for us
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        newel = Element(self.current, tag, attrs)
        self.current.children.append(newel)
        self.current = newel

    def handle_endtag(self, tag):
        self.current = self.current.parent

    def handle_data(self, data):
        self.current.data += data

    def handle_charref(self, ref): # No changes here
        self.handle_entityref('#' + ref)

    def handle_entityref(self, ref): # No changes here either
        self.handle_data(self.unescape("&%s;" % ref))

Now you can access the tree through the parser's root attribute to get the data from any element you like. For example:

n = '<strong>I &lt;3s U &amp; you luvz me</strong>'
p = MyHTMLParser()
p.feed(n)
p.close()

def print_tree(node, indent=0):
    print('    ' * indent + node.tag)
    print('    ' * indent + '  ' + node.data)
    for c in node.children:
        print_tree(c, indent + 1)

print_tree(p.root)

This will give you:

__DOCROOT__

    strong
      I <3s U & you luvz me

If instead you parsed n = '<html><head><title>Test</title></head><body><h1>I &lt;3s U &amp; you luvz me</h1></body></html>', you would get:

__DOCROOT__

    html

        head

            title
              Test
        body

            h1
              I <3s U & you luvz me

The next step is to make the tree building robust so that it handles cases like mismatched or implicit end tags. You will also want to add some convenient find('tag')-like methods on Element for traversing the tree. Do it well enough and you'll have made the next BeautifulSoup.
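As a rough illustration of those two next steps, here is a minimal sketch assuming Python 3.5 or later, where `convert_charrefs=True` hands already unescaped text to `handle_data`. The names `TreeBuilder`, `VOID_TAGS`, `find` and `find_all` are hypothetical additions, and the recovery rule (walk up to the nearest matching open element, ignore end tags that match nothing) is one simple choice, not the full HTML5 tree construction algorithm.

```python
from html.parser import HTMLParser

# Elements that never get an end tag in HTML (a partial, illustrative list).
VOID_TAGS = {'br', 'hr', 'img', 'input', 'link', 'meta'}


class Element:
    def __init__(self, parent, tag, attrs=None):
        self.parent = parent
        self.tag = tag
        self.children = []
        self.attrs = attrs or []
        self.data = ''

    def find_all(self, tag):
        """Yield every descendant with the given tag, depth first."""
        for child in self.children:
            if child.tag == tag:
                yield child
            yield from child.find_all(tag)

    def find(self, tag):
        """Return the first matching descendant, or None."""
        return next(self.find_all(tag), None)


class TreeBuilder(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.root = Element(None, '__DOCROOT__')
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Element(self.current, tag, attrs)
        self.current.children.append(node)
        if tag not in VOID_TAGS:  # void elements never become the current node
            self.current = node

    def handle_endtag(self, tag):
        # Walk up to the nearest open element with this tag; elements passed
        # on the way are treated as implicitly closed.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent
        # Otherwise the end tag matches nothing that is open and is ignored.

    def handle_data(self, data):
        self.current.data += data


builder = TreeBuilder()
builder.feed('<ul><li>one<li>two &amp; three</ul><p>tail</b></p>')
builder.close()
print([li.data for li in builder.root.find_all('li')])
print(builder.root.find('p').data)
```

In this sketch the unclosed `li` elements are closed when `</ul>` arrives, and the stray `</b>` is dropped instead of moving the parser above the `p` element. The second `li` still ends up nested inside the first one, since implicit closing on start tags is not handled here.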

Answer 3

You can refer to this answer and adapt its html_to_text function to your needs.

from HTMLParser import HTMLParser
n = "<strong>I &lt;3s U &amp; you luvz me</strong>"

class MLStripper(HTMLParser):
    def __init__(self):
        self.reset()
        self.fed = []
    def handle_data(self, d):
        self.fed.append(d)
    def handle_entityref(self, name):
        self.fed.append('&%s;' % name)
    def get_data(self):
        return ''.join(self.fed)

def html_to_text(html):
    s = MLStripper()
    s.feed(html)
    return HTMLParser().unescape(s.get_data())

print html_to_text(n)

Output:

I <3s U & you luvz me
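The code above targets Python 2. A possible Python 3 port is sketched below, assuming `html.unescape` as the replacement for `HTMLParser().unescape` and `convert_charrefs=False` so that references still reach the handlers. The `handle_charref` override is an addition that is not in the answer: as far as I can tell, the original class does not collect numeric references such as `&#60;`.

```python
import html
from html.parser import HTMLParser


class MLStripper(HTMLParser):
    def __init__(self):
        # convert_charrefs=False keeps references out of handle_data so that
        # they are passed to the two handlers below instead.
        super().__init__(convert_charrefs=False)
        self.fed = []

    def handle_data(self, d):
        self.fed.append(d)

    def handle_entityref(self, name):
        # A named reference such as &amp; is stored still escaped.
        self.fed.append('&%s;' % name)

    def handle_charref(self, name):
        # A numeric reference such as &#60; or &#x3c; is stored still escaped.
        self.fed.append('&#%s;' % name)

    def get_data(self):
        return ''.join(self.fed)


def html_to_text(markup):
    s = MLStripper()
    s.feed(markup)
    s.close()
    # Unescape once, after joining, so the text is never split by entities.
    return html.unescape(s.get_data())


print(html_to_text("<strong>I &lt;3s U &amp; you luvz me &#60;3</strong>"))
```

Deferring the unescape until after the join keeps the approach of the answer: tags are stripped first, and entities are decoded in a single pass over the complete string.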


3 answers:

Answer 1: 3 (accepted), 2014-01-02 03:58:05
Answer 2: 1, 2014-01-02 07:57:46
Answer 3: 1, 2014-01-02 08:21:20
