Python: unescape special characters without splitting data

I have made a simple HTML parser, which is basically a direct copy from the docs. I am having trouble unescaping special characters without also splitting the data into multiple chunks.

Here is my code with a simple example:

from HTMLParser import HTMLParser

class MyHTMLParser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.data = []

    def handle_starttag(self, tag, attrs):
        #print (tag,attrs)
        pass

    def handle_endtag(self, tag):
        #print (tag)
        pass

    def handle_data(self, data):
        self.data.append(data)

    def handle_charref(self, ref):
        self.handle_entityref("#" + ref)

    def handle_entityref(self, ref):
        self.handle_data(self.unescape("&%s;" % ref))



n = "<strong>I &lt;3s U &amp; you luvz me</strong>"


parser = MyHTMLParser()
parser.feed(n)
parser.close()
data = parser.data
print(data)

The issue is that this returns five separate pieces of data:

['I ', u'<', '3s U ', u'&', ' you luvz me']

What I want is the single string:

['I <3s U & you luvz me']

Thanks, JP

Join the list of strings using str.join:

>>> ''.join(['I ', u'<', '3s U ', u'&', ' you luvz me'])
u'I <3s U & you luvz me'
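On Python 3 the join can live inside the parser itself. The following is a minimal sketch, assuming Python 3.5 or later, where html.parser.HTMLParser converts character references before calling handle_data when convert_charrefs is true. The class name TextCollector and its get_text method are hypothetical names chosen for illustration.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collect text chunks and join them on request (hypothetical helper)."""

    def __init__(self):
        # convert_charrefs=True (the default since Python 3.5) unescapes
        # references such as &lt; before handle_data is called.
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

    def get_text(self):
        # Join at the end, because data may still arrive in several chunks
        # (for example when the input is fed in pieces).
        return ''.join(self.chunks)


parser = TextCollector()
parser.feed("<strong>I &lt;3s U &amp; you luvz me</strong>")
parser.close()
text = parser.get_text()
print(text)
```

Joining in get_text rather than relying on a single handle_data call keeps the result the same however the parser happens to chunk the text.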

Alternatively, you can use an external library such as lxml:

>>> import lxml.html
>>> n = "<strong>I &lt;3s U &amp; you luvz me</strong>"
>>> root = lxml.html.fromstring(n)
>>> root.text_content()
'I <3s U & you luvz me'

Remember that the purpose of HTMLParser is to let you build a document tree from an input. If you don't care at all about the document's structure, then the str.join solution @falsetru gives will be fine. You can be certain that all element tags and comments will be filtered out.

However, if you do need the structure for more complex scenarios, then you have to build a document tree. The handle_starttag and handle_endtag methods are there for this.

First we need a basic tree that can hold some information.

class Element:
    def __init__(self, parent, tag, attrs=None):
        self.parent = parent
        self.tag = tag
        self.children = []
        self.attrs = attrs or []
        self.data = ''

Now you need to make the HTMLParser create a new node on every handle_starttag and move up the tree on every handle_endtag. We also pass the parsed data to the current node instead of holding it in the parser.

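from html import unescape
from html.parser import HTMLParser  # Python 3 location of HTMLParser

class MyHTMLParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.root = Element(None, '__DOCROOT__') # Special root node for us
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        # Create a child of the current node and descend into it
        newel = Element(self.current, tag, attrs)
        self.current.children.append(newel)
        self.current = newel

    def handle_endtag(self, tag):
        # Move back up to the parent node
        self.current = self.current.parent

    def handle_data(self, data):
        self.current.data += data

    # On Python 3.5+ references are converted before handle_data is called,
    # so the two handlers below only run with convert_charrefs=False.
    def handle_charref(self, ref): # No changes here
        self.handle_entityref('#' + ref)

    def handle_entityref(self, ref): # Uses html.unescape on Python 3
        self.handle_data(unescape("&%s;" % ref))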

Now you can access the tree through the root attribute of your MyHTMLParser instance and get the data from any element you like. For example:

n = '<strong>I &lt;3s U &amp; you luvz me</strong>'
p = MyHTMLParser()
p.feed(n)
p.close()

def print_tree(node, indent=0):
    print('    ' * indent + node.tag)
    print('    ' * indent + '  ' + node.data)
    for c in node.children:
        print_tree(c, indent + 1)

print_tree(p.root)

This will give you:

__DOCROOT__

    strong
      I <3s U & you luvz me

If instead you parsed n = '<html><head><title>Test</title></head><body><h1>I &lt;3s U &amp; you luvz me</h1></body></html>', you would get:

__DOCROOT__

    html

        head

            title
              Test
        body

            h1
              I <3s U & you luvz me

The next step is to make the tree building robust and handle cases like mismatched or implicit end tags. You will also want to add some nice find('tag')-like methods on Element for traversing the tree. Do it well enough and you'll have made the next BeautifulSoup.
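As a rough sketch of those two next steps, assuming Python 3: the find and find_all methods, the TreeBuilder class name, and the VOID_TAGS set are hypothetical names introduced here, and the void-tag list is deliberately partial. The end-tag handling only covers stray end tags, unclosed elements, and void elements. It does not implement the full HTML rules for implied end tags.

```python
from html.parser import HTMLParser

# Elements that never take an end tag; a partial, illustrative list.
VOID_TAGS = {'br', 'hr', 'img', 'input', 'link', 'meta'}


class Element:
    def __init__(self, parent, tag, attrs=None):
        self.parent = parent
        self.tag = tag
        self.children = []
        self.attrs = attrs or []
        self.data = ''

    def find_all(self, tag):
        """Yield every descendant with the given tag, depth first."""
        for child in self.children:
            if child.tag == tag:
                yield child
            yield from child.find_all(tag)

    def find(self, tag):
        """Return the first matching descendant, or None."""
        return next(self.find_all(tag), None)


class TreeBuilder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.root = Element(None, '__DOCROOT__')
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Element(self.current, tag, attrs)
        self.current.children.append(node)
        if tag not in VOID_TAGS:  # void elements cannot have children
            self.current = node

    def handle_endtag(self, tag):
        # Walk up to the nearest open element with this tag; ignore the
        # end tag entirely if no open ancestor matches.
        node = self.current
        while node is not None and node.tag != tag:
            node = node.parent
        if node is not None and node.parent is not None:
            self.current = node.parent

    def handle_data(self, data):
        self.current.data += data


builder = TreeBuilder()
# </b> is a stray end tag and <p> is never closed explicitly.
builder.feed('<div><p>one<br>two</b></div><h1>I &lt;3s U</h1>')
builder.close()
heading = builder.root.find('h1')
print(heading.data)
```

Walking up to the nearest matching ancestor means an unclosed child is closed implicitly when its parent closes, while an end tag with no open counterpart leaves the current position unchanged.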

You can refer to this answer and adapt its html_to_text function to do what you want.

from HTMLParser import HTMLParser
n = "<strong>I &lt;3s U &amp; you luvz me</strong>"

class MLStripper(HTMLParser):
    def __init__(self):
        self.reset()
        self.fed = []
    def handle_data(self, d):
        self.fed.append(d)
    def handle_entityref(self, name):
        self.fed.append('&%s;' % name)
    def get_data(self):
        return ''.join(self.fed)

def html_to_text(html):
    s = MLStripper()
    s.feed(html)
    return HTMLParser().unescape(s.get_data())

print html_to_text(n)

Output:

I <3s U & you luvz me
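The code above targets Python 2. As a hedged sketch of the same approach on Python 3 (where HTMLParser lives in html.parser and the HTMLParser.unescape method was removed in 3.9), the references can be re-emitted raw and unescaped in one pass with html.unescape. The handle_charref method is an addition not present in the original, included so that numeric references such as &#60; survive as well.

```python
from html import unescape
from html.parser import HTMLParser


class MLStripper(HTMLParser):
    def __init__(self):
        # Keep references raw so they can be unescaped in one pass later.
        super().__init__(convert_charrefs=False)
        self.fed = []

    def handle_data(self, d):
        self.fed.append(d)

    def handle_entityref(self, name):
        # Re-emit named references such as &amp; unchanged.
        self.fed.append('&%s;' % name)

    def handle_charref(self, name):
        # Re-emit numeric references such as &#60; unchanged (added here).
        self.fed.append('&#%s;' % name)

    def get_data(self):
        return ''.join(self.fed)


def html_to_text(html):
    s = MLStripper()
    s.feed(html)
    s.close()
    return unescape(s.get_data())


result = html_to_text("<strong>I &lt;3s U &amp; you luvz me</strong>")
print(result)
```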
