Python: unescape special characters without splitting data
I have made a simple HTML parser which is basically a direct copy from the docs. I am having trouble unescaping special characters without also splitting the data up into multiple chunks.
Here is my code with a simple example:
from HTMLParser import HTMLParser

class MyHTMLParser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.data = []

    def handle_starttag(self, tag, attrs):
        #print (tag,attrs)
        pass

    def handle_endtag(self, tag):
        #print (tag)
        pass

    def handle_data(self, data):
        self.data.append(data)

    def handle_charref(self, ref):
        self.handle_entityref("#" + ref)

    def handle_entityref(self, ref):
        self.handle_data(self.unescape("&%s;" % ref))

n = "<strong>I &lt;3s U &amp; you luvz me</strong>"
parser = MyHTMLParser()
parser.feed(n)
parser.close()
data = parser.data
print(data)
The issue is that this returns 5 separate pieces of data:
['I ', u'<', '3s U ', u'&', ' you luvz me']
What I want instead is a single string:
['I <3s U & you luvz me']
Thanks, JP
Join the list of strings using str.join:
>>> ''.join(['I ', u'<', '3s U ', u'&', ' you luvz me'])
u'I <3s U & you luvz me'
Alternatively, you can use an external library such as lxml:
>>> import lxml.html
>>> n = "<strong>I &lt;3s U &amp; you luvz me</strong>"
>>> root = lxml.html.fromstring(n)
>>> root.text_content()
'I <3s U & you luvz me'
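The str.join idea above can also live inside the parser itself. The following is a minimal sketch for Python 3 only, where the module is html.parser and, as far as I know, the default convert_charrefs=True already unescapes references before handle_data is called. The class name TextCollector and its text property are made up for illustration and are not part of the question's code.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):  # hypothetical name, for illustration only
    def __init__(self):
        # convert_charrefs defaults to True on Python 3.5+, so entities such
        # as &lt; reach handle_data already unescaped.
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Text may still arrive in several pieces (for example around tags).
        self.chunks.append(data)

    @property
    def text(self):
        # Join the collected pieces into a single string.
        return ''.join(self.chunks)


collector = TextCollector()
collector.feed('<strong>I &lt;3s U &amp; you luvz me</strong>')
collector.close()
print(collector.text)
```

Joining in one place keeps the caller from having to know how many chunks the parser happened to produce.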
Remember that the purpose of HTMLParser is to let you build a document tree from an input. If you don't care at all about the document's structure, then the str.join solution @falsetru gives will be fine. You can be certain that all element tags and comments will be filtered out.
However, if you do need the structure for more complex scenarios, then you have to build a document tree. The handle_starttag and handle_endtag methods are there for this.
First we need a basic tree that can hold some information.
class Element:
    def __init__(self, parent, tag, attrs=None):
        self.parent = parent
        self.tag = tag
        self.children = []
        self.attrs = attrs or []
        self.data = ''
Now you need to make the HTMLParser create a new node on every handle_starttag and move up the tree on every handle_endtag. We also pass the parsed data to the current node instead of holding it in the parser.
from html import unescape
from html.parser import HTMLParser

class MyHTMLParser(HTMLParser):
    def __init__(self):
        # Python 3: convert_charrefs=False so the entity handlers below are called
        super().__init__(convert_charrefs=False)
        self.root = Element(None, '__DOCROOT__')  # Special root node for us
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        newel = Element(self.current, tag, attrs)
        self.current.children.append(newel)
        self.current = newel

    def handle_endtag(self, tag):
        self.current = self.current.parent

    def handle_data(self, data):
        self.current.data += data

    def handle_charref(self, ref):
        self.handle_entityref('#' + ref)

    def handle_entityref(self, ref):
        # html.unescape replaces the HTMLParser.unescape method removed in Python 3.9
        self.handle_data(unescape('&%s;' % ref))
Now you can access the tree through the root attribute of your MyHTMLParser instance and get the data from any element you like. For example:
n = '<strong>I &lt;3s U &amp; you luvz me</strong>'
p = MyHTMLParser()
p.feed(n)
p.close()

def print_tree(node, indent=0):
    print(' ' * indent + node.tag)
    if node.data:  # skip nodes that hold no text
        print(' ' * indent + ' ' + node.data)
    for c in node.children:
        print_tree(c, indent + 1)

print_tree(p.root)

This will give you:

__DOCROOT__
 strong
  I <3s U & you luvz me
If instead you parsed

n = '<html><head><title>Test</title></head><body><h1>I &lt;3s U &amp; you luvz me</h1></body></html>'

you would get:

__DOCROOT__
 html
  head
   title
    Test
  body
   h1
    I <3s U & you luvz me
Next up is to make the tree building robust and handle cases like mismatched or implicit end tags. You will also want to add some nice find('tag')-like methods on Element for traversing the tree. Do it well enough and you'll have made the next BeautifulSoup.
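As a rough illustration of those two next steps, here is a sketch for Python 3 that adds a depth-first find method to Element and makes end-tag handling tolerant of unclosed and void elements. Everything beyond the names used in this answer is an assumption made for the example: the find signature, the VOID_TAGS set (deliberately incomplete) and the TreeBuilder class name are hypothetical, and real-world HTML recovery rules are considerably more involved than this.

```python
from html.parser import HTMLParser

# Hypothetical, incomplete set of elements that never take an end tag.
VOID_TAGS = {'br', 'hr', 'img', 'input', 'link', 'meta'}


class Element:
    def __init__(self, parent, tag, attrs=None):
        self.parent = parent
        self.tag = tag
        self.children = []
        self.attrs = attrs or []
        self.data = ''

    def find(self, tag):
        """Return the first descendant with this tag (depth-first), or None."""
        for child in self.children:
            if child.tag == tag:
                return child
            found = child.find(tag)
            if found is not None:
                return found
        return None


class TreeBuilder(HTMLParser):  # hypothetical name, for illustration only
    def __init__(self):
        super().__init__()  # default convert_charrefs=True: text arrives unescaped
        self.root = Element(None, '__DOCROOT__')
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Element(self.current, tag, attrs)
        self.current.children.append(node)
        if tag not in VOID_TAGS:  # do not descend into void elements
            self.current = node

    def handle_endtag(self, tag):
        # Walk up to the nearest open element with this tag, implicitly
        # closing anything left open below it; ignore stray end tags.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        self.current.data += data


builder = TreeBuilder()
builder.feed('<html><body><h1>I &lt;3s U &amp; you luvz me<br></h1>'
             '<p>unclosed</body></html>')
builder.close()
print(builder.root.find('h1').data)
```

Walking up to the nearest matching ancestor is only one possible recovery policy; it keeps the tree consistent when a p is left open, but it does not try to reproduce how browsers repair markup.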
You can refer to this answer and edit the html_to_text function to do what you want.
# Python 2
from HTMLParser import HTMLParser

n = "<strong>I &lt;3s U &amp; you luvz me</strong>"

class MLStripper(HTMLParser):
    def __init__(self):
        self.reset()
        self.fed = []

    def handle_data(self, d):
        self.fed.append(d)

    def handle_entityref(self, name):
        # Keep the entity escaped for now; it is unescaped once at the end
        self.fed.append('&%s;' % name)

    def get_data(self):
        return ''.join(self.fed)

def html_to_text(html):
    s = MLStripper()
    s.feed(html)
    return HTMLParser().unescape(s.get_data())

print html_to_text(n)
Output:
I <3s U & you luvz me
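The code above targets Python 2. A possible Python 3 port of the same "collect first, unescape once at the end" idea is sketched below. It assumes html.unescape as the replacement for the old HTMLParser.unescape method and passes convert_charrefs=False so that the reference handlers are actually invoked. The handle_charref method is an addition that the original snippet does not have; it is included so that numeric references such as &#9829; are not dropped.

```python
from html import unescape
from html.parser import HTMLParser


class MLStripper(HTMLParser):
    def __init__(self):
        # convert_charrefs=False: references are reported to the handlers
        # below instead of being unescaped by the parser itself.
        super().__init__(convert_charrefs=False)
        self.fed = []

    def handle_data(self, d):
        self.fed.append(d)

    def handle_entityref(self, name):
        # Keep named references escaped for now, e.g. '&amp;'.
        self.fed.append('&%s;' % name)

    def handle_charref(self, name):
        # Addition for this sketch: keep numeric references too, e.g. '&#9829;'.
        self.fed.append('&#%s;' % name)

    def get_data(self):
        return ''.join(self.fed)


def html_to_text(html):
    s = MLStripper()
    s.feed(html)
    s.close()
    # Unescape the joined text in a single pass.
    return unescape(s.get_data())


print(html_to_text('<strong>I &lt;3s U &amp; you luvz me &#9829;</strong>'))
```

Deferring the unescape to the very end means the text is joined before any reference is decoded, so the result is always one string regardless of how the parser chunked it.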