In Python, how do I remove the “root” tag in an HTML snippet?
Suppose I have an HTML snippet like this:
<div>
Hello <strong>There</strong>
<div>I think <em>I am</em> feeing better!</div>
<div>Don't you?</div>
Yup!
</div>
What's the best/most robust way to remove the surrounding root element, so it looks like this:
Hello <strong>There</strong>
<div>I think <em>I am</em> feeing better!</div>
<div>Don't you?</div>
Yup!
I've tried using lxml.html like this:
lxml.html.fromstring(fragment_string).drop_tag()
But that only gives me "Hello", which I guess makes sense. Any better ideas?
This is a bit odd in lxml (or ElementTree). You'd have to do:
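from lxml.html import tostring
def inner_html(el):
    # el.text is the text before the first child; each child is serialized together with its tail text
    return (el.text or '') + ''.join(tostring(child, encoding='unicode') for child in el)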
Note that lxml (and ElementTree) have no special way to represent a document other than as a tree rooted in a single element, but .drop_tag() would work the way you want if that <div> wasn't the root element.
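To illustrate the text/tail model behind inner_html, here is a minimal sketch that uses the standard library's xml.etree.ElementTree instead of lxml. It assumes the fragment is well-formed XML, which holds for the snippet in the question but not for HTML in general; inner_xml and fragment are names chosen for this example only.

```python
import xml.etree.ElementTree as ET


def inner_xml(el):
    """Return the serialized contents of el without el's own start and end tags."""
    # el.text holds the text before the first child. ET.tostring() writes each
    # child followed by its tail, i.e. the text between that child and the next node.
    return (el.text or '') + ''.join(
        ET.tostring(child, encoding='unicode') for child in el
    )


fragment = """<div>
Hello <strong>There</strong>
<div>I think <em>I am</em> feeing better!</div>
<div>Don't you?</div>
Yup!
</div>"""

root = ET.fromstring(fragment)
result = inner_xml(root)
print(result)
```

In this tree "Hello" lives in root.text, while "Yup!" is the tail of the last inner <div>, which is why both pieces have to be collected explicitly.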
You can use the BeautifulSoup package. For this particular HTML I would do it like this:
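from bs4 import BeautifulSoup  # BeautifulSoup 4 (the bs4 package)
html = """<div>
Hello <strong>There</strong>
<div>I think <em>I am</em> feeing better!</div>
<div>Don't you?</div>
Yup!
</div>"""
bs = BeautifulSoup(html, 'html.parser')
# Serialize every child node of the root <div>; joining with '' keeps the original whitespace
no_root = ''.join(str(node) for node in bs.div.contents)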
BeautifulSoup has many nice features that will allow you to tweak this example for many other cases. Full documentation: http://www.crummy.com/software/BeautifulSoup/documentation.html
For such a simple task you can use a regexp like r'<(.*?)>(.*)</\1>' and take group 2 (\2 in Perl terms) from the match. You should also set the m and s flags (re.M | re.S in Python) so that the match works across multiple lines; re.S is the one that lets . match newlines.
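A rough sketch of the regexp approach with the standard re module might look like this. ROOT_RE and strip_root are names made up for this example, and the fallback of returning the input unchanged when nothing matches is an assumption, not part of the answer above.

```python
import re

# Group 1 captures the root tag name; group 2 captures everything up to the
# last closing tag with that same name. re.S lets '.' match newlines.
ROOT_RE = re.compile(r'<(.*?)>(.*)</\1>', re.M | re.S)


def strip_root(html):
    """Return the contents of the outermost element, or html unchanged if no match."""
    match = ROOT_RE.match(html.strip())
    return match.group(2) if match else html


fragment = """<div>
Hello <strong>There</strong>
<div>I think <em>I am</em> feeing better!</div>
<div>Don't you?</div>
Yup!
</div>"""

result = strip_root(fragment)
print(result)
```

Because of the \1 backreference, a root tag that carries attributes would generally not match this pattern, so for anything beyond simple snippets a real parser is the safer choice.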