In Python, how do I remove the "root" tag from an HTML snippet?

Question

Suppose I have an HTML snippet like this:

<div>
  Hello <strong>There</strong>
  <div>I think <em>I am</em> feeing better!</div>
  <div>Don't you?</div>
  Yup!
</div>

What's the best/most robust way to remove the surrounding root element, so that it looks like this:

Hello <strong>There</strong>
<div>I think <em>I am</em> feeing better!</div>
<div>Don't you?</div>
Yup!

I've tried using lxml.html like this:

lxml.html.fromstring(fragment_string).drop_tag()

But that only gives me "Hello", which I guess makes sense. Any better ideas?

Answer 1

This is a bit odd in lxml (or ElementTree). You'd have to do:

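from lxml.html import tostring
def inner_html(el):
    # el.text is the text before the first child; tostring() also emits each child's tail text
    return (el.text or '') + ''.join(tostring(child, encoding='unicode') for child in el)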

Note that lxml (and ElementTree) have no way to represent a document other than as a tree rooted at a single element. .drop_tag() would work the way you want if that <div> weren't the root element.
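
Since the answer says ElementTree behaves the same way, here is a minimal sketch of the same idea using only the standard library's xml.etree.ElementTree. It assumes the fragment is well-formed XML, which happens to be true for the snippet in the question but often is not for real-world HTML (lxml.html is the more forgiving parser). The name inner_xml is made up for this illustration.

```python
import xml.etree.ElementTree as ET

def inner_xml(el):
    """Serialize everything inside el, leaving out el's own tags."""
    # el.text is the text before the first child; ET.tostring() emits each
    # child together with its tail (the text that follows its closing tag).
    return (el.text or '') + ''.join(
        ET.tostring(child, encoding='unicode') for child in el)

fragment = """<div>
  Hello <strong>There</strong>
  <div>I think <em>I am</em> feeing better!</div>
  <div>Don't you?</div>
  Yup!
</div>"""

# Assumption: the fragment parses as XML; ET.fromstring raises ParseError otherwise.
root = ET.fromstring(fragment)
inner = inner_xml(root)
print(inner)
```

The result keeps the original whitespace between the children, so call inner.strip() if the leading and trailing newlines are unwanted.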

Answer 2

You can use the BeautifulSoup package. For this particular HTML I would do it like this:

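# BeautifulSoup 4 on Python 3 (the answer was originally written for BeautifulSoup 3 on Python 2)
from bs4 import BeautifulSoup

html = """<div>
  Hello <strong>There</strong>
  <div>I think <em>I am</em> feeing better!</div>
  <div>Don't you?</div>
  Yup!
</div>"""

bs = BeautifulSoup(html, 'html.parser')

# bs.div is the outermost <div>; .contents lists its direct children (text nodes and tags).
# The text nodes already carry the original whitespace, so join without a separator.
no_root = ''.join(map(str, bs.div.contents))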

BeautifulSoup has many nice features that will let you adapt this example to many other cases. Full documentation: http://www.crummy.com/software/BeautifulSoup/documentation.html

Answer 3

For such a simple task you can use a regexp like r'<(.*?)>(.*)</\1>' and take group 2 (\2 in Perl terms) from the match.

You should also set flags such as m and s (re.M and re.S in Python) so that it works correctly across multiple lines.
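
A rough sketch of how that could look with Python's re module, using a single backslash in the backreference. The helper name strip_root and the fall-back of returning the input unchanged are assumptions for illustration, not part of the answer. Note that this only handles a root tag without attributes, since group 1 captures everything between < and > and must reappear verbatim in the closing tag.

```python
import re

# Group 1 is the root tag name, group 2 is everything between the opening
# tag and the last matching closing tag. re.S lets "." match newlines.
ROOT_PATTERN = re.compile(r'<(.*?)>(.*)</\1>', re.M | re.S)

def strip_root(fragment):
    """Return the contents of the outermost tag, or the input unchanged if nothing matches."""
    match = ROOT_PATTERN.match(fragment.strip())
    return match.group(2) if match else fragment

html = """<div>
  Hello <strong>There</strong>
  <div>I think <em>I am</em> feeing better!</div>
  <div>Don't you?</div>
  Yup!
</div>"""

inner = strip_root(html)
print(inner)
```

Because (.*) is greedy, the match runs to the last </div> in the string, which is what keeps the nested <div> elements intact here. For anything less regular than this snippet, a real parser is the safer choice.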

Answer 1: score 6, accepted, 2010-06-09 04:17:59
Answer 2: score 1, 2010-06-09 15:48:51
Answer 3: score 0, 2010-06-09 08:13:58
