In Python, how do I remove the "root" tag from an HTML snippet?

Question

Suppose I have an HTML snippet like this:

<div>
  Hello <strong>There</strong>
  <div>I think <em>I am</em> feeing better!</div>
  <div>Don't you?</div>
  Yup!
</div>

What is the best/most robust way to remove the surrounding root element, so that it looks like this:

Hello <strong>There</strong>
<div>I think <em>I am</em> feeing better!</div>
<div>Don't you?</div>
Yup!

I tried using lxml.html like this:

lxml.html.fromstring(fragment_string).drop_tag()

But that only gives me "Hello", which I guess makes sense. Any better ideas?

Answer 1

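This is a bit awkward in lxml (or ElementTree). You have to do something like this:

from lxml.html import tostring  # serializes an element plus its tail text

def inner_html(el):
    return (el.text or '') + ''.join(tostring(child, encoding='unicode') for child in el)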

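Note that lxml (and ElementTree) have no special way to represent a document other than as rooted in a single element. If the <div> were not the root element, .drop_tag() would work the way you want.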
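To make the idea above concrete, here is a minimal sketch that uses the standard library's xml.etree.ElementTree instead of lxml, so it runs without extra packages. It assumes the fragment is well-formed XML, which ElementTree requires and lxml.html does not; the variable names fragment and result are illustrative only.

```python
import xml.etree.ElementTree as ET


def inner_html(el):
    # el.text is the text before the first child element.
    # ET.tostring() serializes each child together with its tail text,
    # so the text between and after the children is preserved.
    return (el.text or '') + ''.join(
        ET.tostring(child, encoding='unicode', method='html') for child in el
    )


fragment = """<div>
  Hello <strong>There</strong>
  <div>I think <em>I am</em> feeing better!</div>
  <div>Don't you?</div>
  Yup!
</div>"""

root = ET.fromstring(fragment)  # the outer <div> becomes the root element
result = inner_html(root)
print(result)
```

The printed text starts with the "Hello" text node and ends with "Yup!", with the outer <div> gone and the inner <div> elements intact.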

Answer 2

You can use the BeautifulSoup package. For this particular HTML, I would do this:

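from bs4 import BeautifulSoup  # BeautifulSoup 4; the original used the BeautifulSoup 3 import

html = """<div>
  Hello <strong>There</strong>
  <div>I think <em>I am</em> feeing better!</div>
  <div>Don't you?</div>
  Yup!
</div>"""

# Parse with the standard-library parser so no wrapper elements are added
bs = BeautifulSoup(html, "html.parser")

# Serialize each child of the root <div>; text nodes already carry the newlines
no_root = ''.join(map(str, bs.div.contents))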

BeautifulSoup has many nice features that let you adapt this example to many other cases. Full documentation: http://www.crummy.com/software/BeautifulSoup/documentation.html

Answer 3

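For such a simple task you could use a regexp like r'<(.*?)>(.*)</\1>' and take match group 2 from it (\2 in Perl).

You should also set the m and s flags so that it works correctly across multiple lines.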
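As a rough sketch of the regexp approach with Python's re module: the s flag corresponds to re.DOTALL and the m flag to re.MULTILINE. The names fragment, pattern, and inner are illustrative. It assumes the root tag has no attributes; with something like <div class="x"> the backreference would no longer match the closing tag.

```python
import re

fragment = """<div>
  Hello <strong>There</strong>
  <div>I think <em>I am</em> feeing better!</div>
  <div>Don't you?</div>
  Yup!
</div>"""

# Pattern from the answer: group 1 is the root tag name, group 2 is its content.
# re.DOTALL lets "." match newlines. re.MULTILINE is included because the answer
# asks for it, although this pattern has no ^ or $ anchors for it to affect.
pattern = re.compile(r'<(.*?)>(.*)</\1>', re.DOTALL | re.MULTILINE)

match = pattern.match(fragment)
# Fall back to the unchanged fragment if the pattern does not match.
inner = match.group(2) if match else fragment
print(inner)
```

The greedy (.*) runs to the last closing tag that matches group 1, which is why the nested <div> elements stay inside the result.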

Answer 1
6, accepted, 2010-06-09 04:17:59

Answer 2
1, 2010-06-09 15:48:51

Answer 3
0, 2010-06-09 08:13:58
