How to strip all child tags in an XML tag but leave the text merged into the parent, using lxml in Python?

How do I tell etree.strip_tags() to strip all possible tags from a given tag element?

Do I have to map them myself, like this:

STRIP_TAGS = [ALL TAGS...] # Is there a built in list or dictionary in lxml
                           # that gives you all tags?
etree.strip_tags(tag, *STRIP_TAGS)

Or is there a more elegant approach that I don't know about?

Example input:

parent_tag = "<parent>This is some <i>text</i> with multiple <some_tag>tags</some_tag> and sometimes they <tt>are<bold> nested</bold></tt>.</parent>"

Desired output:

# <parent>This is some text with multiple tags and sometimes they are nested.</parent>

Or even better:

This is some text with multiple tags and sometimes they are nested.

You can use the lxml.html.clean module (in lxml 5.2 and later it is provided by the separate lxml_html_clean package):

import lxml.html, lxml.html.clean


s = '<parent>This is some <i>text</i> with multiple <some_tag>tags</some_tag> and sometimes they <tt>are<bold> nested</bold></tt>.</parent>'

tree = lxml.html.fromstring(s)
cleaner = lxml.html.clean.Cleaner(allow_tags=['parent'], remove_unknown_tags=False)
cleaned_tree = cleaner.clean_html(tree)

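# encoding="unicode" returns a str rather than bytes, so this prints cleanly on Python 3
print(lxml.etree.tostring(cleaned_tree, encoding="unicode"))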
# <parent>This is some text with multiple tags and sometimes they are nested.</parent>

This answer is a bit late, but a simpler solution than the one in ars's original answer may be worth keeping around.

Short answer

Pass "*" as the tag argument when calling strip_tags() to strip all tags.

Long answer

Given your XML string, we can create an lxml Element:

>>> import lxml.etree
>>> s = "<parent>This is some <i>text</i> with multiple <some_tag>tags</some_tag> and sometimes they <tt>are<bold> nested</bold></tt>.</parent>"
>>> parent_tag = lxml.etree.fromstring(s)

You can inspect that instance like this:

>>> parent_tag
<Element parent at 0x5f9b70>
>>> lxml.etree.tostring(parent_tag)
b'<parent>This is some <i>text</i> with multiple <some_tag>tags</some_tag> and sometimes they <tt>are<bold> nested</bold></tt>.</parent>'

To strip every tag except the parent tag itself, use the etree.strip_tags() function as you suggested, but with the "*" argument:

>>> lxml.etree.strip_tags(parent_tag, "*")

Inspecting the element again shows that all child tags are gone:

>>> lxml.etree.tostring(parent_tag)
b'<parent>This is some text with multiple tags and sometimes they are nested.</parent>'

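This is your desired output. Note that strip_tags() modifies the lxml Element instance in place! To make it even better (as you asked :-)), just read the text attribute: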

>>> parent_tag.text
'This is some text with multiple tags and sometimes they are nested.'
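
Because strip_tags() changes the element in place, it may help to see how to get the same text without touching the original tree. The following is only a sketch, not part of the original answer: it assumes the same input string as above, and the names plain_text and stripped_copy are made up for illustration. It relies on Element.itertext() and copy.deepcopy(), both of which lxml elements support.

```python
import copy

import lxml.etree

s = ("<parent>This is some <i>text</i> with multiple <some_tag>tags</some_tag> "
     "and sometimes they <tt>are<bold> nested</bold></tt>.</parent>")
parent_tag = lxml.etree.fromstring(s)

# Option 1: join all text nodes in document order; the tree is left untouched.
plain_text = "".join(parent_tag.itertext())

# Option 2: strip the tags on a deep copy so the original element keeps its children.
stripped_copy = copy.deepcopy(parent_tag)
lxml.etree.strip_tags(stripped_copy, "*")

print(plain_text)
print(len(parent_tag))     # number of direct children still on the original
print(len(stripped_copy))  # number of direct children left on the copy
```

Option 1 is enough when only the text is needed; option 2 is the one to reach for when you still want an Element (for example to serialize the <parent> wrapper) but also need the original structure later.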
