How to get all strings from all nested tags of an XML tag with Python's lxml.etree library?
How to strip all child tags in an XML tag but leave the text merged into the parent, using lxml in Python?
How do I tell etree.strip_tags() to strip all possible tags from a given tag element?
Do I have to map them myself, for example:
STRIP_TAGS = [ALL TAGS...] # Is there a built in list or dictionary in lxml
# that gives you all tags?
etree.strip_tags(tag, *STRIP_TAGS)
Maybe there is a more elegant approach that I don't know about?
Example input:
parent_tag = "<parent>This is some <i>text</i> with multiple <some_tag>tags</some_tag> and sometimes they <tt>are<bold> nested</bold></tt>.</parent>"
Desired output:
# <parent>This is some text with multiple tags and sometimes they are nested.</parent>
Or even better:
This is some text with multiple tags and sometimes they are nested.
You can use the lxml.html.clean module:
# Note: since lxml 5.2 the clean module is distributed separately as the lxml_html_clean package
import lxml.html, lxml.html.clean
s = '<parent>This is some <i>text</i> with multiple <some_tag>tags</some_tag> and sometimes they <tt>are<bold> nested</bold></tt>.</parent>'
tree = lxml.html.fromstring(s)
cleaner = lxml.html.clean.Cleaner(allow_tags=['parent'], remove_unknown_tags=False)
cleaned_tree = cleaner.clean_html(tree)
print(lxml.etree.tostring(cleaned_tree, encoding="unicode"))
# <parent>This is some text with multiple tags and sometimes they are nested.</parent>
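This answer is a bit late, but I think a simpler solution than the one in ars's original answer is worth keeping around.
Short answer
Pass "*" as the argument when calling strip_tags() to specify that all tags should be stripped.
Long answer
Given your XML string, we can create an lxml element: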
>>> import lxml.etree
>>> s = "<parent>This is some <i>text</i> with multiple <some_tag>tags</some_tag> and sometimes they <tt>are<bold> nested</bold></tt>.</parent>"
>>> parent_tag = lxml.etree.fromstring(s)
You can inspect that instance like this:
>>> parent_tag
<Element parent at 0x5f9b70>
>>> lxml.etree.tostring(parent_tag)
b'<parent>This is some <i>text</i> with multiple <some_tag>tags</some_tag> and sometimes they <tt>are<bold> nested</bold></tt>.</parent>'
To strip all tags except the parent tag itself, use the etree.strip_tags() function as you suggested, but with "*" as the argument:
>>> lxml.etree.strip_tags(parent_tag, "*")
Inspecting the element shows that all child tags are gone:
>>> lxml.etree.tostring(parent_tag)
b'<parent>This is some text with multiple tags and sometimes they are nested.</parent>'
This is the output you asked for. Note that this modifies the lxml Element instance in place! To make it even better (as you asked :-)), just read the text attribute:
>>> parent_tag.text
'This is some text with multiple tags and sometimes they are nested.'
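If you would rather leave the element untouched, one possible alternative (not part of the answer above) is to join the fragments returned by itertext(). This is only a minimal sketch: the stdlib fallback import and the names plain_text and remaining_children are my own, chosen for illustration.

```python
try:
    from lxml import etree  # the library discussed in this thread
except ImportError:
    # Assumption: fall back to the standard library, which also provides itertext()
    import xml.etree.ElementTree as etree

s = ("<parent>This is some <i>text</i> with multiple <some_tag>tags</some_tag> "
     "and sometimes they <tt>are<bold> nested</bold></tt>.</parent>")
parent_tag = etree.fromstring(s)

# itertext() yields the text and tail fragments of the element and its
# descendants in document order, so joining them gives the flattened string.
plain_text = "".join(parent_tag.itertext())
print(plain_text)

# Unlike strip_tags(), this does not modify the tree: the direct children
# (<i>, <some_tag>, <tt>) are still attached to the parent.
remaining_children = len(parent_tag)
```

The trade-off is that you get only the string, not a cleaned-up element. If you still need the `<parent>` wrapper in the serialized output, strip_tags() with "*" remains the way to go.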