How to strip all child tags in an XML tag but leave the text merged into the parent, using lxml in Python?

How can I tell etree.strip_tags() to strip all possible tags from a given tag element?

Do I have to map them myself, like this:

STRIP_TAGS = [ALL TAGS...] # Is there a built in list or dictionary in lxml
                           # that gives you all tags?
etree.strip_tags(tag, *STRIP_TAGS)

Or is there a more elegant approach that I am not aware of?

Example input:

parent_tag = "<parent>This is some <i>text</i> with multiple <some_tag>tags</some_tag> and sometimes they <tt>are<bold> nested</bold></tt>.</parent>"

Desired output:

# <parent>This is some text with multiple tags and sometimes they are nested.</parent>

Or even better:

This is some text with multiple tags and sometimes they are nested.

You can use the lxml.html.clean module:

# Note: since lxml 5.2, lxml.html.clean requires the separate lxml_html_clean package
import lxml.html, lxml.html.clean


s = '<parent>This is some <i>text</i> with multiple <some_tag>tags</some_tag> and sometimes they <tt>are<bold> nested</bold></tt>.</parent>'

tree = lxml.html.fromstring(s)
cleaner = lxml.html.clean.Cleaner(allow_tags=['parent'], remove_unknown_tags=False)
cleaned_tree = cleaner.clean_html(tree)

print(lxml.etree.tostring(cleaned_tree, encoding="unicode"))
# <parent>This is some text with multiple tags and sometimes they are nested.</parent>

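This answer is a bit late, but I thought a simpler solution than the one in ars's original answer might be worth keeping around.

Short answer

Pass "*" as the tag argument when calling strip_tags() to specify that all tags should be stripped.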
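As a minimal sketch of the short answer, the script below strips every descendant tag while keeping the text. The variable names `xml`, `root`, and `result` are illustrative choices, not part of the question's code.

```python
import lxml.etree

xml = (
    "<parent>This is some <i>text</i> with multiple "
    "<some_tag>tags</some_tag> and sometimes they "
    "<tt>are<bold> nested</bold></tt>.</parent>"
)
root = lxml.etree.fromstring(xml)

# "*" matches every element tag; the element passed in is itself never removed.
lxml.etree.strip_tags(root, "*")

# encoding="unicode" returns a str instead of bytes.
result = lxml.etree.tostring(root, encoding="unicode")
print(result)
```

The stripped elements' text and tail content is merged into the parent, so the sentence stays intact.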

Long answer

Given your XML string, we can create an lxml element:

>>> import lxml.etree
>>> s = "<parent>This is some <i>text</i> with multiple <some_tag>tags</some_tag> and sometimes they <tt>are<bold> nested</bold></tt>.</parent>"
>>> parent_tag = lxml.etree.fromstring(s)

You can inspect the instance like this:

>>> parent_tag
<Element parent at 0x5f9b70>
>>> lxml.etree.tostring(parent_tag)
b'<parent>This is some <i>text</i> with multiple <some_tag>tags</some_tag> and sometimes they <tt>are<bold> nested</bold></tt>.</parent>'

To strip all tags except the parent tag itself, use the etree.strip_tags() function as you suggested, but with the "*" argument:

>>> lxml.etree.strip_tags(parent_tag, "*")

Inspecting the element again shows that all child tags are gone:

>>> lxml.etree.tostring(parent_tag)
b'<parent>This is some text with multiple tags and sometimes they are nested.</parent>'

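This is the output you wanted. Note that strip_tags() modifies the lxml Element instance in place! To make it even better (as you asked :-)), just read the text attribute. Since no child elements remain, the whole string now lives in the parent's text: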

>>> parent_tag.text
'This is some text with multiple tags and sometimes they are nested.'
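If the original tree has to stay intact, there are two ways around the in-place modification. This is a hedged sketch, not part of the original answer, and the names `original`, `text_only`, and `stripped` are illustrative. One option reads the text with `itertext()` and leaves the tree alone. The other strips a deep copy.

```python
import copy
import lxml.etree

s = (
    "<parent>This is some <i>text</i> with multiple "
    "<some_tag>tags</some_tag> and sometimes they "
    "<tt>are<bold> nested</bold></tt>.</parent>"
)
original = lxml.etree.fromstring(s)

# Option 1: join all text and tail strings in document order; the tree is untouched.
text_only = "".join(original.itertext())

# Option 2: strip the tags on a deep copy so the original keeps its children.
stripped = copy.deepcopy(original)
lxml.etree.strip_tags(stripped, "*")

print(text_only)
print(len(original), len(stripped))  # number of child elements in each tree
```

`itertext()` is enough when only the plain string is needed. The deep copy is useful when a tag-free element is needed alongside the original.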
