How do I strip all child tags from an XML element but merge their text into the parent, using lxml in Python?

Question

How can one tell etree.strip_tags() to strip all possible tags from a given element?

Do I have to list them myself, like:

STRIP_TAGS = [ALL TAGS...] # Is there a built in list or dictionary in lxml
                           # that gives you all tags?
etree.strip_tags(tag, *STRIP_TAGS)

Perhaps there is a more elegant approach I don't know of?

Example input:

parent_tag = "<parent>This is some <i>text</i> with multiple <some_tag>tags</some_tag> and sometimes they <tt>are<bold> nested</bold></tt>.</parent>"

Desired output:

# <parent>This is some text with multiple tags and sometimes they are nested.</parent>

or even better:

This is some text with multiple tags and sometimes they are nested.

Answer 1

You can use the lxml.html.clean module (note that newer lxml releases ship it separately as the lxml_html_clean package):

import lxml.html, lxml.html.clean


s = '<parent>This is some <i>text</i> with multiple <some_tag>tags</some_tag> and sometimes they <tt>are<bold> nested</bold></tt>.</parent>'

tree = lxml.html.fromstring(s)
cleaner = lxml.html.clean.Cleaner(allow_tags=['parent'], remove_unknown_tags=False)
cleaned_tree = cleaner.clean_html(tree)

print(lxml.etree.tostring(cleaned_tree, encoding='unicode'))
# <parent>This is some text with multiple tags and sometimes they are nested.</parent>

Answer 2

This answer is a bit late, but a simpler solution than the one in ars's initial answer may be worth recording.

Short Answer

Pass "*" as the tag argument when you call strip_tags() to strip all tags.

Long Answer

Given your XML string, we can create an lxml Element:

>>> import lxml.etree
>>> s = "<parent>This is some <i>text</i> with multiple <some_tag>tags</some_tag> and sometimes they <tt>are<bold> nested</bold></tt>.</parent>"
>>> parent_tag = lxml.etree.fromstring(s)

You can inspect that instance like so:

>>> parent_tag
<Element parent at 0x5f9b70>
>>> lxml.etree.tostring(parent_tag)
b'<parent>This is some <i>text</i> with multiple <some_tag>tags</some_tag> and sometimes they <tt>are<bold> nested</bold></tt>.</parent>'

To strip out all the tags except the parent tag itself, use the etree.strip_tags() function as you suggested, but with a "*" argument:

>>> lxml.etree.strip_tags(parent_tag, "*")

Inspection shows that all child tags are gone:

>>> lxml.etree.tostring(parent_tag)
b'<parent>This is some text with multiple tags and sometimes they are nested.</parent>'

This is your desired output. Note that strip_tags() modifies the lxml Element instance in place! To make it even better (as you asked :-)), just grab the text property:

>>> parent_tag.text
'This is some text with multiple tags and sometimes they are nested.'
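If the original tree must stay intact, the text can also be collected without stripping anything. The following is a minimal sketch, not part of the original answer: it joins the pieces yielded by itertext(), and the helper name text_only is hypothetical. The import fallback to the standard library's ElementTree (which also provides itertext()) is only an assumption made so the sketch runs where lxml is not installed.

```python
# Prefer lxml; fall back to the standard library so the sketch still runs
# without lxml (assumption for illustration only).
try:
    from lxml import etree
except ImportError:
    import xml.etree.ElementTree as etree

s = ("<parent>This is some <i>text</i> with multiple <some_tag>tags</some_tag> "
     "and sometimes they <tt>are<bold> nested</bold></tt>.</parent>")
parent_tag = etree.fromstring(s)


def text_only(element):
    """Hypothetical helper: concatenate all text in document order.

    itertext() yields the text and tail strings of the element and its
    descendants; the element itself is not modified.
    """
    return "".join(element.itertext())


plain = text_only(parent_tag)
print(plain)
# The child elements are still present after extracting the text.
print(len(parent_tag))
```

Unlike strip_tags(), this leaves the children of parent_tag in place. If the stripped tree itself is needed, copying the element with copy.deepcopy() before calling strip_tags() is another option.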

Answer 1
5 2011-07-07 19:52:16

Answer 2
2 2014-04-18 16:21:09
