How to strip all child tags in an XML tag but leave the text merged into the parent, using lxml in Python?
How can one tell etree.strip_tags() to strip all possible tags from a given tag element?
Do I have to map them myself, like:
STRIP_TAGS = [ALL TAGS...] # Is there a built in list or dictionary in lxml
# that gives you all tags?
etree.strip_tags(tag, *STRIP_TAGS)
Perhaps there is a more elegant approach I don't know of?
Example input:
parent_tag = "<parent>This is some <i>text</i> with multiple <some_tag>tags</some_tag> and sometimes they <tt>are<bold> nested</bold></tt>.</parent>"
Desired output:
# <parent>This is some text with multiple tags and sometimes they are nested.</parent>
or even better:
This is some text with multiple tags and sometimes they are nested.
You can use the lxml.html.clean module:
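# Note: in lxml 5.2 and later, lxml.html.clean is provided by the separate
# lxml_html_clean package, which must be installed for this import to work.
import lxml.html, lxml.html.clean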
s = '<parent>This is some <i>text</i> with multiple <some_tag>tags</some_tag> and sometimes they <tt>are<bold> nested</bold></tt>.</parent>'
tree = lxml.html.fromstring(s)
cleaner = lxml.html.clean.Cleaner(allow_tags=['parent'], remove_unknown_tags=False)
cleaned_tree = cleaner.clean_html(tree)
print(lxml.etree.tostring(cleaned_tree, encoding="unicode"))
# <parent>This is some text with multiple tags and sometimes they are nested.</parent>
This answer is a bit late, but a simpler solution than the one in the earlier answer by ars seems worth recording here.
Short Answer
Use the "*" argument when you call strip_tags() to specify that all tags should be stripped.
Long Answer
Given your XML string, we can create an lxml Element:
>>> import lxml.etree
>>> s = "<parent>This is some <i>text</i> with multiple <some_tag>tags</some_tag> and sometimes they <tt>are<bold> nested</bold></tt>.</parent>"
>>> parent_tag = lxml.etree.fromstring(s)
You can inspect that instance like so:
>>> parent_tag
<Element parent at 0x5f9b70>
>>> lxml.etree.tostring(parent_tag)
b'<parent>This is some <i>text</i> with multiple <some_tag>tags</some_tag> and sometimes they <tt>are<bold> nested</bold></tt>.</parent>'
To strip out all the tags except the parent tag itself, use the etree.strip_tags() function as you suggested, but with a "*" argument:
>>> lxml.etree.strip_tags(parent_tag, "*")
Inspection shows that all child tags are gone:
>>> lxml.etree.tostring(parent_tag)
b'<parent>This is some text with multiple tags and sometimes they are nested.</parent>'
That is your desired output. Note that strip_tags() modifies the lxml Element instance in place! To make it even better (as you asked :-)), just read the text property, which now holds the whole string because no child elements remain:
>>> parent_tag.text
'This is some text with multiple tags and sometimes they are nested.'
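If the original tree has to stay intact, the in-place behaviour noted above can be avoided. The following is only a minimal sketch, not part of the original answer: it assumes lxml is installed, and the variable names plain_text and stripped_copy are made up for illustration. It shows two options: joining the text nodes with itertext(), and running strip_tags() on a deep copy.

```python
import copy

from lxml import etree

s = ("<parent>This is some <i>text</i> with multiple <some_tag>tags</some_tag>"
     " and sometimes they <tt>are<bold> nested</bold></tt>.</parent>")
parent_tag = etree.fromstring(s)

# Option 1: join all text nodes in document order; the tree is not modified.
plain_text = "".join(parent_tag.itertext())

# Option 2: strip the tags on a deep copy so the original keeps its children.
stripped_copy = copy.deepcopy(parent_tag)
etree.strip_tags(stripped_copy, "*")

print(plain_text)
print(etree.tostring(stripped_copy, encoding="unicode"))
print(len(parent_tag))  # direct children still present in the original element
```

Option 1 is enough when only the plain string is needed; Option 2 fits better when you still want an Element (here the bare parent tag) to serialize or process further.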