How to remove an element in lxml

I need to completely remove elements, based on the contents of an attribute, using Python's lxml. Example:

import lxml.etree as et

xml="""
<groceries>
  <fruit state="rotten">apple</fruit>
  <fruit state="fresh">pear</fruit>
  <fruit state="fresh">starfruit</fruit>
  <fruit state="rotten">mango</fruit>
  <fruit state="fresh">peach</fruit>
</groceries>
"""

tree=et.fromstring(xml)

for bad in tree.xpath("//fruit[@state='rotten']"):
    # remove this element from the tree
    pass

print(et.tostring(tree, pretty_print=True).decode())

I would like this to print:

<groceries>
  <fruit state="fresh">pear</fruit>
  <fruit state="fresh">starfruit</fruit>
  <fruit state="fresh">peach</fruit>
</groceries>

Is there a way to do this without building the output manually in a temporary variable, like this:

newxml = "<groceries>\n"
for elt in tree.xpath("//fruit[@state='fresh']"):
    # encoding="unicode" returns str, so it can be appended to newxml
    newxml += et.tostring(elt, encoding="unicode")

newxml += "</groceries>"

Use the remove method of an XML element:

tree=et.fromstring(xml)

for bad in tree.xpath("//fruit[@state='rotten']"):
    # grab the parent of the element and call remove directly on it
    bad.getparent().remove(bad)

print(et.tostring(tree, pretty_print=True, xml_declaration=True).decode())

Compared with @Acorn's version, mine also works when the elements to remove are not directly under the root node of your XML.

You're looking for the remove function. Call the remove method on the element's parent and pass it the subelement to remove.

import lxml.etree as et

xml="""
<groceries>
  <fruit state="rotten">apple</fruit>
  <fruit state="fresh">pear</fruit>
  <punnet>
    <fruit state="rotten">strawberry</fruit>
    <fruit state="fresh">blueberry</fruit>
  </punnet>
  <fruit state="fresh">starfruit</fruit>
  <fruit state="rotten">mango</fruit>
  <fruit state="fresh">peach</fruit>
</groceries>
"""

tree=et.fromstring(xml)

for bad in tree.xpath("//fruit[@state='rotten']"):
    bad.getparent().remove(bad)

print(et.tostring(tree, pretty_print=True).decode())

Result:

<groceries>
  <fruit state="fresh">pear</fruit>
  <punnet>
    <fruit state="fresh">blueberry</fruit>
  </punnet>
  <fruit state="fresh">starfruit</fruit>
  <fruit state="fresh">peach</fruit>
</groceries>

I ran into one situation:

<div>
    <script>
        some code
    </script>
    text here
</div>

div.remove(script) also removes the "text here" part, which I did not intend.

Following the answer here, I found that etree.strip_elements is a better solution for me, because the with_tail=(bool) parameter lets you control whether the text after the element is removed as well.

I still don't know whether this can be combined with an XPath filter on the tag. I am just posting this for information.

Here is the documentation:

strip_elements(tree_or_element, *tag_names, with_tail=True)

Delete all elements with the provided tag names from a tree or subtree. This will remove the elements and their entire subtree, including all their attributes, text content and descendants. It will also remove the tail text of the element unless you explicitly set the with_tail keyword argument option to False.

Tag names can contain wildcards as in _Element.iter.

Note that this will not delete the element (or ElementTree root element) that you passed even if it matches. It will only treat its descendants. If you want to include the root element, check its tag name directly before even calling this function.

Example usage::

    strip_elements(some_element,
        'simpletagname',             # non-namespaced tag
        '{http://some/ns}tagname',   # namespaced tag
        '{http://some/other/ns}*',   # any tag from a namespace
        lxml.etree.Comment           # comments
        )
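
As a minimal sketch of the with_tail option, applied to a compact version of the div/script situation described above (the input string and variable names are illustrative assumptions, not part of the quoted documentation):

```python
from lxml import etree

SOURCE = "<div><script>some code</script>text here</div>"

# Default (with_tail=True): the text following <script> is removed too.
div_default = etree.fromstring(SOURCE)
etree.strip_elements(div_default, "script")
without_tail = etree.tostring(div_default, encoding="unicode")

# with_tail=False: <script> is removed, the text that followed it stays.
div_keep = etree.fromstring(SOURCE)
etree.strip_elements(div_keep, "script", with_tail=False)
with_tail_kept = etree.tostring(div_keep, encoding="unicode")

print(without_tail)
print(with_tail_kept)
```

Note that strip_elements selects by tag name only, so every script element below the given element is removed, not just one chosen node.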

As already mentioned, you can use the remove() method to delete (sub)elements from the tree:

for bad in tree.xpath("//fruit[@state=\'rotten\']"):
  bad.getparent().remove(bad)

But it removes the element together with its tail, which is a problem if you are processing mixed-content documents such as HTML:

<div><fruit state="rotten">avocado</fruit> Hello!</div>

Becomes:

<div></div>

Which, I suppose, is not always what you want :) I have created a helper function that removes just the element and keeps its tail:

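def remove_element(el):
    parent = el.getparent()
    # el.tail is None when no text follows the element
    if el.tail and el.tail.strip():
        prev = el.getprevious()
        # compare with None: an element without children is falsy
        if prev is not None:
            prev.tail = (prev.tail or '') + el.tail
        else:
            parent.text = (parent.text or '') + el.tail
    parent.remove(el)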

for bad in tree.xpath("//fruit[@state=\'rotten\']"):
    remove_element(bad)

This way it will keep the tail text:

<div> Hello!</div>

You could also use the html module from lxml to solve that:

from lxml import html

xml="""
<groceries>
  <fruit state="rotten">apple</fruit>
  <fruit state="fresh">pear</fruit>
  <fruit state="fresh">starfruit</fruit>
  <fruit state="rotten">mango</fruit>
  <fruit state="fresh">peach</fruit>
</groceries>
"""

tree = html.fromstring(xml)

print("//BEFORE")
print(html.tostring(tree, pretty_print=True).decode("utf-8"))

for i in tree.xpath("//fruit[@state='rotten']"):
    i.drop_tree()

print("//AFTER")
print(html.tostring(tree, pretty_print=True).decode("utf-8"))

It should output this:

//BEFORE
<groceries>
  <fruit state="rotten">apple</fruit>
  <fruit state="fresh">pear</fruit>
  <fruit state="fresh">starfruit</fruit>
  <fruit state="rotten">mango</fruit>
  <fruit state="fresh">peach</fruit>
</groceries>


//AFTER
<groceries>

  <fruit state="fresh">pear</fruit>
  <fruit state="fresh">starfruit</fruit>

  <fruit state="fresh">peach</fruit>
</groceries>
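
The blank lines in the //AFTER output are the whitespace tails of the removed elements. As far as the lxml.html documentation describes it, drop_tree() removes the element and its children but, unlike getparent().remove(), keeps the tail text by merging it into the previous sibling or the parent's text. A short sketch on a mixed-content fragment (the fragment is an illustrative assumption, not part of the answer above):

```python
from lxml import html

# A mixed-content fragment: " Hello!" is the tail of <fruit>.
div = html.fromstring('<div><fruit state="rotten">avocado</fruit> Hello!</div>')

for bad in div.xpath("//fruit[@state='rotten']"):
    bad.drop_tree()  # drops <fruit> and its content, keeps its tail

result = html.tostring(div, encoding="unicode")
print(result)
```

This makes drop_tree() a reasonable choice for HTML-like input, where text after an element usually matters.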

The remove function detaches an element from the tree and therefore removes the XML node (Element, PI or Comment), its content (the descendant items) and the tail text. Here, preserving the tail text is superfluous because it only contains spaces and a newline, which can be considered ignorable whitespace.
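
To check what the tails actually hold, one can inspect the .tail attribute directly. A small sketch, using a shortened version of the groceries document (the shortened input and the variable names are assumptions for illustration):

```python
from lxml import etree

root = etree.fromstring(
    "<groceries>\n"
    '  <fruit state="rotten">apple</fruit>\n'
    '  <fruit state="fresh">pear</fruit>\n'
    "</groceries>"
)

# A tail is the text between an element's end tag and the next tag.
tails = [fruit.tail for fruit in root]
print(tails)

# True when every tail is whitespace only, as in this document.
only_whitespace = all(tail.strip() == "" for tail in tails)
print(only_whitespace)
```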

To remove an element (and its content) while preserving its tail, you can use the following function:

def remove_node(child, keep_content=False):
    """
    Remove an XML element, preserving its tail text.

    :param child: XML element to remove
    :param keep_content: ``True`` to keep child text and sub-elements.
    """
    parent = child.getparent()
    parent_text = parent.text or u""
    prev_node = child.getprevious()
    if keep_content:
        # insert: child text
        child_text = child.text or u""
        if prev_node is None:
            parent.text = u"{0}{1}".format(parent_text, child_text) or None
        else:
            prev_tail = prev_node.tail or u""
            prev_node.tail = u"{0}{1}".format(prev_tail, child_text) or None
        # insert: child elements
        index = parent.index(child)
        parent[index:index] = child[:]
    # insert: child tail
    parent_text = parent.text or u""
    prev_node = child.getprevious()
    child_tail = child.tail or u""
    if prev_node is None:
        parent.text = u"{0}{1}".format(parent_text, child_tail) or None
    else:
        prev_tail = prev_node.tail or u""
        prev_node.tail = u"{0}{1}".format(prev_tail, child_tail) or None
    # remove: child
    parent.remove(child)

Here is a demo:

from lxml import etree

tree = etree.XML(u"<root>text <bad>before <bad>inner</bad> after</bad> tail</root>")
bad1 = tree.xpath("//bad[1]")[0]
remove_node(bad1)

etree.dump(tree)
# <root>text  tail</root>

If you want to preserve the content, you can do:

tree = etree.XML(u"<root>text <bad>before <bad>inner</bad> after</bad> tail</root>")
bad1 = tree.xpath("//bad[1]")[0]
remove_node(bad1, keep_content=True)

etree.dump(tree)
# <root>text before <bad>inner</bad> after tail</root>
