
Remove element from XML with ElementTree in Python

I have a .tmx file and I want to extract the text from the seg tag. However, because of the inline tags such as bpt and ept, I cannot extract this text, so I would like to remove the bpt tag completely. I tried the .remove() method, but this also removes the text.

I cannot use BeautifulSoup because my original file is a .tmx file.

ElementTree does not keep parent references in the XML tree. That's inconvenient but not the end of the world.

But in order to delete any node in an XML document, you need to delete it from its parent, so you need a way to get the parent node.
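One common way to get at parents is to build a child-to-parent map up front. The following is a minimal sketch, not part of the loop shown further down (which avoids the map by iterating over the parents directly); the sample markup and the name parent_of are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Sample markup modelled on the <seg> example in this answer (illustrative only).
root = ET.fromstring(
    '<tu><seg><bpt i="1">{</bpt>Cover page <ept i="1">}</ept></seg></tu>'
)

# Map every child element to its parent; the root itself has no entry.
parent_of = {child: parent for parent in root.iter() for child in parent}

bpt = root.find('.//bpt')
print(parent_of[bpt].tag)  # seg
```

Note that the map only locates the parent. Calling parent_of[bpt].remove(bpt) would still drop the text that follows the element, which is the problem the rest of this answer deals with.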

The easiest approach with ElementTree is to iterate over all potential parents and check whether each one has a child you want to delete.

Assuming <bpt> is always a child of <seg>, that means iterating over the <seg> elements:

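import xml.etree.ElementTree as ET

# 'file.tmx' is a placeholder path; substitute your own file
root = ET.parse('file.tmx').getroot()

for node in root.iter('seg'):
    prev = None  # last child that was kept
    for child in list(node):  # copy, because children are removed while looping
        if child.tag == 'bpt':
            # retain child node's tail, if any
            if child.tail is not None:
                if prev is None:
                    # no preceding sibling: the tail becomes part of the parent's text
                    node.text = (node.text or '') + child.tail
                else:
                    # otherwise append it to the preceding sibling's tail
                    prev.tail = (prev.tail or '') + child.tail
            node.remove(child)
        else:
            prev = child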

If <bpt> could be anywhere, change the above to for node in root.iter(): so that it iterates over all nodes.
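As a self-contained sketch of that generalisation, the loop can be wrapped in a function that visits every node as a potential parent. The function name remove_keeping_tail and the <body> wrapper are made up for this example; the <seg> content is the sample used in the explanation below.

```python
import xml.etree.ElementTree as ET

def remove_keeping_tail(root, tag):
    """Remove every <tag> element below root, keeping the text that follows it."""
    # Snapshot the nodes first so removals cannot disturb the traversal.
    for parent in list(root.iter()):
        prev = None  # last child of this parent that was kept
        for child in list(parent):
            if child.tag == tag:
                if child.tail:
                    if prev is None:
                        parent.text = (parent.text or '') + child.tail
                    else:
                        prev.tail = (prev.tail or '') + child.tail
                parent.remove(child)
            else:
                prev = child

xml = (r'<body><seg><bpt i="1">{\\f3 </bpt>Cover page <ept i="1">}</ept>'
       r'<bpt i="2">{\\f2 </bpt>U1 - Insert graphic<ept i="2">}</ept></seg></body>')
root = ET.fromstring(xml)
remove_keeping_tail(root, 'bpt')

seg = root.find('seg')
print(ET.tostring(seg, encoding='unicode'))
# <seg>Cover page <ept i="1">}</ept>U1 - Insert graphic<ept i="2">}</ept></seg>
```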

Explanation

ElementTree sub-divides the document tree in a rather idiosyncratic manner. One main drawback is that there are no "parent" references, and relative navigation between nodes is very limited in general. Another is that there are no text nodes.

Instead of being a stand-alone node, any text after an element (i.e. text directly following the closing </tag>) becomes a property of that element, called .tail:

<!-- <bpt> elements and their "tails" -->

<seg><bpt i="1">{\\f3 </bpt>Cover page <ept i="1">}</ept><bpt i="2">{\\f2 </bpt>U1 - Insert graphic<ept i="2">}</ept></seg>
<!-- -----------------------^^^^^^^^^^^                  -----------------------^^^^^^^^^^^^^^^^^^^                     -->
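To see where ElementTree stores each piece of text, the .text and .tail attributes can be inspected directly. This is a small sketch; the markup is a shortened variant of the sample line above, with a leading "Intro " added (an assumption for illustration) so that the parent's own .text is visible too.

```python
import xml.etree.ElementTree as ET

seg = ET.fromstring(
    '<seg>Intro <bpt i="1">{</bpt>Cover page <ept i="1">}</ept></seg>'
)
bpt, ept = list(seg)

print(repr(seg.text))  # 'Intro '       text before the first child
print(repr(bpt.text))  # '{'            text inside <bpt>
print(repr(bpt.tail))  # 'Cover page '  text after </bpt>
print(repr(ept.tail))  # None           nothing follows </ept>
```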

Consequently, if we remove the <bpt> element, the tail is lost, too. In order to save it, we must append the content to the preceding element's tail (as with "U1 - Insert graphic", which now belongs to the <ept>), or, if there is no preceding element, to the parent element's text (as with "Cover page ", which now belongs to the <seg>):

<!-- <bpt> elements removed, "tails" moved one to the front -->

<seg>Cover page <ept i="1">}</ept>U1 - Insert graphic<ept i="2">}</ept></seg>
<!-- ^^^^^^^^^^^                  ^^^^^^^^^^^^^^^^^^^                     -->

Repeating the same removal process with <ept> would lead to the following: all "tails" are now merged into <seg>'s text:

<seg>Cover page U1 - Insert graphic</seg>
<!-- ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^   -->
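If the goal is only to read the text, as in the question, the same string can also be collected without modifying the tree. This is an alternative to the removal loop, not the method described above. It is a sketch that assumes the unwanted tags are direct children of <seg>; the function name seg_text and its skip parameter are made up for this example.

```python
import xml.etree.ElementTree as ET

def seg_text(seg, skip=('bpt', 'ept')):
    """Return the text of seg, ignoring the content of direct children in skip."""
    parts = [seg.text or '']
    for child in seg:
        if child.tag not in skip:
            # keep the full text content of children that are not skipped
            parts.append(''.join(child.itertext()))
        # the tail follows the child's closing tag and is always kept
        parts.append(child.tail or '')
    return ''.join(parts)

seg = ET.fromstring(
    r'<seg><bpt i="1">{\\f3 </bpt>Cover page <ept i="1">}</ept>'
    r'<bpt i="2">{\\f2 </bpt>U1 - Insert graphic<ept i="2">}</ept></seg>'
)
print(seg_text(seg))  # Cover page U1 - Insert graphic
```

Because nothing is removed, the original document stays intact and can still be written back with its inline tags.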


 