Python ElementTree

Question

I have an XML file with the following format:

<dir name="A">
    <dir name="B">
        <file name="foo.txt"/>
    </dir>
    <dir name="C">
        <dir name="D">
            <file name="bar.txt"/>
        </dir>
    </dir>
</dir>
<dir name="E">
    <file name="bat.txt"/>
    <file name="cat.txt"/>
</dir>
<dir name="F">
    <dir name="G">
        <file name="dog.txt"/>
        <file name="rabbit.txt"/>
    </dir>
</dir>

I would like to use the Python ElementTree module to remove any dir elements that contain a dir element inside them. That is, I would like to get the innermost dir elements of the XML file (the ones that do not contain another dir element), along with all their children. I want each such element to be moved to the top level. For example, for the above XML file, the corresponding output file would be:

<dir name="B">
    <file name="foo.txt"/>
</dir>
<dir name="D">
    <file name="bar.txt"/>
</dir>
<dir name="E">
    <file name="bat.txt"/>
    <file name="cat.txt"/>
</dir>
<dir name="G">
    <file name="dog.txt"/>
    <file name="rabbit.txt"/>
</dir>

How can I achieve this?

Answer 1

Notice the order in which the Elements are visited when you use iterparse -- it's a depth-first search:

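import xml.etree.ElementTree as ET

# 'data' is assumed to hold the XML above wrapped in a single root element;
# a file with several top-level elements would raise a ParseError.
with open('data', 'rb') as f:
    context = ET.iterparse(f, events=('start', 'end'))
    for event, elem in context:
        if elem.tag == 'dir':
            name = elem.get('name')
            # under Python 2 this prints a tuple such as ('start', 'A')
            print(event, name)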

yields

('start', 'A')
('start', 'B')     <-- ('start', 'B') is followed immediately by ('end', 'B')
('end', 'B')       <--
('start', 'C')
('start', 'D')     <-- start is followed immediately by end
('end', 'D')
('end', 'C')
('end', 'A')
('start', 'E')     <-- start is followed immediately by end
('end', 'E')
('start', 'F')
('start', 'G')     <-- start is followed immediately by end
('end', 'G')
('end', 'F')

Aha: the Elements you are looking for, the most deeply nested dir Elements, are the ones for which iterparse emits a start event followed immediately by an end event (at least when we only look at dir Elements).

So using this idea, we can then collect those Elements in a new root Element to obtain the desired XML:

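import xml.etree.ElementTree as ET

root = ET.Element('root')
previous_name = None
# 'data' is assumed to hold the XML above wrapped in a single root element
with open('data', 'rb') as f:
    context = ET.iterparse(f, events=('start', 'end'))
    for event, elem in context:
        if elem.tag == 'dir':
            name = elem.get('name')
            if event == 'start':
                # remember the most recently opened dir
                previous_name = name
            elif previous_name == name:
                # end event right after this dir's own start: no nested dir
                root.append(elem)
# tostring returns bytes by default; decode so the XML prints as text
print(ET.tostring(root).decode())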

yields

<root><dir name="B">
        <file name="foo.txt" />
    </dir>
    <dir name="D">
            <file name="bar.txt" />
        </dir>
    <dir name="E">
    <file name="bat.txt" />
    <file name="cat.txt" />
</dir>
<dir name="G">
        <file name="dog.txt" />
        <file name="rabbit.txt" />
    </dir>
</root>
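
The name comparison above can misfire when a dir has the same name as the last dir opened inside it. For documents small enough to load whole, a minimal sketch of a variant that avoids this is to test each dir for a dir descendant directly. The function name leaf_dirs and the wrapping root element in the sample are illustrative assumptions, not part of the original answer:

```python
import xml.etree.ElementTree as ET

def leaf_dirs(xml_text):
    """Return a new <root> holding every dir that has no dir descendant."""
    source = ET.fromstring(xml_text)
    result = ET.Element('root')
    for elem in source.iter('dir'):
        # './/dir' searches only below elem, so elem itself never matches
        if elem.find('.//dir') is None:
            result.append(elem)
    return result

# Sample input: the question's layout, wrapped in one root (an assumption)
sample = """<root>
<dir name="A">
    <dir name="B"><file name="foo.txt"/></dir>
    <dir name="C"><dir name="D"><file name="bar.txt"/></dir></dir>
</dir>
<dir name="E"><file name="bat.txt"/><file name="cat.txt"/></dir>
<dir name="F"><dir name="G"><file name="dog.txt"/></dir></dir>
</root>"""

names = [d.get('name') for d in leaf_dirs(sample)]
print(names)
```

This trades the streaming behaviour of iterparse for a simpler test, so it only suits files that fit comfortably in memory.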

Note that the iterparse code above does not clear any elements after visiting them. If your XML is huge, using iterparse without clearing elements may use too much memory. In that case, for both performance and better memory management, I'd use lxml and fast_iter.
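
If staying with the standard library, a rough sketch of the clearing idea is to serialize each innermost dir as soon as its end event arrives and then clear it. This is not the lxml fast_iter recipe, only an approximation: the generator name iter_leaf_dirs and the wrapping root element are assumptions, and the emptied elements stay attached to their parents, so memory still grows slowly on very large inputs.

```python
import io
import xml.etree.ElementTree as ET

def iter_leaf_dirs(source):
    """Yield each innermost dir as an XML string, clearing dirs once done."""
    last_started = None  # most recently opened dir element
    for event, elem in ET.iterparse(source, events=('start', 'end')):
        if elem.tag != 'dir':
            continue
        if event == 'start':
            last_started = elem
        else:
            # identity check: this dir closed with no dir opened inside it
            if elem is last_started:
                yield ET.tostring(elem, encoding='unicode').strip()
            # drop children, text and attributes of the finished dir
            elem.clear()

# Sample input: the question's layout, wrapped in one root (an assumption)
sample = b"""<root>
<dir name="A">
    <dir name="B"><file name="foo.txt"/></dir>
    <dir name="C"><dir name="D"><file name="bar.txt"/></dir></dir>
</dir>
<dir name="E"><file name="bat.txt"/><file name="cat.txt"/></dir>
<dir name="F"><dir name="G"><file name="dog.txt"/></dir></dir>
</root>"""

leaves = list(iter_leaf_dirs(io.BytesIO(sample)))
for xml_text in leaves:
    print(xml_text)
```

Comparing element identity rather than the name attribute also keeps the check correct when nested dirs share a name.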
