Quickest/Best way to traverse XML with lxml in Python

Question

I have an XML file that looks like this:

xml = '''<?xml version="1.0"?>
        <root>
            <item>text</item>
            <item2>more text</item2>
            <targetroot>
                <targetcontainer>
                    <target>text i want to get</target>
                </targetcontainer>
                <targetcontainer>
                    <target>text i want to get</target>
                </targetcontainer>
            </targetroot>
            ...more items
        </root>
'''

With lxml I'm trying to acces the text in the element < target >. I've found a solution, but I'm sure there is a better, more efficient way to do this. My solution:

target = etree.XML(xml)

for x in target.getiterator('root'):
    item1 = x.findtext('item')
    for target in x.iterchildren('targetroot'):
        for t in target.iterchildren('targetcontainer'):
            targetText = t.findtext('target')

Although this works, as it gives me acces to all the elements in root as well as the target element, I'm having a hard time believing this is the most efficient solution.

So my question is this: is there a more efficient way to access the < target >'s texts while staying in the loop of root, because I also need access to the other elements.

Answer 1

You can use XPath :

for x in target.xpath('/root/targetroot/targetcontainer/target'):
    print x.text

We ask all elements that match a path . In this case, the path is /root/targetroot/targetcontainer/target , which means

all the <target> elements that are inside a <targetcontainer> element, inside a <targetroot> element, inside a <root> element. Also, the <root> element should be the document root because it is preceded by / , which means the beginning of the document.

Also, your XML document had two problems. First, the <?xml version="1.0"?> declaration should be the very first thing in the document - and in this example it is preceded by a newline and some space. Also, it is not a tag and should not be closed, so the </xml> at the end of your string should be removed. I already edited your question anyway.

EDIT : this solution can be improved yet. You do not need to pass all the path - you can just ask to all elements <target> inside the document. This is done by preceding the tag name by two slashes. Since you want all the <target> texts, independent of where they are, this can be a better solution. So, the loop above can be written just as:

for x in target.xpath('//target'):
    print x.text

I tried it at first but it did not worked. The problem, however, was the syntax problems in the XML, not the XPath, but I tried the other, longer path and forgot to retry this one. Sorry! Anyway, I hope I put some light about XPath nonetheless :)

Quickest/Best way to traverse XML with lxml in Python

Question

1 answers

solution1
3 ACCPTED 2011-12-17 23:11:34

Quickest/Best way to traverse XML with lxml in Python

Question

1 answers

solution1 3 ACCPTED 2011-12-17 23:11:34

solution1
3 ACCPTED 2011-12-17 23:11:34