lxml.etree iterparse() and parsing element completely

Question

I have an XML file with nodes that looks like this:

<trkpt lat="-37.7944415" lon="144.9616159">
  <ele>41.3681107</ele>
  <time>2015-04-11T03:52:33.000Z</time>
  <speed>3.9598</speed>
</trkpt>

I am using lxml.etree.iterparse() to parse the tree iteratively. I loop over each trkpt element's children and want to print the text value of each child node. For example:

for event, element in etree.iterparse(infile, events=("start", "end")):
    if element.tag == NAMESPACE + 'trkpt':
        for child in list(element):
            print child.text

The problem is that at this stage the node has no text, so the output of the print is 'None'.

I verified this by replacing the 'print child.text' statement with 'print etree.tostring(child)'; the output looks like this:

<ele/>
<time/>
<speed/>

According to the documentation, "Note that the text, tail, and children of an Element are not necessarily present yet when receiving the start event. Only the end event guarantees that the Element has been parsed completely."

So I changed my for loop to this (note the 'if event == "end":' statement):

for event, element in etree.iterparse(infile, events=("start", "end")):
    if element.tag == NAMESPACE + 'trkpt':
        if event == "end":
            for child in list(element):
                print child.text

But I am still getting the same results. Any help would be greatly appreciated.

Answer 1

Do you need to use iterparse specifically, or can you use other methods?

e.g.

from lxml import etree

tree = etree.parse('/path/to/file')
root = tree.getroot()
# NAMESPACE as in the question; iter() also finds trkpt elements nested below the root
for element in root.iter(NAMESPACE + 'trkpt'):
    for child in element:
        print child.text

lxml is pretty good at parsing without taking up too much memory, although etree.parse() does load the whole tree at once. I'm not sure whether this solves your problem or whether you specifically need iterparse.
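As a rough, self-contained sketch of the whole-tree approach (written for Python 3, so print is a function): the namespace URI, the gpx/trk/trkseg wrapper elements, and the in-memory sample are all assumptions made up for illustration, not taken from the asker's file. It shows that a bare tag name passed to findall() matches nothing when the elements are namespaced and nested, while iter() with the qualified tag finds them.

```python
import io

try:
    from lxml import etree
except ImportError:
    # The standard library exposes the same parse/findall/iter calls used here.
    import xml.etree.ElementTree as etree

# Hypothetical namespace and sample data, standing in for the real file.
NAMESPACE = "{http://example.com/track}"
SAMPLE = b"""<gpx xmlns="http://example.com/track">
  <trk><trkseg>
    <trkpt lat="-37.7944415" lon="144.9616159">
      <ele>41.3681107</ele>
      <time>2015-04-11T03:52:33.000Z</time>
      <speed>3.9598</speed>
    </trkpt>
  </trkseg></trk>
</gpx>"""

root = etree.parse(io.BytesIO(SAMPLE)).getroot()

# A bare tag name matches only un-namespaced direct children of root.
bare_matches = root.findall("trkpt")

# iter() walks the whole subtree and matches the fully qualified tag.
points = [
    {child.tag[len(NAMESPACE):]: child.text for child in trkpt}
    for trkpt in root.iter(NAMESPACE + "trkpt")
]

print(bare_matches)
print(points)
```

If the real file is too large to hold in memory, this approach stops being practical and iterparse is the better fit.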

Answer 2

Are you sure that you don't call, e.g., element.clear() after your conditional statement, like this?

for event, element in etree.iterparse(infile, events=("start", "end")):
    if element.tag == NAMESPACE + 'trkpt' and event == 'end':
        for child in list(element):
            print child.text
    element.clear()

The problem is that the parser issues the end events for the child elements before it issues the end event for trkpt (because it encounters the end tags of the nested elements first). If you modify the parsed elements, for example by clearing them, before the end event for the outer element arrives, you may see the behaviour you describe.
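A minimal sketch of that effect, assuming a made-up namespace and a small in-memory sample instead of the real file (Python 3 syntax; child_texts and clear_everything are hypothetical names). It listens only for end events so the ordering is deterministic: when every element is cleared as soon as its end event arrives, the children are already empty by the time trkpt ends.

```python
import io

try:
    from lxml import etree
except ImportError:
    # The standard library's iterparse and clear() behave the same way here.
    import xml.etree.ElementTree as etree

# Hypothetical namespace and sample data, for illustration only.
NAMESPACE = "{http://example.com/track}"
SAMPLE = b"""<gpx xmlns="http://example.com/track">
  <trkpt lat="-37.7944415" lon="144.9616159">
    <ele>41.3681107</ele>
    <time>2015-04-11T03:52:33.000Z</time>
    <speed>3.9598</speed>
  </trkpt>
</gpx>"""


def child_texts(xml_bytes, clear_everything):
    """Collect the text of each trkpt child when the trkpt end event arrives."""
    texts = []
    source = io.BytesIO(xml_bytes)
    for event, element in etree.iterparse(source, events=("end",)):
        is_trkpt = element.tag == NAMESPACE + "trkpt"
        if is_trkpt:
            texts.extend(child.text for child in element)
        # Clearing every element wipes <ele>, <time> and <speed> at their own
        # end events, which arrive before the end event of the enclosing trkpt.
        if clear_everything or is_trkpt:
            element.clear()
    return texts


print(child_texts(SAMPLE, clear_everything=True))
print(child_texts(SAMPLE, clear_everything=False))
```

The first call yields only None values, matching the symptom in the question; the second, which clears only trkpt itself after reading it, keeps the child text intact.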

Consider the following alternative:

for event, element in etree.iterparse(infile, events=('end',),
                                      tag=NAMESPACE + 'trkpt'):
    for child in element:
        print child.text
    element.clear()

solution1
0 2015-05-13 17:04:00

solution2
0 2015-11-22 16:11:42
