
XML parsing using fast_iter clearing data before done processing

I'm using Liza Daly's fast_iter, which has this structure:

def fast_iter(context, args=[], kwargs={}):
    """
    Deletes elements as the tree is traversed to prevent the full tree from building and save memory
    Author: Liza Daly, IBM
    """
    for event, elem in context:
        if elem.tag == 'target':
            func(elem, *args, **kwargs)
            
            elem.clear()
        while elem.getprevious() is not None:
            del elem.getparent()[0]
    del context
    return save

However, I've noticed that when I create my context as

context = etree.iterparse(path, events=('end',))

The data within the elem gets deleted before my function can even process it. For clarity, I am using fully synchronous code.

If I set my context as

context = etree.iterparse(path, events=('end',), tag='target')

it works correctly. However, I know it's not doing the full memory conservation that fast_iter is intended to provide.

Is there any reason to even use this compared to xml.dom.pulldom, a SAX parser which creates no tree? It seems like fast_iter attempts to replicate this while staying within lxml.

Does anyone have ideas on what I'm doing wrong? TIA

I think I can see how your approach deletes data you want to access before the code that accesses it is called. Let's assume you have, e.g.,

<target>
  <foo>test</foo>
  <bar>test</bar>
</target>

elements in your XML. Then, each time an end tag is found, your code

for event, elem in context:
    if elem.tag == 'target':
        func(elem, *args, **kwargs)
        
        elem.clear()
    while elem.getprevious() is not None:
        del elem.getparent()[0]

is run. It first encounters the foo end tag, then the bar end tag, at which point the while loop deletes the preceding foo sibling. When the target end tag is finally encountered, I assume your function looks for both the foo and the bar element data, but the foo element has already been deleted.
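The sequence described above can be reproduced with a small sketch. The wrapping root element, the sample bytes, and the helper name children_seen are assumptions made for illustration; the loop body mirrors the one in the question, with the call to func replaced by recording which child tags are still attached.

```python
import io
from lxml import etree

# Hypothetical sample document: one <target> wrapped in an assumed <root>.
XML = b"<root><target><foo>test</foo><bar>test</bar></target></root>"

def children_seen(xml_bytes):
    """Run the loop from the question and record which children each
    <target> still has when its end event fires."""
    seen = []
    context = etree.iterparse(io.BytesIO(xml_bytes), events=("end",))
    for event, elem in context:
        if elem.tag == "target":
            seen.append([child.tag for child in elem])
            elem.clear()
        # This runs for every element, including <foo> and <bar>,
        # so the end event of <bar> removes its sibling <foo>.
        while elem.getprevious() is not None:
            del elem.getparent()[0]
    return seen

print(children_seen(XML))  # [['bar']]
```

Only bar is left by the time target is processed, which matches the missing data reported in the question.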

So your code has to take the document structure (which you probably know) into account and skip that while loop for children and descendants of your target element.
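One way to do that, as a minimal sketch rather than a definitive fix, is to run the cleanup only when the current element is itself a target. The target_tag parameter, the child_tags callback, and the sample document are assumptions for illustration and are not part of the original fast_iter.

```python
import io
from lxml import etree

def fast_iter(context, func, target_tag, *args, **kwargs):
    """Call func on each finished target element, then free that element
    and the siblings that precede it."""
    for event, elem in context:
        if elem.tag != target_tag:
            # Children of a target that is still being parsed stay intact.
            continue
        func(elem, *args, **kwargs)
        elem.clear()
        # Only siblings before the processed target are removed.
        while elem.getprevious() is not None:
            del elem.getparent()[0]

def child_tags(elem, out):
    """Hypothetical callback: record the child tags of each target."""
    out.append([child.tag for child in elem])

# Hypothetical sample document with two targets.
XML = (b"<root>"
       b"<target><foo>test</foo><bar>test</bar></target>"
       b"<target><foo>a</foo><bar>b</bar></target>"
       b"</root>")

seen = []
context = etree.iterparse(io.BytesIO(XML), events=("end",))
fast_iter(context, child_tags, "target", seen)
print(seen)  # [['foo', 'bar'], ['foo', 'bar']]
```

Passing tag='target' to iterparse, as in the question, has a similar effect, because the loop then only ever sees target elements. In both variants, elements that are neither targets nor preceding siblings of a target are not freed, so how much memory is saved depends on the shape of the document.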
