
lxml and fast_iter eating all the memory

I want to parse a 1.6 GB XML file with Python (2.7.2) using lxml (3.2.0) on OS X (10.8.2). Because I had already read about potential memory-consumption issues, I use fast_iter, but after the main loop it eats up about 8 GB of RAM, even though it doesn't keep any data from the actual XML file.

from lxml import etree

def fast_iter(context, func, *args, **kwargs):
    # http://www.ibm.com/developerworks/xml/library/x-hiperfparse/
    # Author: Liza Daly
    for event, elem in context:
        func(elem, *args, **kwargs)
        elem.clear()                 # drop the element's children, text and attributes
        while elem.getprevious() is not None:
            del elem.getparent()[0]  # detach already-processed preceding siblings
    del context

def process_element(elem):
    pass

context = etree.iterparse("sachsen-latest.osm", tag="node", events=("end", ))
fast_iter(context, process_element)

I don't get why there is such massive leakage, because the element and the whole context are deleted in fast_iter(), and at the moment I don't even process the XML data.

Any ideas?

The problem is with the behavior of etree.iterparse(). You would think it only uses memory for each node element, but it turns out it still keeps all the other elements in memory. Since you don't clear them, memory ends up blowing up later on, especially when parsing .osm (OpenStreetMap) files and looking for nodes, but more on that later.
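This is easy to see on a small in-memory document. In the sketch below (the tiny OSM-like sample and the io.BytesIO input are stand-ins for the real file), iterparse is given tag="node", so the way and relation elements are never yielded, the clearing code never runs on them, and they stay attached to the tree:

```python
import io
from lxml import etree

# Tiny OSM-like document standing in for the real 1.6 GB file.
xml = b"""<osm>
  <node id="1"/>
  <way id="2"/>
  <relation id="3"/>
</osm>"""

context = etree.iterparse(io.BytesIO(xml), events=("end",), tag="node")
root = None
for event, elem in context:
    root = elem.getparent()  # keep a handle on the root so we can inspect it
    elem.clear()
    while elem.getprevious() is not None:
        del elem.getparent()[0]

# The <way> and <relation> elements were never yielded, so they were
# never cleared and are still attached to the tree after the loop.
leftover = [child.tag for child in root]
print(leftover)  # -> ['node', 'way', 'relation']
```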

The solution I found was not to catch only node tags but to catch all tags:

context = etree.iterparse(open(filename, 'rb'), events=('end',))

And then clear all the tags, but only process the ones you are interested in:

for event, elem in context:  # wrap context in a progress bar here if you like
    if elem.tag == 'node':
        pass  # do things here

    elem.clear()
    while elem.getprevious() is not None:
        del elem.getparent()[0]
del context
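The clear-everything loop above can be exercised end to end on a small in-memory document; a minimal sketch, where the io.BytesIO input and the sample data are stand-ins for the real file:

```python
import io
from lxml import etree

# Hypothetical OSM-like sample replacing the 1.6 GB input.
xml = b"""<osm>
  <node id="1"/>
  <node id="2"/>
  <way id="3"/>
</osm>"""

seen = []
context = etree.iterparse(io.BytesIO(xml), events=("end",))
for event, elem in context:
    if elem.tag == "node":
        seen.append(elem.get("id"))
    # Clear every element, whatever its tag, so nothing accumulates.
    elem.clear()
    while elem.getprevious() is not None:
        del elem.getparent()[0]
del context

print(seen)  # -> ['1', '2']
```

Only the node elements are processed, but every element (including way) gets cleared, so memory stays bounded for the whole file.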

Do keep in mind that this may clear other elements you are interested in, so make sure to add more ifs where needed. For example (and this is .osm specific), tag elements nested inside nodes:

if elem.tag == 'tag':
    continue  # handled below via its parent <node>
if elem.tag == 'node':
    for tag in elem.iterchildren():
        pass  # do stuff with each nested <tag>
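Putting the two checks together, here is a runnable sketch; the node/tag sample data is invented for illustration:

```python
import io
from lxml import etree

# Hypothetical OSM-style fragment: <tag> elements nested inside <node>.
xml = b"""<osm>
  <node id="1"><tag k="name" v="A"/></node>
  <node id="2"><tag k="name" v="B"/></node>
</osm>"""

names = []
context = etree.iterparse(io.BytesIO(xml), events=("end",))
for event, elem in context:
    if elem.tag == "tag":
        continue  # will be read via its parent <node> below
    if elem.tag == "node":
        for tag in elem.iterchildren():
            if tag.get("k") == "name":
                names.append(tag.get("v"))
    elem.clear()
    while elem.getprevious() is not None:
        del elem.getparent()[0]
del context

print(names)  # -> ['A', 'B']
```

Skipping the tag end-events is safe here because each node is cleared only after its children have been read, and clearing the node drops its nested tags as well.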

The reason memory blows up only later is pretty interesting: .osm files are organized so that nodes come first, then ways, then relations. So your code does fine with the nodes at the beginning, and then memory fills up as etree goes through the rest of the elements.
