
Why is lxml.etree.iterparse() eating up all my memory?

import lxml.etree

for schedule in lxml.etree.iterparse('really-big-file.xml', tag='schedule'):
    print "why does this consume all my memory?"

This eventually consumes all my available memory and then the process is killed. I've tried changing the tag from schedule to 'smaller' tags, but that didn't make a difference.

I can easily cut the file up and process it in smaller chunks, but that's uglier than I'd like.

What am I doing wrong / how can I process this large file with iterparse()?

As iterparse iterates over the entire file, a tree is built and no elements are freed. The advantage of doing this is that the elements remember who their parent is, and you can form XPaths that refer to ancestor elements. The disadvantage is that it can consume a lot of memory.
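
For example, every element yielded at its end event still carries live references back to its ancestors, which is why the whole tree accumulates in memory unless it is cleared; a minimal sketch, reusing the file name and tag from the question:

import lxml.etree

# Each yielded element still knows its ancestors, so the full tree stays alive.
for event, elem in lxml.etree.iterparse('really-big-file.xml', tag='schedule'):
    parent = elem.getparent()                      # the parent is still reachable
    ancestors = elem.xpath('ancestor-or-self::*')  # ancestor XPaths still work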

In order to free some memory as you parse, use Liza Daly's fast_iter:

def fast_iter(context, func, *args, **kwargs):
    """
    http://lxml.de/parsing.html#modifying-the-tree
    Based on Liza Daly's fast_iter
    http://www.ibm.com/developerworks/xml/library/x-hiperfparse/
    See also http://effbot.org/zone/element-iterparse.htm
    """
    for event, elem in context:
        func(elem, *args, **kwargs)
        # It's safe to call clear() here because no descendants will be
        # accessed
        elem.clear()
        # Also eliminate now-empty references from the root node to elem
        for ancestor in elem.xpath('ancestor-or-self::*'):
            while ancestor.getprevious() is not None:
                del ancestor.getparent()[0]
    del context

which you could then use like this:

def process_element(elem):
    print "why does this consume all my memory?"

context = lxml.etree.iterparse('really-big-file.xml', tag='schedule', events=('end',))
fast_iter(context, process_element)

I highly recommend the article on which the above fast_iter is based; it should be especially interesting to you if you are dealing with large XML files.

The fast_iter presented above is a slightly modified version of the one shown in the article. This one is more aggressive about deleting previous ancestors, and thus saves more memory. Here you'll find a script which demonstrates the difference.
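
For comparison, the version in Liza Daly's article only deletes the already-processed siblings that precede the current element, rather than walking every ancestor. A sketch of that original pattern, renamed here so it doesn't clash with the fast_iter above (reconstructed from the article, so treat the details as approximate):

def fast_iter_original(context, func):
    # Liza Daly's original pattern: clear the processed element and delete its
    # preceding (already handled) siblings, but leave ancestor levels alone.
    for event, elem in context:
        func(elem)
        elem.clear()
        while elem.getprevious() is not None:
            del elem.getparent()[0]
    del context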

Directly copied from http://effbot.org/zone/element-iterparse.htm

Note that iterparse still builds a tree, just like parse, but you can safely rearrange or remove parts of the tree while parsing. For example, to parse large files, you can get rid of elements as soon as you've processed them:

for event, elem in iterparse(source):
    if elem.tag == "record":
        ... process record elements ...
        elem.clear()

The above pattern has one drawback; it does not clear the root element, so you will end up with a single element with lots of empty child elements. If your files are huge, rather than just large, this might be a problem. To work around this, you need to get your hands on the root element. The easiest way to do this is to enable start events, and save a reference to the first element in a variable:

# get an iterable
context = iterparse(source, events=("start", "end"))

# turn it into an iterator
context = iter(context)

# get the root element
event, root = context.next()

for event, elem in context:
    if event == "end" and elem.tag == "record":
        ... process record elements ...
        root.clear()
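
Note that context.next() is Python 2 syntax; under Python 3 the same root-clearing pattern would look roughly like this (a sketch using lxml and the 'record' tag from the excerpt above):

import lxml.etree

# Python 3 version of the effbot pattern above, using lxml.etree.iterparse.
context = lxml.etree.iterparse('really-big-file.xml', events=('start', 'end'))
context = iter(context)
event, root = next(context)        # next(context) instead of context.next()

for event, elem in context:
    if event == 'end' and elem.tag == 'record':
        # ... process the record element here ...
        root.clear()               # drop already-processed children from the root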

This worked really well for me:

def destroy_tree(tree):
    root = tree.getroot()

    # Map each node to its depth in the tree and its parent.
    node_tracker = {root: [0, None]}

    for node in root.iterdescendants():
        parent = node.getparent()
        node_tracker[node] = [node_tracker[parent][0] + 1, parent]

    # Sort deepest nodes first so children are removed before their parents.
    node_tracker = sorted([(depth, parent, child) for child, (depth, parent)
                           in node_tracker.items()], key=lambda x: x[0], reverse=True)

    for _, parent, child in node_tracker:
        if parent is None:
            break
        parent.remove(child)

    del tree
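
A minimal usage sketch for destroy_tree, assuming a tree parsed the usual way (the file name is just the one from the question):

import lxml.etree

tree = lxml.etree.parse('really-big-file.xml')
# ... work with the fully parsed tree here ...
destroy_tree(tree)   # unlink the tree bottom-up so its memory can be reclaimed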
