xml.etree.ElementTree iterparse() still using lots of memory?

I've been experimenting with iterparse to reduce the memory footprint of my scripts that need to process large XML docs. Here's an example. I wrote this simple script to read a TMX file and split it into one or more output files that do not exceed a user-specified size. Despite using iterparse, when I split an 886 MB file into 100 MB files, the script consumes all available memory, grinding to a crawl after using 6.5 of my 8 GB.

Am I doing something wrong? Why does the memory usage go so high?

#! /usr/bin/python
# -*- coding: utf-8 -*-
import argparse
import codecs
from xml.etree.ElementTree import iterparse, tostring
from sys import getsizeof

def startNewOutfile(infile, i, roottxt, headertxt):
    # Open the next numbered output file and write the XML declaration,
    # DOCTYPE, and the <tmx> and <header> markup captured from the input.
    out = open(infile.replace('.tmx', '.' + str(i) + '.tmx'), 'w')
    print >>out, '<?xml version="1.0" encoding="UTF-8"?>'
    print >>out, '<!DOCTYPE tmx SYSTEM "tmx14.dtd">'
    print >>out, roottxt
    print >>out, headertxt
    print >>out, '<body>'
    return out

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('-m', '--maxsize', dest='maxsize', required=True, type=float, help='max size (in MB) of output files')
    parser.add_argument(dest='infile', help='.tmx file to be split')
    args = parser.parse_args()

    maxsize = args.maxsize * 1024 * 1024

    nodes = iter(iterparse(args.infile, events=['start','end']))

    _, root = next(nodes)
    _, header = next(nodes)

    roottxt = tostring(root).strip()
    headertxt = tostring(header).strip()

    i = 1
    curr_size = getsizeof(roottxt) + getsizeof(headertxt)
    out = startNewOutfile(args.infile, i, roottxt, headertxt)

    for event, node in nodes:
        if event == 'end' and node.tag == 'tu':
            nodetxt = tostring(node, encoding='utf-8').strip()
            curr_size += getsizeof(nodetxt)
            print >>out, nodetxt
        if curr_size > maxsize:
            curr_size = getsizeof(roottxt) + getsizeof(headertxt)
            print >>out, '</body>'
            print >>out, '</tmx>'
            out.close()
            i += 1
            out = startNewOutfile(args.infile, i, roottxt, headertxt)
        root.clear()

    print >>out, '</body>'
    print >>out, '</tmx>'
    out.close()
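The script takes the maximum output size in MB and the input file on the command line. A hypothetical invocation (the script and file names here are illustrative, not from the original post):

python splittmx.py -m 100 corpus.tmx

would split corpus.tmx into corpus.1.tmx, corpus.2.tmx, … of roughly 100 MB each.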

Found the answer in a related question: Why is elementtree.ElementTree.iterparse using so much memory?

One needs not only root.clear() but also node.clear() at each iteration of the for loop. iterparse still builds the tree as it parses, so each fully processed tu element has to be cleared explicitly or its children accumulate in memory. Because we're processing both start and end events, though, we need to be careful not to clear tu nodes too soon; a node is only safe to clear after its end event has fired:

for e, node in nodes:
    if e == 'end' and node.tag == 'tu':
        nodetxt = tostring(node, encoding='utf-8').strip()
        curr_size += getsizeof(nodetxt)
        print >>out, nodetxt
        node.clear()  # free this <tu> now that it has been written out
    if curr_size > maxsize:
        curr_size = getsizeof(roottxt) + getsizeof(headertxt)
        print >>out, '</body>'
        print >>out, '</tmx>'
        out.close()
        i += 1
        out = startNewOutfile(args.infile, i, roottxt, headertxt)
    root.clear()  # detach already-processed children from the root
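For reference, here is a minimal sketch of the same clear-as-you-go pattern in Python 3, where print(..., file=out) replaces the Python 2 print >>out syntax used above. The function name, input file name, and the tu tag are illustrative assumptions, not part of the original script:

#!/usr/bin/env python3
# A minimal sketch (Python 3) of the clear-as-you-go iterparse pattern.
from xml.etree.ElementTree import iterparse, tostring

def stream_tus(path):
    # Yield each serialized tu element while keeping memory use flat.
    nodes = iterparse(path, events=('start', 'end'))
    _, root = next(nodes)  # the first 'start' event delivers the root element
    for event, node in nodes:
        if event == 'end' and node.tag == 'tu':
            yield tostring(node, encoding='unicode').strip()
            node.clear()  # free this element's text and children
        root.clear()      # detach finished children from the root

if __name__ == '__main__':
    for tu in stream_tus('corpus.tmx'):  # hypothetical input file
        print(tu)

Isolating the parse-and-clear details in a generator keeps the memory management in one place; the original script's size bookkeeping and output-file rotation would then sit in the caller.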
