xml.etree.ElementTree iterparse() still using lots of memory?

I've been experimenting with iterparse to reduce the memory footprint of my scripts that need to process large XML docs. Here's an example. I wrote this simple script to read a TMX file and split it into one or more output files that do not exceed a user-specified size. Despite using iterparse, when I split an 886 MB file into 100 MB files, the script consumes all available memory, grinding to a crawl after using 6.5 of my 8 GB.

Am I doing something wrong? Why does the memory usage go so high?

#! /usr/bin/python
# -*- coding: utf-8 -*-
import argparse
import codecs
from xml.etree.ElementTree import iterparse, tostring
from sys import getsizeof

def startNewOutfile(infile, i, roottxt, headertxt):
    # Open the next numbered output file and write the XML declaration,
    # DOCTYPE, and the <tmx> and <header> markup captured from the input.
    out = open(infile.replace('.tmx', '.' + str(i) + '.tmx'), 'w')
    print >>out, '<?xml version="1.0" encoding="UTF-8"?>'
    print >>out, '<!DOCTYPE tmx SYSTEM "tmx14.dtd">'
    print >>out, roottxt
    print >>out, headertxt
    print >>out, '<body>'
    return out

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('-m', '--maxsize', dest='maxsize', required=True, type=float, help='max size (in MB) of output files')
    parser.add_argument(dest='infile', help='.tmx file to be split')
    args = parser.parse_args()

    maxsize = args.maxsize * 1024 * 1024

    nodes = iter(iterparse(args.infile, events=['start','end']))

    _, root = next(nodes)
    _, header = next(nodes)

    roottxt = tostring(root).strip()
    headertxt = tostring(header).strip()

    i = 1
    curr_size = getsizeof(roottxt) + getsizeof(headertxt)
    out = startNewOutfile(args.infile, i, roottxt, headertxt)

    for event, node in nodes:
        if event == 'end' and node.tag == 'tu':
            nodetxt = tostring(node, encoding='utf-8').strip()
            curr_size += getsizeof(nodetxt)
            print >>out, nodetxt
        if curr_size > maxsize:
            curr_size = getsizeof(roottxt) + getsizeof(headertxt)
            print >>out, '</body>'
            print >>out, '</tmx>'
            out.close()
            i += 1
            out = startNewOutfile(args.infile, i, roottxt, headertxt)
        root.clear()

    print >>out, '</body>'
    print >>out, '</tmx>'
    out.close()
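The script takes the maximum output size in MB and the input file on the command line. A hypothetical invocation (the script and file names here are illustrative, not from the original post):

python splittmx.py -m 100 corpus.tmx

would split corpus.tmx into corpus.1.tmx, corpus.2.tmx, … of roughly 100 MB each.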

Found the answer in a related question: Why is elementtree.ElementTree.iterparse using so much memory?

One needs not only root.clear() but also node.clear() at each iteration of the for loop. iterparse still builds the tree as it parses, so each fully processed tu element has to be cleared explicitly or its children accumulate in memory. Because we're processing both start and end events, though, we need to be careful not to clear tu nodes too soon; a node is only safe to clear after its end event has fired:

for e, node in nodes:
    if e == 'end' and node.tag == 'tu':
        nodetxt = tostring(node, encoding='utf-8').strip()
        curr_size += getsizeof(nodetxt)
        print >>out, nodetxt
        node.clear()  # free this <tu> now that it has been written out
    if curr_size > maxsize:
        curr_size = getsizeof(roottxt) + getsizeof(headertxt)
        print >>out, '</body>'
        print >>out, '</tmx>'
        out.close()
        i += 1
        out = startNewOutfile(args.infile, i, roottxt, headertxt)
    root.clear()  # detach already-processed children from the root
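For reference, here is a minimal sketch of the same clear-as-you-go pattern in Python 3, where print(..., file=out) replaces the Python 2 print >>out syntax used above. The function name, input file name, and the tu tag are illustrative assumptions, not part of the original script:

#!/usr/bin/env python3
# A minimal sketch (Python 3) of the clear-as-you-go iterparse pattern.
from xml.etree.ElementTree import iterparse, tostring

def stream_tus(path):
    # Yield each serialized tu element while keeping memory use flat.
    nodes = iterparse(path, events=('start', 'end'))
    _, root = next(nodes)  # the first 'start' event delivers the root element
    for event, node in nodes:
        if event == 'end' and node.tag == 'tu':
            yield tostring(node, encoding='unicode').strip()
            node.clear()  # free this element's text and children
        root.clear()      # detach finished children from the root

if __name__ == '__main__':
    for tu in stream_tus('corpus.tmx'):  # hypothetical input file
        print(tu)

Isolating the parse-and-clear details in a generator keeps the memory management in one place; the original script's size bookkeeping and output-file rotation would then sit in the caller.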
