
How do I free up the memory used by an lxml.etree?

I'm loading data from a bunch of XML files with lxml.etree, but I'd like to close them once I'm done with this initial parsing. Currently the XML_FILES list in the code below takes up 350 MiB of the program's 400 MiB of used memory. I've tried del XML_FILES, del XML_FILES[:], XML_FILES = None, for etree in XML_FILES: etree = None, and a few more, but none of these seem to work. I also can't find anything in the lxml docs about closing an lxml file. Here's the code that does the parsing:

def open_xml_files():
    return [etree.parse(filename) for filename in paths]

def load_location_data(xml_files):
    location_data = {'city': {}}

    for xml_file in xml_files:
        for city in xml_file.findall('City'):
            code = city.findtext('CityCode')
            name = city.findtext('CityName')
            location_data['city'][code] = name

        # [A few more like the one above]    

    return location_data

XML_FILES = utils.open_xml_files()
LOCATION_DATA = load_location_data(XML_FILES)
# XML_FILES never used again from this point on

Now, how do I get rid of XML_FILES here?

You might consider etree.iterparse, which uses a generator rather than an in-memory list. Combined with a generator expression, this might save your program some memory.

def open_xml_files():
    return (etree.iterparse(filename) for filename in paths)

iterparse creates a generator over the parsed contents of the file, while parse immediately parses the file and loads the contents into memory. The difference in memory usage comes from the fact that iterparse doesn't actually do anything until its next() method is called (in this case, implicitly via a for loop).

EDIT: Apparently iterparse does work incrementally, but doesn't free memory as it parses. You could use the solution from this answer to free memory as you traverse the XML document.
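The clearing idiom that answer describes can be sketched as follows. This sketch uses the standard library's xml.etree.ElementTree.iterparse for illustration; lxml's iterparse has the same interface, and the same elem.clear() idiom applies (with lxml you may also want to delete already-processed preceding siblings, which this minimal version omits):

```python
import io
import xml.etree.ElementTree as ET

def iter_cities(source):
    """Yield (code, name) pairs from <City> elements, clearing each
    element after use so its subtree can be garbage-collected instead
    of accumulating in memory."""
    for event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "City":
            yield elem.findtext("CityCode"), elem.findtext("CityName")
            elem.clear()  # drop the parsed subtree we no longer need

# Small in-memory document standing in for one of the XML files:
xml = b"<Root><City><CityCode>NYC</CityCode><CityName>New York</CityName></City></Root>"
print(list(iter_cities(io.BytesIO(xml))))
```

Because the elements are cleared as soon as they are consumed, only one City subtree is held in memory at a time, rather than the whole document.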

Given that the memory usage does not double the second time the file is parsed, provided the structure has been deleted in between the parses (see comments), here's what's happening:

  • LXML wants memory, so it calls malloc.
  • malloc wants memory, so it requests it from the OS.
  • del deletes the structure as far as Python and LXML are concerned. However, malloc's counterpart free does not actually give the memory back to the OS. Instead, it holds on to it to serve future requests.
  • The next time LXML requests memory, malloc serves it from the same region(s) that it got from the OS previously.

This is quite typical behavior for malloc implementations. memory_profiler only checks the process's total memory, including the parts reserved for reuse by malloc. With applications using big, contiguous chunks of memory (e.g. big NumPy arrays), that's fine, because those are actually returned to the OS.(*) But for libraries like LXML that make lots of smaller allocations, memory_profiler will give an upper bound, not an exact figure.

(*) At least on Linux with glibc. I'm not sure what macOS and Windows do.
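On Linux with glibc you can also ask the allocator explicitly to hand its free arenas back to the OS via malloc_trim. A minimal sketch, assuming glibc (so Linux only; on other platforms the helper simply reports failure):

```python
import ctypes
import ctypes.util

def trim_heap():
    """Ask glibc's allocator to release free heap pages back to the OS.

    malloc_trim(0) returns 1 if any memory was released and 0 otherwise;
    on platforms without glibc this helper just returns False.
    """
    libc_name = ctypes.util.find_library("c")
    if libc_name is None:
        return False
    libc = ctypes.CDLL(libc_name)
    try:
        return bool(libc.malloc_trim(0))
    except AttributeError:  # symbol missing: not glibc
        return False

print(trim_heap())
```

Calling this after deleting the parsed trees can make the drop in memory usage visible to tools like memory_profiler, since the pages are actually returned rather than merely marked free inside the allocator.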

How about running the memory-consuming code as a separate process and leaving the task of releasing the memory to the operating system? In your case, this should do the job:

from multiprocessing import Process, Queue

def get_location_data(q):
    XML_FILES = utils.open_xml_files()
    q.put(load_location_data(XML_FILES))

q = Queue()
p = Process(target=get_location_data, args=(q,))
p.start()
result = q.get() # your location data
if p.is_alive():
    p.terminate()

The other solutions I found were very inefficient, but this worked for me:

def destroy_tree(tree):
    root = tree.getroot()

    # Map each node to [depth, parent] so that children can be
    # detached before their ancestors.
    node_tracker = {root: [0, None]}

    for node in root.iterdescendants():
        parent = node.getparent()
        node_tracker[node] = [node_tracker[parent][0] + 1, parent]

    # Sort deepest-first so leaves are removed before their parents.
    node_tracker = sorted([(depth, parent, child) for child, (depth, parent)
                           in node_tracker.items()], key=lambda x: x[0], reverse=True)

    for _, parent, child in node_tracker:
        if parent is None:
            break
        parent.remove(child)

    del tree
