How do I free up the memory used by an lxml.etree?

I'm loading data from a bunch of XML files with lxml.etree, but I'd like to close them once I'm done with this initial parsing. Currently the XML_FILES list in the code below takes up 350 MiB of the program's 400 MiB of used memory. I've tried del XML_FILES, del XML_FILES[:], XML_FILES = None, for etree in XML_FILES: etree = None, and a few more, but none of these seems to work. I also can't find anything in the lxml docs about closing an lxml file. Here's the code that does the parsing:

from lxml import etree

# `paths` is the list of XML file names, defined elsewhere in the program.
def open_xml_files():
    return [etree.parse(filename) for filename in paths]

def load_location_data(xml_files):
    # Create the nested dict up front; otherwise location_data['city'] raises KeyError.
    location_data = {'city': {}}

    for xml_file in xml_files:
        for city in xml_file.findall('City'):
            code = city.findtext('CityCode')
            name = city.findtext('CityName')
            location_data['city'][code] = name

        # [A few more like the one above]    

    return location_data

XML_FILES = utils.open_xml_files()
LOCATION_DATA = load_location_data(XML_FILES)
# XML_FILES never used again from this point on

Now, how do I get rid of XML_FILES here?

You might consider etree.iterparse, which parses incrementally and yields events as it goes instead of returning a fully built tree. Combined with a generator expression, this might save your program some memory.

def open_xml_files():
    return (etree.iterparse(filename) for filename in paths)

iterparse returns an iterator over the parsed contents of the file, while parse immediately parses the whole file and loads it into memory. The difference in memory usage comes from the fact that iterparse does no work until you start iterating over it (in this case, implicitly via a for loop). Note that each iterparse object yields (event, element) pairs, so load_location_data would have to loop over those pairs instead of calling findall on a tree.

EDIT: Apparently iterparse does work incrementally, but doesn't free memory as it parses. You could use the solution from this answer to free memory as you traverse the XML document.
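As a rough sketch of freeing memory while traversing with iterparse: the loop below handles each City element as soon as its closing tag is seen and then clears it, so the element's children and text can be released. The element names come from the question; the load_cities helper and the SAMPLE document are made up for illustration, and the fallback import is only there so the snippet also runs without lxml installed (it uses only calls that both libraries provide).

```python
import io

try:
    from lxml import etree  # the library discussed here
except ImportError:
    # Assumption: the stdlib ElementTree offers the same calls used below.
    import xml.etree.ElementTree as etree


def load_cities(source):
    """Collect CityCode -> CityName pairs without keeping full subtrees."""
    cities = {}
    # "end" events fire once an element and all of its children are parsed.
    for _event, elem in etree.iterparse(source, events=("end",)):
        if elem.tag == "City":
            cities[elem.findtext("CityCode")] = elem.findtext("CityName")
            # Drop the element's children, text and attributes.
            elem.clear()
    return cities


# Hypothetical sample data shaped like the files in the question.
SAMPLE = b"""<Locations>
  <City><CityCode>AMS</CityCode><CityName>Amsterdam</CityName></City>
  <City><CityCode>OSL</CityCode><CityName>Oslo</CityName></City>
</Locations>"""

print(load_cities(io.BytesIO(SAMPLE)))  # {'AMS': 'Amsterdam', 'OSL': 'Oslo'}
```

The cleared City elements stay attached to the root as empty shells, so for very large files you may also want to remove them from their parent, which is presumably what the linked answer does.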

Given that memory usage does not double the second time the file is parsed, provided the structure has been deleted between the two parses (see comments), here's what's happening:

  • lxml wants memory, so it calls malloc.
  • malloc wants memory, so it requests some from the OS.
  • del deletes the structure as far as Python and lxml are concerned. However, malloc's counterpart free does not actually give the memory back to the OS. Instead, it holds on to it to serve future requests.
  • The next time lxml requests memory, malloc serves it from the same region(s) that it got from the OS previously.

This is quite typical behavior for malloc implementations. memory_profiler only checks the process's total memory, including the parts reserved for reuse by malloc. With applications using big, contiguous chunks of memory (e.g. big NumPy arrays), that's fine because those are actually returned to the OS. (*) But for libraries like lxml that request lots of smaller allocations, memory_profiler will give an upper bound, not an exact figure.

(*) At least on Linux with glibc. I'm not sure what macOS and Windows do.
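If you want to check this explanation on Linux, glibc exposes malloc_trim, which asks the allocator to hand free heap pages back to the OS. The sketch below calls it through ctypes; trim_heap is a hypothetical helper name, and the call is glibc-specific, so the function returns None wherever it is not available. How much memory it actually releases depends on heap fragmentation, so treat it as an experiment rather than a guaranteed fix.

```python
import ctypes
import ctypes.util
import gc


def trim_heap():
    """Ask glibc to return free heap memory to the OS.

    Returns True if some memory was released, False if none was,
    and None when malloc_trim is unavailable (non-glibc platforms).
    """
    libc_name = ctypes.util.find_library("c")
    if libc_name is None:
        return None
    try:
        malloc_trim = ctypes.CDLL(libc_name).malloc_trim
    except (OSError, AttributeError):
        return None
    malloc_trim.argtypes = [ctypes.c_size_t]
    malloc_trim.restype = ctypes.c_int
    # pad=0: keep no extra free space at the top of the heap.
    return bool(malloc_trim(0))


# Drop your own references first (e.g. `del XML_FILES`), then:
gc.collect()         # let Python free anything still held by reference cycles
print(trim_heap())   # True/False on glibc, None elsewhere
```

If the process's memory figure drops after this call, the memory had already been freed by lxml and was only being held by malloc for reuse.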

How about running the memory-consuming code as a separate process and leaving the task of releasing the memory to the operating system? In your case this should do the job:

from multiprocessing import Process, Queue

def get_location_data(q):
    XML_FILES = utils.open_xml_files()
    q.put(load_location_data(XML_FILES))

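if __name__ == '__main__':  # needed where child processes are spawned (Windows, macOS)
    q = Queue()
    p = Process(target=get_location_data, args=(q,))
    p.start()
    result = q.get()  # your location data; fetch it before join() to avoid blocking
    p.join()          # once the child exits, the OS reclaims all of its memory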

The other solutions I found were very inefficient, but this worked for me:

def destroy_tree(tree):
    root = tree.getroot()

    node_tracker = {root: [0, None]}

    for node in root.iterdescendants():
        parent = node.getparent()
        node_tracker[node] = [node_tracker[parent][0] + 1, parent]

    node_tracker = sorted([(depth, parent, child) for child, (depth, parent)
                           in node_tracker.items()], key=lambda x: x[0], reverse=True)

    for _, parent, child in node_tracker:
        if parent is None:
            break
        parent.remove(child)

    # This only drops the local name; the caller must also drop its own references.
    del tree
