
Clearing memory between memory-intensive procedures in Python

I need to sequentially read several large text files, storing a lot of data in memory, and then use that data to write a large output file. These read/write cycles are done one at a time, and there is no common data, so I don't need any of the memory to be shared between them.

I tried putting these procedures in a single script, hoping that the garbage collector would delete the old, no-longer-needed objects when the RAM got full. However, this was not the case. Even when I explicitly deleted the objects between cycles, the script took far longer than running the procedures separately.

Specifically, the process would hang, using all available RAM but almost no CPU. It also hung when gc.collect() was called. So, I decided to split each read/write procedure into a separate script and call them from a central script using execfile(). This didn't fix anything, sadly; the memory still piled up.

I've gone with the simple, obvious solution, which is to call the subscripts from a shell script rather than using execfile(). However, I would like to know if there is a way to make this work. Any input?

Any CPython object that has no references is immediately freed. Periodically, Python does a garbage collection to take care of groups of objects that refer only to each other but are not reachable by the program (cyclic references). You can call the garbage collector manually (gc.collect()) to clear those up if it needs to be done at a particular time. This makes the memory available for reuse by your Python script but may or may not immediately (or ever) release that memory back to the operating system.
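For illustration, here is a minimal sketch of dropping references and forcing a collection between cycles; the file names and the processing steps are placeholders for whatever your script actually does:

    import gc

    def process_file(path):
        # Stand-in for the real parsing step: build a large throwaway
        # structure representing the data read from one input file.
        with open(path) as f:
            return [line.split() for line in f]

    def write_output(records, out_path):
        with open(out_path, "w") as f:
            for rec in records:
                f.write(" ".join(rec) + "\n")

    for path in ["input1.txt", "input2.txt"]:   # placeholder file names
        records = process_file(path)
        write_output(records, path + ".out")
        del records     # drop the only reference to the big structure
        gc.collect()    # clean up anything kept alive only by reference cycles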

CPython allocates memory in 256KB arenas that it divides into 4KB pools, which are further subdivided into blocks, which are designated for particular sizes of objects (these will generally be of similar type but need not be). This memory can be reused within the Python process but it doesn't get released back to the operating system until the entire arena is empty.

Now, before 2005 some commonly-used types of objects didn't use this scheme. For example, once you created an 'int' or a 'float', that memory was never returned to the OS even after it was freed by Python, though it could be reused for other objects of those types. (Of course, small ints are shared and don't take up any extra memory, but if you allocated, say, a list of large ints, or of floats, that memory would be retained by CPython even after those objects were freed.) Python also retained some memory allocated by lists and dictionaries (e.g. the most recent 80 lists).

This is all according to this document about improvements made to the Python memory allocator circa version 2.3. I understand some further work has been done since then, so some of the details may have changed (the int/float situation has been rectified, according to arbautjc's comment below), but the basic situation remains: for performance reasons, Python does not return all memory to the OS immediately, because malloc() has relatively high overhead for small allocations and gets slower the more fragmented memory is. Thus Python only malloc()s largish chunks of memory, allocates memory within those chunks itself, and only returns these chunks to the OS when they are completely empty.
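If you want to observe this yourself, here is a rough, Linux-only sketch (it reads VmRSS from /proc/self/status, which isn't portable) that allocates a large list, frees it, and prints the process's resident memory at each step. How far the last number drops back depends on your Python version and how fragmented the arenas are:

    import gc

    def rss_kib():
        # Current resident set size in KiB; Linux-only.
        with open("/proc/self/status") as f:
            for line in f:
                if line.startswith("VmRSS:"):
                    return int(line.split()[1])
        return -1

    print("before allocation:", rss_kib(), "KiB")
    data = [float(i) for i in range(5 * 10**6)]
    print("after allocation: ", rss_kib(), "KiB")
    del data
    gc.collect()
    # The memory is free for reuse inside this process, but the RSS
    # reported here may not fall all the way back to the starting value.
    print("after del + gc:   ", rss_kib(), "KiB")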

You might try alternative Python implementations, such as PyPy (which aims to be as compatible with CPython as possible), Jython (runs on JVM), or IronPython (runs on .NET CLR) to see if their memory management is more copacetic with what you're doing. If you are currently using a 32-bit Python, you could try a 64-bit one (assuming your CPU and OS support it).

However, your approach of calling your scripts sequentially from a shell script seems perfectly fine to me. You could use the subprocess module to write the master script in Python, but it's probably simpler in the shell.
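If you did want the master script in Python, a minimal sketch might look like this (the subscript names are placeholders); each child runs in its own interpreter, so the OS reclaims all of its memory when it exits:

    import subprocess
    import sys

    # Placeholder names for your per-cycle scripts.
    scripts = ["read_write_1.py", "read_write_2.py", "read_write_3.py"]

    for script in scripts:
        # Run each step in a fresh interpreter; raises if a step fails.
        subprocess.check_call([sys.executable, script])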

Without knowing more about what your script is doing, though, it's hard to guess what is causing this situation.

Usually in this kind of situation, refactoring is the only way out.

You mentioned you're storing a lot in memory, perhaps in a dict or a set, and then writing it all out to a single output file.

Maybe you can append output to the output file after processing each input, then do a quick clean-up before processing the next input file. That way, RAM usage can be reduced.

Appending can even be done line by line from input, so that no storage is needed.

I don't know the specific algorithm you're using, but since you mentioned that no data needs to be shared between files, this may help. Remember to flush the output too :P
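As a rough sketch of that idea (the transform() function and the file names are just placeholders for your per-line processing):

    def transform(line):
        return line.upper()      # placeholder per-line processing

    input_files = ["part1.txt", "part2.txt"]    # placeholder names

    with open("combined_output.txt", "w") as out:
        for path in input_files:
            with open(path) as f:
                for line in f:
                    out.write(transform(line))  # nothing accumulates in RAM
            out.flush()                         # push each file's output to disk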
