
ThreadPool not releasing memory?

When using Python's ThreadPool to parallelize a CPU-intensive task, it seems like memory used by the workers accumulates and is not released. I've tried to simplify the problem:

import numpy as np
from multiprocessing.pool import ThreadPool

def worker(x):
    # Bloat the memory footprint of this function
    a = x ** x
    b = a + x
    c = x / b
    return hash(c.tobytes())

tasks = (np.random.rand(1000, 1000) for _ in range(500))

with ThreadPool(4) as pool:
    for result in pool.imap(worker, tasks):
        assert result is not None

When running this snippet, one can easily observe a huge jump in Python's memory footprint. However, I would have expected it to behave nearly the same as

for task in tasks:
    assert worker(task) is not None

whose memory cost is negligible.

How should I modify the snippet to apply the worker function to each array using a ThreadPool without this memory growth?

It turns out the explanation is quite simple. Modifying the example so that the random array is created only inside the worker solves the problem:

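def worker(x):
    # x is now a zero-argument callable; build the array inside the worker thread
    x = x()
    # Bloat the memory footprint of this function
    a = x ** x
    b = a + x
    c = x / b
    return hash(c.tobytes())

# Each task is a cheap lambda, not an already allocated 1000 x 1000 array
tasks = (lambda: np.random.rand(1000, 1000) for _ in range(500))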
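
If wrapping every task in a callable is inconvenient, another option is to feed the pool in small batches so that only a few inputs exist at any time. The following is a minimal sketch, not part of the original answer; the helper name imap_in_batches and its batch_size parameter are hypothetical.

```python
from itertools import islice
from multiprocessing.pool import ThreadPool


def imap_in_batches(pool, func, iterable, batch_size):
    """Yield func(item) for each item, in order, while pulling at most
    batch_size items from the iterable at a time (hypothetical helper)."""
    it = iter(iterable)
    while True:
        # Materialize only the next batch_size inputs.
        batch = list(islice(it, batch_size))
        if not batch:
            return
        # pool.map blocks until the whole batch is done, then returns a list.
        yield from pool.map(func, batch)
```

With the question's snippet this could be used as `for result in imap_in_batches(pool, worker, tasks, batch_size=8)`, leaving worker and tasks unchanged. The trade-off is that each batch waits for its slowest task before the next batch starts.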

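It seems that ThreadPool.imap consumes the tasks generator eagerly: a background thread drains it into an internal queue as fast as it can, regardless of how quickly the workers finish. With the original snippet this means that most or all of the 500 random arrays (about 8 MB each, roughly 4 GB in total) can be held in memory at once. With the lambda version only small callables are queued, and each array exists only while its worker is running.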
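
The eager consumption can be checked with a small experiment. This is a sketch under the assumption that the behavior matches current CPython; the function name items_consumed_before_first_result and the sleep-based worker are made up for illustration.

```python
import time
from multiprocessing.pool import ThreadPool


def items_consumed_before_first_result(n_tasks=20, n_threads=2):
    """Return how many tasks imap pulled from the generator by the time
    the first result became available (illustrative helper)."""
    produced = 0  # number of items the generator has handed out so far

    def tasks():
        nonlocal produced
        for i in range(n_tasks):
            produced += 1
            yield i

    def slow_worker(x):
        time.sleep(0.05)  # stand-in for real work
        return x

    with ThreadPool(n_threads) as pool:
        results = pool.imap(slow_worker, tasks())
        next(results)        # wait for the first result only
        consumed = produced  # snapshot before draining the rest
        for _ in results:
            pass
    return consumed
```

If imap pulled tasks lazily, the returned count would stay close to n_threads. In practice it tends to be close to n_tasks, which is consistent with the memory jump seen in the question.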
