When using Python's `ThreadPool` to parallelize a CPU-intensive task, it seems like the memory used by the workers accumulates and is not released. I've tried to reduce the problem to a minimal example:
```python
import numpy as np
from multiprocessing.pool import ThreadPool

def worker(x):
    # Bloat the memory footprint of this function
    a = x ** x
    b = a + x
    c = x / b
    return hash(c.tobytes())

tasks = (np.random.rand(1000, 1000) for _ in range(500))
with ThreadPool(4) as pool:
    for result in pool.imap(worker, tasks):
        assert result is not None
```
When running this snippet, one can easily observe a huge jump in Python's memory footprint. However, I would have expected it to behave nearly the same as the sequential version
```python
for task in tasks:
    assert worker(task) is not None
```
whose memory cost is negligible.
How do I have to modify the snippet to apply the `worker` function to each array using a `ThreadPool` without this memory blow-up?
Turns out the explanation is quite simple. Modifying the example so that the random array is created only inside the worker solves the problem:
```python
def worker(x):
    x = x()  # call the factory to create the array here, inside the worker
    # Bloat the memory footprint of this function
    a = x ** x
    b = a + x
    c = x / b
    return hash(c.tobytes())

tasks = (lambda: np.random.rand(1000, 1000) for _ in range(500))
```
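For reference, here is a self-contained sketch of the same factory approach; it substitutes a plain `bytes` buffer for the NumPy arrays so the example runs with the standard library only (the buffer size and task count are arbitrary placeholders):

```python
from multiprocessing.pool import ThreadPool

def worker(make_x):
    # The large object is created here, inside the worker,
    # instead of being pre-built by the generator
    x = make_x()
    return hash(x)

# A generator of factories; bytes(100_000) is a stdlib stand-in
# for np.random.rand(1000, 1000) from the question
tasks = (lambda: bytes(100_000) for _ in range(50))

with ThreadPool(4) as pool:
    results = list(pool.imap(worker, tasks))

print(len(results))  # 50
```

At any moment, only the arrays currently being processed by the pool's workers exist in memory.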
It seems like `ThreadPool.imap` internally consumes the generator `tasks` much faster than the workers process it: a dedicated task-handler thread pulls items from the iterable and queues them for the workers. This of course requires storing all 500 random arrays in memory at once.
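This eager consumption is easy to observe with a counting generator and a deliberately slow worker (the exact count is timing-dependent, so treat the comment as an illustration rather than a guarantee):

```python
import time
from multiprocessing.pool import ThreadPool

consumed = 0

def counting_tasks(n):
    # Count how many items imap has pulled from the generator
    global consumed
    for i in range(n):
        consumed += 1
        yield i

def slow_worker(x):
    time.sleep(0.01)
    return x

with ThreadPool(2) as pool:
    results = pool.imap(slow_worker, counting_tasks(100))
    first = next(results)
    # Only one result is ready, yet the pool's task-handler thread
    # has typically pulled far more items from the generator already
    print(first, consumed)
```

In the memory problem above, each pulled item was a freshly built 1000×1000 array, which is why they piled up; with factories, each pulled item is just a cheap callable.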