
Periodically restart Python multiprocessing pool

I have a Python multiprocessing pool doing a very long job that, even after thorough debugging, is not robust enough to avoid failing every 24 hours or so, because it depends on many third-party, non-Python tools with complex interactions. Also, the underlying machine has certain problems that I cannot control. Note that by failing I don't mean the whole program crashing, but some or most of the processes becoming idle because of some error, with the app itself either hanging or continuing the job with just the processes that haven't failed.

My solution right now is to periodically kill the job, manually, and then just restart from where it was.

Even if it's not ideal, what I want to do now is the following: restart the multiprocessing pool periodically and programmatically, from the Python code itself. I don't really care if this implies killing the pool workers in the middle of their job. What would be the best way to do that?

My code looks like:

with Pool() as p:
    for _ in p.imap_unordered(function, data):
        save_checkpoint()
        log()

What I have in mind would be something like:

start = 0
end = 1000  # magic number
while start + 1 < len(data):
    current_data = data[start:end]
    with Pool() as p:
        for _ in p.imap_unordered(function, current_data):
            save_checkpoint()
            log()
            start += 1
            end += 1

Or:

start = 0
end = 1000  # magic number
while start + 1 < len(data):
    current_data = data[start:end]
    start_timeout(time=TIMEOUT) # which would be the best way to do that without breaking multiprocessing?
    try:
        with Pool() as p:
            for _ in p.imap_unordered(function, current_data):
                save_checkpoint()
                log()
                start += 1
                end += 1
    except Timeout:
        pass
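
For what it's worth, one idea I've come across (though I'm not sure it's the right approach) is that the iterator returned by imap_unordered has a next() method that accepts a timeout and raises multiprocessing.TimeoutError when it expires. So my start_timeout placeholder could perhaps be replaced by something like this rough sketch, where TIMEOUT is just a value I'd have to tune:

from multiprocessing import Pool, TimeoutError

TIMEOUT = 60  # placeholder: the longest I'm willing to wait for any single result

with Pool() as p:
    results = p.imap_unordered(function, data)
    while True:
        try:
            # wait at most TIMEOUT seconds for the next completed item
            results.next(timeout=TIMEOUT)
            save_checkpoint()
            log()
        except StopIteration:
            break  # all items were processed
        except TimeoutError:
            break  # nothing finished in time: give up on this pool
# leaving the with-block kills the remaining workers,
# and I would restart from the last checkpoint

But I don't know whether repeatedly tearing the pool down like this is the cleanest way to do it.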
    

Or any suggestion you think would be better. Any help would be much appreciated, thanks!

The problem with your current code is that it iterates the multiprocessed results directly, and that call will block. Fortunately there's an easy solution: use apply_async, exactly as suggested in the docs. But because of how you describe the use case and the failure mode here, I've adapted it somewhat. Firstly, a mock task:

from multiprocessing import Pool, TimeoutError, cpu_count
from time import sleep
from random import randint


def log():
    print("logging is a dangerous activity: wear a hard hat.")


def work(d):
    sleep(randint(1, 100) / 100)
    print("finished working")
    if randint(1, 10) == 1:
        # roughly 1 in 10 calls hangs forever, simulating a stuck worker
        print("blocking...")
        while True:
            sleep(0.1)

    return d

This work function will fail with a probability of 0.1, blocking indefinitely. We create the tasks:

data = list(range(100))
nproc = cpu_count()

And then generate futures for all of them:

while data:
    print(f"== Processing {len(data)} items. ==")
    failed = []  # tasks that have timed out so far in this round
    with Pool(nproc) as p:
        tasks = [p.apply_async(work, (d,)) for d in data]

Then we can try to get the tasks out manually:

        for task in tasks:
            try:
                res = task.get(timeout=1)
                data.remove(res)
                log()
            except TimeoutError:
                failed.append(task)
                if len(failed) < nproc:
                    print(
                        f"{len(failed)} processes are blocked,"
                        f" but {nproc - len(failed)} remain."
                    )
                else:
                    break

The controlling timeout here is the timeout passed to .get(). It should be as long as you expect the longest task to take. Note that we detect when the whole pool is tied up and give up.

But since in the scenario you describe some tasks are going to take longer than others, we can give 'failed' tasks some time to recover. Thus every time a task fails we quickly check whether the others have in fact succeeded:

            for stalled in failed[:]:  # iterate over a copy so we can remove safely
                try:
                    res = stalled.get(timeout=0.01)
                    data.remove(res)
                    failed.remove(stalled)
                    log()
                except TimeoutError:
                    continue

Whether this is a good addition in your case depends on whether your tasks really are as flaky as I'm guessing they are.

Exiting the context manager for the pool will terminate the pool, so we don't even need to handle that ourselves. If you have significant variation you might want to increase the pool size (thus increasing the number of tasks which are allowed to stall) or allow tasks a grace period before considering them 'failed'.
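
If you do want that grace period, here is a minimal sketch of how the pieces above could be stitched together; TASK_TIMEOUT, GRACE_RETRIES and get_with_grace are placeholders of my own choosing, and work, log, data and nproc are the same as earlier:

from multiprocessing import Pool, TimeoutError, cpu_count

TASK_TIMEOUT = 1    # how long you expect the slowest healthy task to need (seconds)
GRACE_RETRIES = 3   # extra attempts a slow task gets before it counts as stuck


def get_with_grace(task):
    # Try to fetch a result, allowing a grace period of extra attempts.
    # Returns the result, or None if the task is still stuck after all attempts.
    for _ in range(1 + GRACE_RETRIES):
        try:
            return task.get(timeout=TASK_TIMEOUT)
        except TimeoutError:
            continue
    return None


while data:
    print(f"== Processing {len(data)} items. ==")
    with Pool(nproc) as p:
        tasks = [p.apply_async(work, (d,)) for d in data]
        stuck = 0
        for task in tasks:
            res = get_with_grace(task)
            if res is None:
                stuck += 1
                if stuck >= nproc:
                    break  # every worker is tied up: terminate the pool and retry
            else:
                data.remove(res)
                log()
    # leaving the with-block terminates the pool; anything still in `data`
    # is resubmitted on the next pass of the while-loop

The trade-off is that each stuck task now holds the loop up for as long as (1 + GRACE_RETRIES) * TASK_TIMEOUT seconds, so keep the grace period short if stalls are common.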
