How to "poll" results from Python multiprocessing Pool.apply_async

Question

I have a task function like this:

def task(s):
    # do some work and compute res
    return res

The original program is:

res = []
for i in data:
    res.append(task(i))
    # use pickle to save res every 30 s

I need to process a lot of data, and I don't care about the order of the results. Because the run takes a long time, I need to save the current progress regularly. Now I want to change it to use multiprocessing:

from multiprocessing import Pool

pool = Pool(4)
status = []
res = []
for i in data:
    status.append(pool.apply_async(task, (i,)))

for i in status:
    res.append(i.get())
    # use pickle to save res every 30 s

Suppose I have processes p0, p1, p2, p3 in the Pool and 10 tasks (task(0) ... task(9)), and p0 takes a very long time to finish task(0).

1. Will the main process be blocked at the first "res.append(i.get())"?
2. If p1 finishes task(1) while p0 is still working on task(0), will p1 continue with task(4) or a later task?
3. If the answer to the first question is yes, how can I get the other results in advance and collect the result of task(0) last?

Answer 1

1. Yes. get() blocks until the result of that particular task is ready.
2. Yes, as tasks are submitted asynchronously. p1 (or any other free worker) will pick up the next pending task whenever more tasks have been submitted than there are worker processes.
3. "... how to get other results in advance"
One convenient option is to rely on concurrent.futures.as_completed, which yields the futures as they complete:

import time
import concurrent.futures


def func(x):
    time.sleep(3)
    return x ** 2


if __name__ == '__main__':
    data = range(1, 5)
    results = []

    with concurrent.futures.ProcessPoolExecutor(4) as ex:
        futures = [ex.submit(func, i) for i in data]
        # handle results in completion order rather than submission order
        for fut in concurrent.futures.as_completed(futures):
            res = fut.result()
            results.append(res)
            print(res)

This prints the four squares (1, 4, 9 and 16) in completion order, which may differ from the submission order.
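To tie this back to the periodic saving in the question, here is a minimal sketch that checkpoints the collected results with pickle while iterating over as_completed. The names square, collect_with_checkpoints and progress.pkl are made up for illustration. Note that a save is only triggered when a result arrives, so "every 30 s" really means "at most every 30 s".

```python
import concurrent.futures
import pickle
import time


def square(x):
    # hypothetical stand-in for the real task
    return x ** 2


def collect_with_checkpoints(futures, path, interval=30.0):
    """Gather results in completion order, pickling them to `path`
    at most once every `interval` seconds."""
    results = []
    last_save = time.monotonic()
    for fut in concurrent.futures.as_completed(futures):
        results.append(fut.result())
        if time.monotonic() - last_save >= interval:
            with open(path, "wb") as f:
                pickle.dump(results, f)
            last_save = time.monotonic()
    # final save so the file always ends up holding the complete list
    with open(path, "wb") as f:
        pickle.dump(results, f)
    return results


if __name__ == "__main__":
    with concurrent.futures.ProcessPoolExecutor(4) as ex:
        futures = [ex.submit(square, i) for i in range(10)]
        results = collect_with_checkpoints(futures, "progress.pkl")
    print(sorted(results))
```

The helper only needs objects that as_completed accepts, so it should work the same way with a ThreadPoolExecutor.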

Another option is to pass a callback to apply_async(func[, args[, kwds[, callback[, error_callback]]]]). The callback accepts a single argument, the value returned by the function, and is called as soon as that result is ready. Keep the processing in the callback minimal, since it only ever sees one result from one call at a time. The general scheme looks as follows:

from multiprocessing import Pool


def res_callback(v):
    # ... process a single result
    with open('test.txt', 'a') as f:  # just an example
        f.write(str(v))
    print(v, flush=True)


if __name__ == '__main__':
    data = range(1, 5)
    with Pool(4) as pool:
        # func is the function defined in the previous example
        tasks = [pool.apply_async(func, (i,), callback=res_callback) for i in data]
        # wait for all tasks; leaving the with block terminates the pool
        for t in tasks:
            t.wait()

Note that this scheme still has to wait for the submitted tasks (with wait() or get()) before the pool goes away, because leaving the with block terminates the pool.
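If you want to stay with apply_async and literally poll, the AsyncResult objects it returns have a ready() method that reports whether a result is available without blocking. The following is only a sketch: slow_square and poll_results are hypothetical names, and the sleep times are chosen so that task 0 is the slow one, as in the question.

```python
import time
from multiprocessing import Pool


def slow_square(x):
    # hypothetical stand-in for the real task; task 0 is deliberately the slowest
    time.sleep(1.0 if x == 0 else 0.1)
    return x ** 2


def poll_results(pending, poll_delay=0.05):
    """Yield results from AsyncResult objects as they become ready."""
    pending = list(pending)
    while pending:
        still_running = []
        for r in pending:
            if r.ready():
                # get() returns at once here; it re-raises if the task failed
                yield r.get()
            else:
                still_running.append(r)
        pending = still_running
        if pending:
            time.sleep(poll_delay)  # avoid a busy loop between polls


if __name__ == "__main__":
    with Pool(4) as pool:
        status = [pool.apply_async(slow_square, (i,)) for i in range(10)]
        res = []
        for value in poll_results(status):
            res.append(value)
            # save res with pickle here, for example every 30 s
    print(res)
```

With these timings the result of task 0 should normally arrive last, while the main process keeps collecting the other results in the meantime.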
