
How can I parallelize a memoized call to an external program?

I have a data-processing program written in Python, which needs to call out to an external program during one stage. Profiling shows that about 50% of the total processing time is spent in this one stage.

I have a computer with multiple cores, so parallelism seems like the solution. The problem is, the call is memoized:

import subprocess

def one_stage_of_processing(list_of_inputs, cache={}):
    outputs = []
    for input in list_of_inputs:
        outputs.append(expensive_external_processing(input, cache))
    return outputs

def expensive_external_processing(input, cache):
    # The mutable default argument keeps the cache alive across calls.
    if input not in cache:
        cache[input] = subprocess.run(...).stdout
    return cache[input]

And experience with C makes me wary of race conditions damaging the cache.

What's the best, most Pythonic way to parallelize this stage of processing? I'd like to keep the memoization in place, because removing it increases the runtime by a factor of four.

You could launch the task asynchronously and put the future into the memo. Anyone asking the memo for a result will then encounter one of three states: no memo (so launch a new expensive external processing task), an unfulfilled future (you can wait for it, or note that it's not done yet and go do something else until it is), or a fulfilled future (the result is immediately available). This way you avoid issuing several identical requests while the processing for one of them is still in flight. Note that concurrent.futures has been in the standard library since Python 3.2 (and subprocess.run, used above, since 3.5).
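A minimal sketch of that idea, using a concurrent.futures ThreadPoolExecutor; the command line "some_program" and the run_external helper are placeholders for your real subprocess.run(...) call, not part of the original code:

import subprocess
from concurrent.futures import ThreadPoolExecutor

# Threads (rather than processes) are enough here: the heavy work happens in
# the external program, so Python's GIL is not the bottleneck.
executor = ThreadPoolExecutor(max_workers=4)

def run_external(input):
    # Hypothetical command line; substitute the real subprocess.run(...) call.
    return subprocess.run(["some_program", input],
                          stdout=subprocess.PIPE).stdout

def expensive_external_processing(input, cache):
    # The memo now stores Future objects. Only the main thread touches the
    # dict, so there is no race on the cache itself, and a second request for
    # the same input reuses the pending Future instead of launching a
    # duplicate subprocess.
    if input not in cache:
        cache[input] = executor.submit(run_external, input)
    return cache[input]

def one_stage_of_processing(list_of_inputs, cache={}):
    futures = [expensive_external_processing(i, cache) for i in list_of_inputs]
    # Block only at the end, when the results are actually needed.
    return [f.result() for f in futures]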

You could also see why the task is taking so long. If the calculation is heavy, there's no way around that; but if the startup is heavy (which was very often my experience when doing things like this. In this case, it is very useful to wrap the other executable into something that has a loop and can communicate (most easily, a web service). This lets you have the true per-request cost, completely avoiding the startup cost that you get by spawning a new subprocess for each request.
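For example, assuming the external program has been wrapped in a small long-lived local web service that accepts an input and returns the processed output (the URL and query parameter below are hypothetical), each cache miss on the Python side becomes a cheap HTTP call:

import urllib.parse
import urllib.request

def expensive_external_processing(input, cache):
    # Same memoization as before, but each miss is an HTTP request to a
    # long-lived wrapper service instead of a freshly spawned subprocess,
    # so the executable's startup cost is paid only once.
    if input not in cache:
        url = ("http://localhost:8000/process?"
               + urllib.parse.urlencode({"input": input}))
        with urllib.request.urlopen(url) as response:
            cache[input] = response.read()
    return cache[input]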

You could use multiprocessing to launch the function in parallel, and then use a multiprocessing.Queue to keep the cache synchronized across the processes.
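A minimal sketch of that approach, using a multiprocessing.Pool with a Manager dict as the shared cache (a shared-dict variant of the Queue-based synchronization described above); the "some_program" command line is again a placeholder:

import subprocess
from multiprocessing import Manager, Pool

def run_external(input):
    # Hypothetical command line; substitute the real subprocess.run(...) call.
    return subprocess.run(["some_program", input],
                          stdout=subprocess.PIPE).stdout

def expensive_external_processing(input, cache):
    # cache is a Manager dict proxy, so entries written in one worker process
    # are visible to the others. Two workers may still race to compute the
    # same input; that only duplicates work, it does not corrupt the cache.
    if input not in cache:
        cache[input] = run_external(input)
    return cache[input]

def one_stage_of_processing(list_of_inputs):
    with Manager() as manager:
        cache = manager.dict()
        with Pool() as pool:
            return pool.starmap(expensive_external_processing,
                                [(i, cache) for i in list_of_inputs])

if __name__ == "__main__":
    # The guard is required on platforms that use the spawn start method.
    results = one_stage_of_processing(["a", "b", "a"])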
