Process a lot of data without waiting for a chunk to finish

I am confused with map, imap, apply_async, apply, Process, etc. from the multiprocessing Python package.

What I would like to do:

I have 100 simulation script files that need to be run through a simulation program. I would like Python to run as many as it can in parallel and then, as soon as one is finished, grab a new script and run it. I don't want any waiting.

Here is some demo code:

import multiprocessing as mp
import time

def run_sim(x):
    # run one simulation
    print("Running Sim: ", x)

    # artificially wait 5 s to stand in for the real simulation
    time.sleep(5)

    return x

def main():
    # x => my simulation files
    x = list(range(100))
    # run parallel processes
    pool = mp.Pool(mp.cpu_count() - 1)
    # get results
    result = pool.map(run_sim, x)

    print("Results: ", result)

if __name__ == "__main__":
    main()

However, I don't think that map is the correct way here, since I want the PC not to wait for the batch to finish but to proceed immediately to the next simulation file.

The code will run mp.cpu_count()-1 simulations at the same time and then wait for every one of them to be finished before starting a new batch of size mp.cpu_count()-1. I don't want the code to wait; I just want it to grab a new simulation file as soon as possible.
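One alternative that hands out exactly one task at a time and yields results in whatever order the simulations finish is Pool.imap_unordered with chunksize=1. The following is only a minimal sketch of that idea, not the original code; time.sleep stands in for the real simulation call:

import multiprocessing as mp
import time

def run_sim(x):
    # stand-in for launching one simulation script
    time.sleep(5)
    return x

def main():
    x = list(range(100))
    with mp.Pool(mp.cpu_count() - 1) as pool:
        # chunksize=1: each worker fetches the next simulation file
        # as soon as it becomes free; results arrive in completion order
        for finished in pool.imap_unordered(run_sim, x, chunksize=1):
            print("Finished Sim:", finished)

if __name__ == "__main__":
    main()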


Do you have any advice on how to code it better?

Some clarifications:

I am reducing the pool to one fewer than the CPU count because I don't want to tie up the whole PC; I still need to do light work while the code is running.

It works correctly using map. The trouble is simply that you put every worker to sleep for 5 seconds, so they all finish at the same time.

Try this code to see the effect clearly:

import multiprocessing as mp
import time
import random

def run_sim(x):
    # run one simulation; sleep a random time to mimic varying run lengths
    t = random.randint(3, 10)
    print("Running Sim: ", x, " - sleep ", t)
    time.sleep(t)

    return x

def main():
    # x => my simulation files
    x = list(range(100))
    # run parallel processes
    pool = mp.Pool(mp.cpu_count() - 1)
    # get results
    result = pool.map(run_sim, x)

    print("Results: ", result)

if __name__ == "__main__":
    main()
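To verify that workers really do grab a new simulation as soon as they free up, rather than pausing at a batch boundary, one option (a sketch built on the answer's code, not part of the original) is to log the worker process name and a timestamp inside run_sim and pass chunksize=1 so tasks are dispatched individually:

import multiprocessing as mp
import time
import random

def run_sim(x):
    t = random.randint(3, 10)
    worker = mp.current_process().name
    print(time.strftime("%H:%M:%S"), worker, "starts sim", x, "- sleep", t)
    time.sleep(t)
    print(time.strftime("%H:%M:%S"), worker, "finished sim", x)
    return x

def main():
    x = list(range(100))
    with mp.Pool(mp.cpu_count() - 1) as pool:
        # chunksize=1 hands each simulation out individually, so the start
        # times show every worker grabbing new work without a batch barrier
        results = pool.map(run_sim, x, chunksize=1)
    print("Results:", results)

if __name__ == "__main__":
    main()

In the resulting log the start timestamps are staggered: as soon as one worker finishes a simulation it starts the next one, with no barrier of mp.cpu_count()-1 jobs in between.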
