Process a lot of data without waiting for a chunk to finish

I am confused with map, imap, apply_async, apply, Process, etc. from the multiprocessing Python package.

What I would like to do:

I have 100 simulation script files that need to be run through a simulation program. I would like Python to run as many as it can in parallel and then, as soon as one is finished, grab a new script and run it. I don't want any waiting.

Here is some demo code:

import multiprocessing as mp
import time

def run_sim(x):
    # run one simulation
    print("Running Sim: ", x)

    # artificially wait 5 s to stand in for the real simulation
    time.sleep(5)

    return x

def main():
    # x => my simulation files
    x = list(range(100))
    # run parallel processes
    pool = mp.Pool(mp.cpu_count() - 1)
    # get results
    result = pool.map(run_sim, x)

    print("Results: ", result)

if __name__ == "__main__":
    main()

However, I don't think that map is the correct way here, since I want the PC not to wait for the batch to finish but to proceed immediately to the next simulation file.

The code will run mp.cpu_count()-1 simulations at the same time and then wait for every one of them to be finished before starting a new batch of size mp.cpu_count()-1. I don't want the code to wait; I just want it to grab a new simulation file as soon as possible.
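One alternative that hands out exactly one task at a time and yields results in whatever order the simulations finish is Pool.imap_unordered with chunksize=1. The following is only a minimal sketch of that idea, not the original code; time.sleep stands in for the real simulation call:

import multiprocessing as mp
import time

def run_sim(x):
    # stand-in for launching one simulation script
    time.sleep(5)
    return x

def main():
    x = list(range(100))
    with mp.Pool(mp.cpu_count() - 1) as pool:
        # chunksize=1: each worker fetches the next simulation file
        # as soon as it becomes free; results arrive in completion order
        for finished in pool.imap_unordered(run_sim, x, chunksize=1):
            print("Finished Sim:", finished)

if __name__ == "__main__":
    main()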


Do you have any advice on how to code it better?

Some clarifications:

I am reducing the pool to one fewer than the CPU count because I don't want to tie up the whole PC; I still need to do light work while the code is running.

It works correctly using map. The trouble is simply that you put every worker to sleep for 5 seconds, so they all finish at the same time.

Try this code to see the effect clearly:

import multiprocessing as mp
import time
import random

def run_sim(x):
    # run one simulation; sleep a random time to mimic varying run lengths
    t = random.randint(3, 10)
    print("Running Sim: ", x, " - sleep ", t)
    time.sleep(t)

    return x

def main():
    # x => my simulation files
    x = list(range(100))
    # run parallel processes
    pool = mp.Pool(mp.cpu_count() - 1)
    # get results
    result = pool.map(run_sim, x)

    print("Results: ", result)

if __name__ == "__main__":
    main()
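To verify that workers really do grab a new simulation as soon as they free up, rather than pausing at a batch boundary, one option (a sketch built on the answer's code, not part of the original) is to log the worker process name and a timestamp inside run_sim and pass chunksize=1 so tasks are dispatched individually:

import multiprocessing as mp
import time
import random

def run_sim(x):
    t = random.randint(3, 10)
    worker = mp.current_process().name
    print(time.strftime("%H:%M:%S"), worker, "starts sim", x, "- sleep", t)
    time.sleep(t)
    print(time.strftime("%H:%M:%S"), worker, "finished sim", x)
    return x

def main():
    x = list(range(100))
    with mp.Pool(mp.cpu_count() - 1) as pool:
        # chunksize=1 hands each simulation out individually, so the start
        # times show every worker grabbing new work without a batch barrier
        results = pool.map(run_sim, x, chunksize=1)
    print("Results:", results)

if __name__ == "__main__":
    main()

In the resulting log the start timestamps are staggered: as soon as one worker finishes a simulation it starts the next one, with no barrier of mp.cpu_count()-1 jobs in between.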
