
Starting a large number of dependent processes in async using python multiprocessing

Problem: I have a DAG (directed acyclic graph)-like structure for starting the execution of some massive data processing on a machine. Some of the processes can only be started after their parent's data processing has completed, because there are multiple levels of processing. As a first goal I want to use the Python multiprocessing library to handle all of it on one single machine, and later scale out to execute on different machines using Managers. I have no prior experience with Python multiprocessing. Can anyone suggest whether it is a good library to begin with? If yes, some basic implementation ideas would do just fine. If not, what else can be used to do this in Python?

Example:

A -> B

B -> D, E, F, G

C -> D

In the above example I want to kick off A & C first (in parallel); after their successful execution, the other remaining processes would just wait for B to finish first. As soon as B finishes its execution, all the other processes will start.

PS: Sorry, I cannot share the actual data because it is confidential, though I have tried to make things clear using the example.
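The level-by-level ordering the question describes can be sketched with plain `Process` objects and `join()`. This is a minimal illustration, not the answer's method: the task names match the example graph, and the `work()` body is a placeholder for the real data processing.

```python
from multiprocessing import Process

def work(name):
    # placeholder for the real per-node data processing
    print(f"running {name}")

# Level 1: A and C have no parents, so they can run in parallel.
level1 = [Process(target=work, args=(n,)) for n in ("A", "C")]
for p in level1:
    p.start()
for p in level1:
    p.join()  # block until every level-1 node has finished

# Level 2: B depends on A (and C has already finished).
b = Process(target=work, args=("B",))
b.start()
b.join()

# Level 3: D, E, F, G all depend on B, so they start together once B is done.
level3 = [Process(target=work, args=(n,)) for n in ("D", "E", "F", "G")]
for p in level3:
    p.start()
for p in level3:
    p.join()
```

The same idea generalizes to any DAG: topologically sort the nodes into levels, start each level's processes in parallel, and `join()` them all before starting the next level.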

I'm a big fan of using processes and queues for things like this.

Like so:

from multiprocessing import Process, Queue
from queue import Empty as QueueEmpty  # Python 3; on Python 2 use "from Queue import Empty"
import time

#example process functions
def processA(queueA, queueB):
    while True:
        try:
            data = queueA.get_nowait()
            if data == 'END':
                break
        except QueueEmpty:
            time.sleep(2) #wait some time for data to enter queue
            continue
        #do stuff with data
        queueB.put(data)

def processB(queueB, _):
    while True:
        try:
            data = queueB.get_nowait()
            if data == 'END':
                break
        except QueueEmpty:
            time.sleep(2) #wait some time for data to enter queue
            continue
        #do stuff with data

#helper functions for starting and stopping processes
def start_procs(num_workers, target_function, args):
    procs = []
    for _ in range(num_workers):
        p = Process(target=target_function, args=args)
        p.start()
        procs.append(p)
    return procs

def shutdown_process(proc_lst, queue):
    for _ in proc_lst:
        queue.put('END')
    for p in proc_lst:
        try:
            p.join()
        except KeyboardInterrupt:
            break

queueA = Queue(<size of queue> * 3) #needs to be a bit bigger than actual. 3x works well for me
queueB = Queue(<size of queue>)
queueC = Queue(<size of queue>)
queueD = Queue(<size of queue>)

procsA = start_procs(number_of_workers, processA, (queueA, queueB)) 
procsB = start_procs(number_of_workers, processB, (queueB, None))  

# feed some data to processA
for data in start_data:
    queueA.put(data)

#shutdown processes
shutdown_process(procsA, queueA)
shutdown_process(procsB, queueB)

#etc, etc. You could arrange the start, stop, and data feed statements to arrive at the DAG behaviour you desire
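As a concrete illustration of the pattern above, here is a hypothetical end-to-end run for the chain A -> B: stage-A workers read from `queueA` and feed `queueB`, stage-B workers drain `queueB` into a results queue, and the stages are shut down in dependency order so B sees all of A's output. The doubling and incrementing are placeholders for real processing.

```python
from multiprocessing import Process, Queue
from queue import Empty as QueueEmpty
import time

def stageA(in_q, out_q):
    while True:
        try:
            item = in_q.get_nowait()
        except QueueEmpty:
            time.sleep(0.1)  # wait some time for data to enter the queue
            continue
        if item == 'END':
            break
        out_q.put(item * 2)      # placeholder for real processing

def stageB(in_q, result_q):
    while True:
        try:
            item = in_q.get_nowait()
        except QueueEmpty:
            time.sleep(0.1)
            continue
        if item == 'END':
            break
        result_q.put(item + 1)   # placeholder for real processing

queueA, queueB, results = Queue(), Queue(), Queue()

procsA = [Process(target=stageA, args=(queueA, queueB)) for _ in range(2)]
procsB = [Process(target=stageB, args=(queueB, results)) for _ in range(2)]
for p in procsA + procsB:
    p.start()

for n in [1, 2, 3]:
    queueA.put(n)

# Shut down stage A first (one 'END' per worker), then stage B,
# so stage B is only stopped after all of A's output is queued.
for _ in procsA:
    queueA.put('END')
for p in procsA:
    p.join()
for _ in procsB:
    queueB.put('END')
for p in procsB:
    p.join()

collected = sorted(results.get() for _ in range(3))
print(collected)
```

Because the queues are FIFO, the 'END' sentinels are only seen after every real item, so no data is dropped during shutdown.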
