
Multiprocessing passing an array of dicts through shared memory

The following code works, but it is very slow due to passing the large data sets. 以下代码有效,但由于传递了大型数据集,因此速度非常慢。 In the actual implementation, the speed it takes to create the process and send the data is almost the same as calculation time, so by the time the second process is created, the first process is almost finished with the calculation, making parallezation? 在实际实现中,创建流程和发送数据所需的速度与计算时间几乎相同,因此在创建第二个流程时,第一个流程几乎完成了计算,并行化了? pointless. 无意义。

The code is the same as in this question, Multiprocessing has cutoff at 992 integers being joined as result, with the suggested change working and implemented below. However, I ran into what I assume is the same common problem others have hit: pickling large data takes a long time.
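
To get a feel for how much of that overhead is serialization alone, a rough timing sketch like the one below can help; the item count and dict size here are assumptions based on the figures mentioned further down.

import pickle
import time

#Rough sketch: measure the pickling cost of a list of dicts similar in shape
#to the real data (~3000 items, each a dict of 200 key/value pairs; assumed sizes)
payload = [dict((str(i), i) for i in range(200)) for _ in range(3000)]
t0 = time.time()
blob = pickle.dumps(payload, pickle.HIGHEST_PROTOCOL)
print 'pickled %d bytes in %.3f seconds' % (len(blob), time.time() - t0)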

I have seen answers that use multiprocessing.Array to pass a shared-memory array. I have an array of ~4000 indexes, but each index holds a dictionary with 200 key/value pairs. The data is only read by each process; some calculation is done, and then a matrix (4000x3, with no dicts) is returned.
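
For reference, the shared-memory type those answers use is multiprocessing.Array, which only holds flat ctypes data (doubles, ints, and so on), not dicts, which is why it does not map directly onto my structure. A minimal sketch, with the size taken from the 4000x3 result matrix above (the function and variable names are just for illustration):

from multiprocessing import Array, Process

def fill(arr, start, end):
    #Each process writes its own slice of the flat shared buffer
    for i in range(start, end):
        arr[i] = i * 0.5

if __name__ == '__main__':
    #A 4000x3 result matrix flattened into one shared array of doubles
    shared = Array('d', 4000*3)
    p = Process(target=fill, args=(shared, 0, 4000*3))
    p.start()
    p.join()
    print shared[:5]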

Answers like this one, Is shared readonly data copied to different processes for Python multiprocessing?, use map. Is it possible to keep the system below and implement shared memory? Is there an efficient way to send an array of dicts to each process, such as wrapping each dict in some manager and then putting that inside a multiprocessing.Array?

import multiprocessing

def main():
    data = {}
    total = []
    for j in range(0,3000):
        total.append(data)
        for i in range(0,200):
            data[str(i)] = i

    CalcManager(total,start=0,end=3000)

def CalcManager(myData,start,end):
    print 'in calc manager'
    #Multi processing
    #Set the number of processes to use.  
    nprocs = 3
    #Initialize the multiprocessing queue so we can get the values returned to us
    tasks = multiprocessing.JoinableQueue()
    result_q = multiprocessing.Queue()
    #Setup an empty array to store our processes
    procs = []
    #Divide up the data for the set number of processes 
    interval = (end-start)/nprocs 
    new_start = start
    #Create all the processes while dividing the work appropriately
    for i in range(nprocs):
        print 'starting processes'
        new_end = new_start + interval
        #Make sure we don't go past the size of the data 
        if new_end > end:
            new_end = end 
        #Generate a new process and pass it the arguments 
        data = myData[new_start:new_end]
        #Create the processes and pass the data and the result queue
        p = multiprocessing.Process(target=multiProcess,args=(data,new_start,new_end,result_q,i))
        procs.append(p)
        p.start()
        #Increment our next start to the current end 
        new_start = new_end+1
    print 'finished starting'    

    #Print out the results
    for i in range(nprocs):
        result = result_q.get()
        print result

    #Join the processes to wait for all data/processes to be finished
    for p in procs:
        p.join()

#MultiProcess Handling
def multiProcess(data,start,end,result_q,proc_num):
    print 'started process'
    results = []
    temp = []
    for i in range(0,22):
        results.append(temp)
        for j in range(0,3):
            temp.append(j)
    result_q.put(results)
    return

if __name__== '__main__':   
    main()

Solved

By just putting the list of dictionaries into a manager, the problem was solved.

manager=Manager()
d=manager.list(myData)

It seems that the manager holding the list also manages the dicts contained in that list. The startup time is a bit slow, so it seems the data is still being copied, but that happens once at the beginning; inside each process the data is then sliced.

import multiprocessing
import multiprocessing.sharedctypes as mt
from multiprocessing import Process, Lock, Manager
from ctypes import Structure, c_double

def main():
    data = {}
    total = []
    for j in range(0,3000):
        total.append(data)
        for i in range(0,100):
            data[str(i)] = i

    CalcManager(total,start=0,end=500)

def CalcManager(myData,start,end):
    print 'in calc manager'
    print type(myData[0])

    manager = Manager()
    d = manager.list(myData)

    #Multi processing
    #Set the number of processes to use.  
    nprocs = 3
    #Initialize the multiprocessing queue so we can get the values returned to us
    tasks = multiprocessing.JoinableQueue()
    result_q = multiprocessing.Queue()
    #Setup an empty array to store our processes
    procs = []
    #Divide up the data for the set number of processes 
    interval = (end-start)/nprocs 
    new_start = start
    #Create all the processes while dividing the work appropriately
    for i in range(nprocs):
        new_end = new_start + interval
        #Make sure we don't go past the size of the data 
        if new_end > end:
            new_end = end 
        #Generate a new process and pass it the arguments 
        data = myData[new_start:new_end]
        #Create the processes and pass the data and the result queue
        p = multiprocessing.Process(target=multiProcess,args=(d,new_start,new_end,result_q,i))
        procs.append(p)
        p.start()
        #Increment our next start to the current end 
        new_start = new_end+1
    print 'finished starting'    

    #Print out the results
    for i in range(nprocs):
        result = result_q.get()
        print len(result)

    #Join the processes to wait for all data/processes to be finished
    for p in procs:
        p.join()

#MultiProcess Handling
def multiProcess(data,start,end,result_q,proc_num):
    #print 'started process'
    results = []
    temp = []
    data = data[start:end]
    for i in range(0,22):
        results.append(temp)
        for j in range(0,3):
            temp.append(j)
    print len(data)        
    result_q.put(results)
    return

if __name__ == '__main__':
    main()

Looking at your question, I assume the following:

  • For each item in myData, you want to return an output (a matrix of some sort)
  • You created a JoinableQueue (tasks), probably for holding the input, but are not sure how to use it

The Code

import logging
import multiprocessing


def create_logger(logger_name):
    ''' Create a logger that log to the console '''
    logger = logging.getLogger(logger_name)
    logger.setLevel(logging.DEBUG)

    # create console handler and set appropriate level
    ch = logging.StreamHandler()
    formatter = logging.Formatter("%(processName)s %(funcName)s() %(levelname)s: %(message)s")
    ch.setFormatter(formatter)
    logger.addHandler(ch)
    return logger

def main():
    global logger
    logger = create_logger(__name__)
    logger.info('Main started')
    data = []
    for i in range(0,100):
        data.append({str(i):i})

    CalcManager(data,start=0,end=50)
    logger.info('Main ended')

def CalcManager(myData,start,end):
    logger.info('CalcManager started')
    #Initialize the multiprocessing queue so we can get the values returned to us
    tasks = multiprocessing.JoinableQueue()
    results = multiprocessing.Queue()

    # Add tasks
    for i in range(start, end):
        tasks.put(myData[i])

    # Create processes to do work
    nprocs = 3
    for i in range(nprocs):
        logger.info('starting processes')
        p = multiprocessing.Process(target=worker,args=(tasks,results))
        p.daemon = True
        p.start()

    # Wait for tasks completion, i.e. tasks queue is empty
    try:
        tasks.join()
    except KeyboardInterrupt:
        logger.info('Cancel tasks')

    # Print out the results
    print 'RESULTS'
    while not results.empty():
        result = results.get()
        print result

    logger.info('CalManager ended')

def worker(tasks, results):
    while True:
        try:
            task = tasks.get()  # one row of input
            task['done'] = True # simulate work being done
            results.put(task)   # Save the result to the output queue
        finally:
            # JoinableQueue: for every get(), we need a task_done()
            tasks.task_done()


if __name__== '__main__':   
    main()

Discussion

  • For multi-process situations, I recommend using the logging module, as it offers a few advantages:
    • It is thread- and process-safe, meaning you won't have situations where the output of different processes mingles together
    • You can configure logging to show the process name and function name--very handy for debugging
  • CalcManager is essentially a task manager which does the following
    1. Creates three processes
    2. Populates the input queue, tasks
    3. Waits for task completion
    4. Prints out the results
  • Note that when creating processes, I mark them as daemon, meaning they will be killed when the main program exits. You don't have to worry about killing them
  • worker is where the work is done
    • Each of them runs forever (while True loop)
    • Each time through the loop, it will get one unit of input, do some processing, then put the result in the output queue
    • After a task is done, it calls task_done() so that the main process knows when all jobs are done. I put task_done in the finally clause to ensure it will run even if an error occurred during processing

You may see some improvement by using a multiprocessing.Manager to store your list in a manager server, and having each child process access items from the dicts by pulling them from that one shared list, rather than copying slices to each child process:

def CalcManager(myData,start,end):
    print 'in calc manager'
    print type(myData[0])

    manager = Manager()
    d = manager.list(myData)

    nprocs = 3 
    result_q = multiprocessing.Queue()
    procs = []

    interval = (end-start)/nprocs 
    new_start = start

    for i in range(nprocs):
        new_end = new_start + interval
        if new_end > end:
            new_end = end 
        p = multiprocessing.Process(target=multiProcess,
                                    args=(d, new_start, new_end, result_q, i))
        procs.append(p)
        p.start()
        #Increment our next start to the current end 
        new_start = new_end+1
    print 'finished starting'        

    for i in range(nprocs):
        result = result_q.get()
        print len(result)

    #Join the processes to wait for all data/processes to be finished
    for p in procs:
        p.join()

This copies your entire data list to a Manager process prior to creating any of your workers. The Manager returns a Proxy object that allows shared access to the list. You then just pass the Proxy to the workers, which means their startup time will be greatly reduced, since there's no longer any need to copy slices of the data list. The downside here is that accessing the list will be slower in the children, since the access needs to go to the manager process via IPC. Whether or not this will really help performance depends heavily on exactly what work you're doing on the list in your worker processes, but it's worth a try, since it requires very few code changes.
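
To keep the per-item IPC cost down in the children, a worker can pull its slice out of the proxy once and then operate on that local copy, as the solved code above already does. A minimal sketch of that worker side (the signature mirrors the question's multiProcess; the per-item calculation is just a placeholder):

def multiProcess(d, start, end, result_q, proc_num):
    #One IPC round-trip copies this worker's slice out of the manager proxy;
    #after this, local is an ordinary list of dicts owned by this process
    local = d[start:end]
    results = []
    for item in local:
        #Placeholder for the real calculation on each dict; returns a 3-value row
        results.append([len(item), proc_num, 0])
    result_q.put(results)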
