How to use a multiprocessing Pool and a Queue together?

Question

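I need to perform ~18,000 somewhat expensive calculations on a supercomputer, and I'm trying to figure out how to parallelize the code. I had it mostly working with multiprocessing.Process, but it would hang at the .join() step if I ran more than 350 calculations.

One of the computer scientists who manages the supercomputer recommended that I use multiprocessing.Pool instead of Process.

When using Process, I would set up an output Queue and a list of processes, then run and join the processes like this: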

output = mp.Queue()
processes = [mp.Process(target=some_function,args=(x,output)) for x in some_array]
for p in processes:
    p.start()
for p in processes:
    p.join()

Because processes is a list, it is iterable, and I can use output.get() inside a list comprehension to get all the results:

result = [output.get() for p in processes]
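The multiprocessing documentation warns that a process which has put items on a Queue may not terminate until those items have been consumed, so joining before the queue is drained can deadlock. That may be what caused the hang at .join() described above. A minimal sketch of the drain-then-join order follows. It assumes a POSIX system where the "fork" start method is available, and square_into_queue and run_with_processes are hypothetical stand-ins for some_function and the surrounding script:

```python
import multiprocessing as mp


def square_into_queue(x, output):
    # Hypothetical stand-in for some_function: put one result on the queue.
    output.put((x, x * x))


def run_with_processes(values):
    # Assumption: POSIX system, so the "fork" start method is available.
    ctx = mp.get_context("fork")
    output = ctx.Queue()
    processes = [ctx.Process(target=square_into_queue, args=(x, output))
                 for x in values]
    for p in processes:
        p.start()
    # Drain the queue first: one get() per started process.
    results = [output.get() for _ in processes]
    # Joining is safe now that every queued item has been consumed.
    for p in processes:
        p.join()
    return results


if __name__ == "__main__":
    # Results arrive in completion order, so sort them for display.
    print(sorted(run_with_processes(range(5))))
```

On platforms without "fork", the default context can be used instead, provided everything that starts processes stays under the if __name__ == "__main__": guard.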

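What is the equivalent of this when using a Pool? If the Pool is not iterable, how can I get the output of each process inside it?

Here is my attempt with dummy data and a dummy calculation: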

import pandas as pd
import multiprocessing as mp

##dummy function
def predict(row,output):
    calc = [len(row.c1)**2,len(row.c2)**2]
    output.put([row.c1+' - '+row.c2,sum(calc)])

#dummy data
c = pd.DataFrame(data=[['a','bb'],['ccc','dddd'],['ee','fff'],['gg','hhhh'],['i','jjj']],columns=['c1','c2'])

if __name__ == '__main__':
    #output queue
    print('initializing output container...')
    output = mp.Manager().Queue()


    #pool of processes
    print('initializing and storing calculations...')
    pool = mp.Pool(processes=5)
    for i,row in c.iterrows(): #try some smaller subsets here
         pool.apply_async(predict,args=(row,output))

    #run processes and keep a counter-->I'm not sure what replaces this with Pool!
    #for p in processes:
    #    p.start()

    ##exit completed processes-->or this!
    #for p in processes:
    #    p.join()

    #pool.close() #is this right?
    #pool.join() #this?

#store each calculation
print('storing output of calculations...')
p = pd.DataFrame([output.get() for p in pool]) ## <-- this is where the code breaks because pool is not iterable
print(p)

The output I get is:

initializing output container...
initializing and storing calculations...
storing output of calculations...
Traceback (most recent call last):
  File "parallel_test.py", line 37, in <module>
    p = pd.DataFrame([output.get() for p in pool]) ## <-- this is where the code breaks because pool is not iterable
TypeError: 'Pool' object is not iterable

What I want is for p to print and look like:

            0   1
0      a - bb   5
1  ccc - dddd  25
2    ee - fff  13
3   gg - hhhh  20
4     i - jjj  10

How do I get the output from every calculation instead of just the first one?

Answer 1

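Even though all the useful results are stored in the queue output, you still have to fetch them by calling output.get() once for every item that was put into output (len(c) times, the number of training examples in your case). For me it works if you change the lines: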

print('storing output of calculations...')
p = pd.DataFrame([output.get() for p in pool]) ## <-- this is where the code breaks because pool is not iterable

to:

print('storing output of calculations...')
p = pd.DataFrame([output.get() for _ in range(len(c))]) ## <-- no longer breaks
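Putting the change together, a self-contained sketch of the Pool plus Manager().Queue pattern might look like the following. It is an illustration under assumptions, not the asker's exact script: plain (c1, c2) tuples replace the pandas rows so that only the standard library is needed, run_pool is a hypothetical helper name, and the "fork" start method assumes a POSIX system.

```python
import multiprocessing as mp


def predict(row, output):
    # Same dummy calculation as in the question, on a plain (c1, c2) tuple.
    c1, c2 = row
    output.put([c1 + ' - ' + c2, len(c1) ** 2 + len(c2) ** 2])


def run_pool(rows, processes=5):
    # Assumption: POSIX system, so the "fork" start method is available.
    ctx = mp.get_context("fork")
    with ctx.Manager() as manager:
        # A manager queue is a proxy object that can be passed to pool workers.
        output = manager.Queue()
        pool = ctx.Pool(processes=processes)
        pending = [pool.apply_async(predict, args=(row, output))
                   for row in rows]
        pool.close()   # no more tasks will be submitted
        pool.join()    # wait for every submitted task to finish
        for task in pending:
            task.get() # re-raises any exception raised inside a worker
        # One get() per submitted task, instead of iterating over the pool.
        return [output.get() for _ in range(len(rows))]


if __name__ == "__main__":
    data = [('a', 'bb'), ('ccc', 'dddd'), ('ee', 'fff'),
            ('gg', 'hhhh'), ('i', 'jjj')]
    for item in run_pool(data):
        print(item)
```

Items come off the queue in the order the workers finished, which is not necessarily the order of the input rows. If the queue is not needed for anything else, having predict return its value and collecting the results with pool.map would keep them in input order.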

Solution 1
0 Accepted 2018-12-15 05:24:28

如何同时使用多处理池和队列？

问题描述

1 个解决方案

解决方案1 0 已采纳 2018-12-15 05:24:28

解决方案1
0 已采纳 2018-12-15 05:24:28