
How do I use multiprocessing Pool and Queue together?

I need to perform ~18000 somewhat expensive calculations on a supercomputer and I'm trying to figure out how to parallelize the code. I had it mostly working with multiprocessing.Process but it would hang at the .join() step if I did more than ~350 calculations.

One of the computer scientists managing the supercomputer recommended I use multiprocessing.Pool instead of Process.

When using Process, I would set up an output Queue and a list of processes, then run and join the processes like this:

output = mp.Queue()
processes = [mp.Process(target=some_function,args=(x,output)) for x in some_array]
for p in processes:
    p.start()
for p in processes:
    p.join()

Because processes is a list, it is iterable, and I can use output.get() inside a list comprehension to get all the results:

result = [output.get() for p in processes]
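A note on the hang described above: the multiprocessing documentation warns that joining a process that has put items on a Queue can deadlock unless those items are removed first, because the child waits to flush its buffered data into the underlying pipe. That may or may not be the cause here, but it is cheap to rule out by draining the queue before joining. The sketch below assumes a hypothetical some_function that puts exactly one item on the queue per call.

```python
import multiprocessing as mp


def some_function(x, output):
    """Hypothetical worker: puts exactly one result on the shared queue."""
    output.put((x, x * x))


def run_processes(values):
    output = mp.Queue()
    processes = [mp.Process(target=some_function, args=(x, output))
                 for x in values]
    for p in processes:
        p.start()
    # Drain the queue BEFORE joining: one get() per started process.
    results = [output.get() for _ in processes]
    # The children have handed off their data, so join() can return.
    for p in processes:
        p.join()
    return results


if __name__ == '__main__':
    # Results arrive in completion order, so sort them for display.
    print(sorted(run_processes(range(10))))
```

This still starts one process per input, so for thousands of inputs a Pool with a fixed number of workers is usually the better fit.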

What is the equivalent of this when using a Pool? If the Pool is not iterable, how can I get the output of each process that is inside it?

Here is my attempt with dummy data and a dummy calculation:

import pandas as pd
import multiprocessing as mp

##dummy function
def predict(row,output):
    calc = [len(row.c1)**2,len(row.c2)**2]
    output.put([row.c1+' - '+row.c2,sum(calc)])

#dummy data
c = pd.DataFrame(data=[['a','bb'],['ccc','dddd'],['ee','fff'],['gg','hhhh'],['i','jjj']],columns=['c1','c2'])

if __name__ == '__main__':
    #output queue
    print('initializing output container...')
    output = mp.Manager().Queue()


    #pool of processes
    print('initializing and storing calculations...')
    pool = mp.Pool(processes=5)
    for i,row in c.iterrows(): #try some smaller subsets here
        pool.apply_async(predict,args=(row,output))

    #run processes and keep a counter-->I'm not sure what replaces this with Pool!
    #for p in processes:
    #    p.start()

    ##exit completed processes-->or this!
    #for p in processes:
    #    p.join()

    #pool.close() #is this right?
    #pool.join() #this?

#store each calculation
print('storing output of calculations...')
p = pd.DataFrame([output.get() for p in pool]) ## <-- this is where the code breaks because pool is not iterable
print(p)

The output I get is:

initializing output container...
initializing and storing calculations...
storing output of calculations...
Traceback (most recent call last):
  File "parallel_test.py", line 37, in <module>
    p = pd.DataFrame([output.get() for p in pool]) ## <-- this is where the code breaks because pool is not iterable
TypeError: 'Pool' object is not iterable

What I want is for p to print and look like:

            0   1
0      a - bb   5
1  ccc - dddd  25
2    ee - fff  13
3   gg - hhhh  20
4     i - jjj  10

How do I get the output from each calculation instead of just the first one?

All of your results are stored in the queue output, so you need to call output.get() once for every item that was put on it (one per row, i.e. len(c) in your case). For me it works if you change these lines:

print('storing output of calculations...')
p = pd.DataFrame([output.get() for p in pool]) ## <-- this is where the code breaks because pool is not iterable

to:

print('storing output of calculations...')
p = pd.DataFrame([output.get() for _ in range(len(c))]) ## <-- no longer breaks
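For reference, here is a minimal sketch of two ways to collect results from a Pool. It is not the asker's exact code: plain tuples stand in for the DataFrame rows so that it runs with the standard library alone, and the helper names collect_via_queue and collect_via_map are made up for illustration. The first helper keeps the Manager queue and calls pool.close() and pool.join() before draining it. The second drops the queue and lets the worker return its result, which pool.map hands back in input order.

```python
import multiprocessing as mp


def predict_to_queue(pair, output):
    """Dummy calculation that reports its result through a shared queue."""
    c1, c2 = pair
    output.put([c1 + ' - ' + c2, len(c1) ** 2 + len(c2) ** 2])


def predict(pair):
    """Same dummy calculation, but the result is simply returned."""
    c1, c2 = pair
    return [c1 + ' - ' + c2, len(c1) ** 2 + len(c2) ** 2]


def collect_via_queue(pairs, processes=5):
    """Hypothetical helper: queue-based collection (completion order)."""
    with mp.Manager() as manager:
        output = manager.Queue()
        pool = mp.Pool(processes=processes)
        pending = [pool.apply_async(predict_to_queue, args=(pair, output))
                   for pair in pairs]
        pool.close()  # no more tasks will be submitted
        pool.join()   # wait for every submitted task to finish
        for task in pending:
            task.get()  # re-raises any exception raised in a worker
        # One get() per submitted task, not per pool or per process.
        return [output.get() for _ in range(len(pairs))]


def collect_via_map(pairs, processes=5):
    """Hypothetical helper: return-value collection (input order)."""
    with mp.Pool(processes=processes) as pool:
        return pool.map(predict, pairs)


if __name__ == '__main__':
    pairs = [('a', 'bb'), ('ccc', 'dddd'), ('ee', 'fff'),
             ('gg', 'hhhh'), ('i', 'jjj')]
    print(collect_via_queue(pairs))  # order may vary between runs
    print(collect_via_map(pairs))    # same order as pairs
```

Either list can be passed straight to pd.DataFrame(...). With real DataFrame rows, something like c.itertuples(index=False) should work as the iterable for pool.map, provided predict reads the fields it needs from each row.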
