
Creating a Queue delay in a Python pool without blocking

I have a large program (specifically, a function) that I'm attempting to parallelize using a JoinableQueue and the multiprocessing map_async method. The function works on multidimensional arrays, so I break each array into sections and evaluate each section independently. However, I need to stitch one of the arrays back together early on, and right now the "stitch" happens before the "evaluate", so I need to introduce some kind of delay around the JoinableQueue. I've searched all over for a workable solution, but I'm very new to multiprocessing and most of what I've found goes over my head.

This phrasing may be confusing; apologies. Here's an outline of my code (I can't post all of it because it's very long, but I can provide additional detail if needed):

import numpy as np
import multiprocessing as mp
from multiprocessing import Pool, Pipe, JoinableQueue

def main_function(section_number):

    #define section sizes
    array_this_section = array[:,start:end+1,:]
    histogram_this_section = np.zeros((3, dataset_size, dataset_size))
    #start and end are defined according to the size of the array
    #dataset_size is to show that the histogram is a different size than the array

    for m in range(1,num_iterations+1):
        #do several operations- each section of the array
        #corresponds to a section on the histogram

        hist_queue.put(histogram_this_section)

        #each process sends their own part of the histogram outside of the pool
        #to be combined with every other part- later operations
        #in this function must use the full histogram

        hist_queue.join()
        full_histogram = full_hist_queue.get()
        full_hist_queue.task_done()

        #do many more operations


hist_queue = JoinableQueue()
full_hist_queue = JoinableQueue()

if __name__ == '__main__':
    pool = mp.Pool(num_sections)
    args = np.arange(num_sections)
    pool.map_async(main_function, args, chunksize=1)    

    #I need the map_async because the program is designed to display an output at the
    #end of each iteration, and each output must be a compilation of all processes

    #a few variable definitions go here

    for m in range(1,num_iterations+1):
        for i in range(num_sections):
            temp_hist = hist_queue.get()    #the code hangs here because the queue 
                                            #is attempting to get before anything 
                                            #has been put
            hist_full += temp_hist
        for i in range(num_sections):
            hist_queue.task_done()
        for i in range(num_sections):
            full_hist_queue.put(hist_full)    #the full histogram is sent back into
                                              #the pool
            full_hist_queue.join()

        #etc etc

    pool.close()
    pool.join()

I'm pretty sure that your issue is how you're creating the Queues and trying to share them with the child processes. If you just have them as global variables, they may be recreated in the child processes instead of inherited (the exact details depend on what OS and/or start-method context you're using for multiprocessing).
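
To make that concrete, here's a minimal sketch (mine, not taken from your code) of the failure mode, assuming the queue stays at module level the way it does in your snippet: under the "spawn" start method (the default on Windows and macOS) each worker re-imports the module and builds its own, unrelated queue, so nothing it puts ever reaches the parent; under "fork" the same code happens to work because the child inherits the parent's queue.

import multiprocessing as mp
from multiprocessing import JoinableQueue
from queue import Empty

# Module-level queue, mirroring the structure in the question.
# Under "spawn", each worker re-imports this module and so creates its own,
# brand-new queue that the parent process never sees.
hist_queue = JoinableQueue()

def worker(section_number):
    hist_queue.put(section_number)   # under "spawn" this lands in the worker's private copy
    return section_number

if __name__ == '__main__':
    pool = mp.Pool(2)
    pool.map(worker, range(2))
    pool.close()
    pool.join()
    try:
        print('parent received:', hist_queue.get(timeout=1))   # typically works under "fork"
    except Empty:
        print('parent queue is empty')   # what you see under "spawn"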

A better way to go about solving this issue is to avoid using multiprocessing.Pool to spawn your processes and instead explicitly create Process instances for your workers yourself. This way you can pass the Queue instances to the processes that need them without any difficulty (it's technically possible to pass the queues to the Pool workers, but it's awkward; a sketch of that route follows the example below).

I'd try something like this:

def worker_function(section_number, hist_queue, full_hist_queue): # take queues as arguments
    # ... the rest of the function can work as before
    # note, I renamed this from "main_function" since it's not running in the main process

if __name__ == '__main__':
    hist_queue = JoinableQueue()   # create the queues only in the main process
    full_hist_queue = JoinableQueue()  # the workers don't need to access them as globals

    processes = [Process(target=worker_function, args=(i, hist_queue, full_hist_queue))
                 for i in range(num_sections)]   # Process is multiprocessing.Process
    for p in processes:
        p.start()

    # ...
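
For reference, the "awkward" way to keep the Pool and still use your own queues is to hand them to each worker through the pool's initializer, because a multiprocessing queue generally can't be pickled into the ordinary task arguments. A rough sketch of that route (the name init_worker and the placeholder body of worker_function are mine, not from your code):

import multiprocessing as mp
from multiprocessing import JoinableQueue

hist_queue = None        # filled in per worker process by init_worker
full_hist_queue = None

def init_worker(hq, fhq):
    # Runs once in each worker process; stash the inherited queues as globals there.
    global hist_queue, full_hist_queue
    hist_queue = hq
    full_hist_queue = fhq

def worker_function(section_number):
    hist_queue.put(section_number)    # placeholder for the real per-section work
    return section_number

if __name__ == '__main__':
    hq = JoinableQueue()
    fhq = JoinableQueue()
    pool = mp.Pool(processes=4, initializer=init_worker, initargs=(hq, fhq))
    pool.map(worker_function, range(4))   # blocks until every section has run
    for _ in range(4):
        print(hq.get())                   # drain what the workers put
        hq.task_done()
    pool.close()
    pool.join()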

If the different stages of your worker function are more or less independent of one another (that is, the "do many more operations" step doesn't depend directly on the "do several operations" step above it, just on full_histogram), you might be able to keep the Pool and instead split up the different steps into separate functions, which the main process could call via several calls to map on the pool. You don't need to use your own Queues in this approach, just the ones built in to the Pool. This might be best, especially if the number of "sections" you're splitting the work up into doesn't correspond closely with the number of processor cores on your computer. You can let the Pool match the number of cores, and have each one work on several sections of the data in turn.

A rough sketch of this would be something like:

import itertools
import multiprocessing

import numpy as np

def worker_make_hist(section_number):
    # do several operations, get a partial histogram
    return histogram_this_section

def worker_do_more_ops(section_number, full_histogram):
    # whatever...
    return some_result

if __name__ == "__main__":
    pool = multiprocessing.Pool() # by default the size will be equal to the number of cores

    hist_full = np.zeros((3, dataset_size, dataset_size))  # initialize the accumulator (same shape as in your code)
    for temp_hist in pool.imap_unordered(worker_make_hist, range(number_of_sections)):
        hist_full += temp_hist

    some_results = pool.starmap(worker_do_more_ops, zip(range(number_of_sections),
                                                        itertools.repeat(hist_full)))
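
Since your real code loops for num_iterations and needs a combined output at the end of each pass, my guess (and this is an assumption about how your pieces fit together, with a hypothetical display_output standing in for whatever produces that output) is that you'd wrap both pool calls in that outer loop, continuing the sketch above:

if __name__ == "__main__":
    pool = multiprocessing.Pool()

    for m in range(1, num_iterations + 1):
        hist_full = np.zeros((3, dataset_size, dataset_size))   # fresh accumulator each iteration
        for temp_hist in pool.imap_unordered(worker_make_hist, range(number_of_sections)):
            hist_full += temp_hist

        some_results = pool.starmap(worker_do_more_ops, zip(range(number_of_sections),
                                                            itertools.repeat(hist_full)))

        display_output(m, some_results)   # hypothetical: your per-iteration output step

    pool.close()
    pool.join()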
