
Python multiprocessing pool; wait for iteration to complete

I've got a large dataset which I want my script to iterate through, perform a series of operations on each entry, then arrange the results for storage to HDD. Because the datasets can be relatively large (~250 GB), RAM availability requires that the dataset be processed in chunks (what I've called dataBlock in the code below) of 1000 entries at a time. I also use the multiprocessing.Pool class to facilitate use of multiple CPU cores for this task.

I've essentially got things arranged so that each dataBlock is passed to the Pool, the Pool executes the desired calculations on the dataBlock using the imap method, the Pool returns the calculation results, and the results for the chunk are appended to a list. This list (processed_data) is the desired end product of the set of calculations.

import multiprocessing

import numpy as np

processed_data = []

with multiprocessing.Pool(processor_cap) as pool:

    for blockIndex, block in enumerate(range(1000, height - remainder, 1000)):

        # Read in 1000 spectra from the source dataset
        dataBlock = np.asarray(raw_dset[blockIndex*1000:block][:])

        '''
        Pass the data block to the processor pool, which iterates through
        the block. Each spectrum is handed off to a CPU in the pool,
        which centroids it; the results are collected in "processed_block".
        '''
        processed_block = pool.imap(centroid_spectrum, dataBlock)

        # Append processed spectra to the processed data bin
        for idx, processed_spectrum in enumerate(processed_block):
            processed_data.append(processed_spectrum)

What I'd like to know is how to make the script pause after the call to pool.imap() until the full processed_block has been returned, without closing the pool. Currently, it proceeds straight into the for loop that immediately follows in the snippet above, without waiting for processed_block to be returned by pool.imap . I've tried calling pool.join() immediately after the pool.imap() call, but it only raises an AssertionError and execution again continues to the for loop below it. I can eventually call pool.close() and pool.join() successfully later in the script, once all dataBlocks have been fed to the pool, just below the end of the outermost for loop above.

Thanks in advance for helping!

It's difficult to work with your example without a lot of effort to change things around, but if you have an iterator from the imap() call, you might consider resolving the iterator to a list before you reach the for loop:

processed_block = pool.imap(centroid_spectrum, dataBlock)
processed_block = [ x for x in processed_block ] # convert from an iterator to a list
for idx, processed_spectrum in enumerate(processed_block):

etc.

Does that achieve what you wanted?

I simply changed the Pool.imap() call to a Pool.map() call, and the script ran as intended, since Pool.map() blocks until all results for the chunk are ready. See my exchange with Mikhail Burshteyn for more info.

