
Python multiprocessing pool; wait for iteration to complete

I've got a large dataset which I want my script to iterate through, perform a series of operations on each entry, then arrange the results for storage to HDD. Because the datasets can be relatively large (~250 GB), RAM availability requires that the dataset be processed in chunks (what I've called dataBlock in the code below) of 1000 entries at a time. I also use the multiprocessing.Pool class to facilitate use of multiple CPU cores for this task.

I've essentially got things arranged so that each dataBlock is passed to the Pool, the Pool executes the desired calculations on the dataBlock using the imap method, the Pool returns the calculation results, and the results for the data chunk are appended to a list. This list ( processed_data ) is the desired end product of the set of calculations.

processed_data = []

with multiprocessing.Pool(processor_cap) as pool:

    for blockIndex, block in enumerate(range(1000, height-remainder, 1000)):

        #Read-in 1000 spectra from source dataset
        dataBlock = np.asarray(raw_dset[blockIndex*1000:block][:])

        '''
        Pass data block to processor pool, which iterates through data
        block. Each spectrum is handed off to a CPU in the pool,
        which centroids it and appends the result to "processed_block".
        '''
        processed_block = pool.imap(centroid_spectrum, dataBlock)

        #Append processed spectra to processed data bin
        for idx, processed_spectrum in enumerate(processed_block):
            processed_data.append(processed_spectrum)

What I'd like to know is how to make the script pause after the call to pool.imap() until the full processed_block has been returned, without closing the pool. Currently, it progresses straight into the for loop that immediately follows in the snippet above, without waiting for processed_block to be returned by pool.imap. I've tried calling pool.join() immediately after the pool.imap() call, but it only raises an AssertionError and execution again continues to the for loop below it. I can eventually call pool.close() and pool.join() successfully later in the script, once all dataBlocks have been fed to the pool, just below the end of the outermost for loop above.

Thanks in advance for helping!

It's difficult to work with your example without a lot of effort to change things around; but if you have an iterator from the imap() call, then you might consider resolving the elements of the iterator to a list before you reach the for loop:

processed_block = pool.imap(centroid_spectrum, dataBlock)
processed_block = [x for x in processed_block]  # convert from an iterator to a list
for idx, processed_spectrum in enumerate(processed_block):

etc.

Does that achieve what you wanted?
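The suggestion above can be demonstrated with a minimal runnable sketch (a toy square() function stands in for centroid_spectrum here, which is an assumption, not the original routine): draining the lazy iterator returned by imap() into a list blocks until every result for the current chunk has been computed.

```python
import multiprocessing

def square(x):
    # Toy stand-in for centroid_spectrum
    return x * x

if __name__ == "__main__":
    with multiprocessing.Pool(2) as pool:
        data_block = range(10)
        # imap() returns immediately with a lazy iterator...
        processed_block = pool.imap(square, data_block)
        # ...but collecting it into a list waits for all workers to finish
        processed_block = [x for x in processed_block]
        print(processed_block)  # [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
```

Only once the list comprehension has run is the whole chunk guaranteed to be in memory, so the following for loop sees completed results.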

I simply changed the Pool.imap() call to a Pool.map() call, and the script ran as intended. See my exchange with Mikhail Burshteyn for more info.
