
python multiprocessing - how to act on interim results

I'm using pandas to calculate statistics on a lot of data, but it ends up running for hours, and I get new data frequently. I've already tried to optimize it, but I'd like to make it faster, so I'm trying to make it use multiple processes. The problem I'm having is that I need to do some interim work with the results as each one finishes, and the examples I've seen for multiprocessing.Process and Pool all wait for everything to finish before working with the results.

This is the heavily trimmed code I'm using now. The piece I want to put into separate processes is generateAnalytics().

for counter, symbol in enumerate(queuelist):  # queuelist
    if needQueueLoad:  # set by another thread that's monitoring for new data (in the form of a new file that arrives a couple times a day)
        log.info('Shutting down analyticsRunner thread')
        break
    dfDay = generateAnalytics(symbol)  # slow running function (15s+)
    astore[analyticsTable(symbol)] = dfDay  # astore is a pandas store (HDF5). analyticsTable() returns the name of the appropriate table, which gets overwritten
    dfLatest.loc[symbol] = dfDay.iloc[-1]  # update with the latest results (dfLatest is the latest results for each symbol, which is loaded as a global at startup and periodically saved back to the store in another thread)

    log.info('Processed {}/{} securities in queue.'.format(counter+1, len(queuelist)))
    # do some stuff to update progress GUI 

I can't figure out how to make the last few lines act on each result while the rest of the work is still running, and would appreciate suggestions.

I'm considering running it all in a Pool and having the processes add their results to a Queue (instead of returning them), with a while loop in the main process pulling results off the queue as they come in. Would that be a reasonable way to do it? Something like:

mpqueue = multiprocessing.Queue()
pool = multiprocessing.Pool()
pool.map(generateAnalytics, [queuelist, mpqueue])

while not needQueueLoad:  # set by another thread that's monitoring for new data (in the form of a new file that arrives a couple times a day)
    while not mpqueue.empty():
        dfDay = mpqueue.get()
        astore[analyticsTable(symbol)] = dfDay  # astore is a pandas store (HDF5). analyticsTable() returns the name of the appropriate table, which gets overwritten
        dfLatest.loc[symbol] = dfDay.iloc[-1]  # update with the latest results (dfLatest is the latest results for each symbol, which is loaded as a global at startup and periodically saved back to the store in another thread)    
        log.info('Processed {}/{} securities in queue.'.format(counter+1, len(queuelist)))
        # do some stuff to update GUI that shows progress            
    sleep(0.1)
    # do some bookkeeping to see if queue has finished
pool.join()

Using a Queue looks like a reasonable way to do it, with two remarks.

  1. Since your code suggests that you're using a GUI, checking for results is probably better done in a timeout function or idle function than in a while-loop. A while-loop that polls for results would block the GUI's event loop.

  2. If the worker processes need to return a lot of data to the main process via the Queue, this will add significant overhead, because everything put on the Queue is pickled and copied between processes. You might want to consider using shared memory or even an intermediate file.
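
As a rough sketch of the first remark, and not a drop-in replacement for the code in the question: the workers put results on a queue, and the main process drains it with a non-blocking function that could be called from a GUI timeout or idle callback. The names generate_analytics, worker, drain and run are hypothetical stand-ins, and the results here are deliberately tiny. The sketch uses a Manager().Queue() because a plain multiprocessing.Queue cannot be passed as an argument to Pool tasks.

```python
import multiprocessing as mp
import queue
import time


def generate_analytics(symbol):
    """Hypothetical stand-in for the slow generateAnalytics() in the question."""
    time.sleep(0.01)
    return symbol, len(symbol)


def worker(symbol, results):
    # Put (symbol, result) on the shared queue instead of returning it.
    results.put(generate_analytics(symbol))


def drain(results, handle):
    """Handle every result that is ready right now, without blocking.

    Intended to be called repeatedly, e.g. from a GUI timeout/idle callback.
    Returns the number of results handled during this call.
    """
    handled = 0
    while True:
        try:
            symbol, value = results.get_nowait()
        except queue.Empty:
            return handled
        handle(symbol, value)
        handled += 1


def run(symbols, processes=2):
    """Process all symbols in a pool, acting on each result as it arrives."""
    latest = {}
    with mp.Manager() as manager:
        results = manager.Queue()  # proxy queue that can be sent to pool workers
        with mp.Pool(processes) as pool:
            tasks = [pool.apply_async(worker, (s, results)) for s in symbols]
            pool.close()
            done = 0
            # In a GUI, this loop would be replaced by a timer calling drain().
            while done < len(symbols):
                done += drain(results, latest.__setitem__)
                for task in tasks:
                    if task.ready():
                        task.get()  # re-raises any exception from the worker
                time.sleep(0.01)
            pool.join()
    return latest


if __name__ == "__main__":
    print(run(["AAA", "BB", "C"]))
```

In the question's setting, the handle callback is where the writes to astore and dfLatest and the progress update would go, so they all stay in the main process. If the queue is not otherwise needed, Pool.imap_unordered also yields results in the main process as they complete.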

