Python multiprocessing - how to act on interim results

I'm using pandas to calculate statistics and other metrics on a lot of data, but it ends up running for hours, and I get new data frequently. I've already tried to optimize, but I'd like to make it faster, so I'm trying to use multiple processes. The problem is that I need to perform some interim work with the results as they come in, and the examples I've seen for multiprocessing.Process and Pool all wait for everything to finish before working with the results.

This is the heavily trimmed code I'm using now. The piece I want to put into separate processes is generateAnalytics().

for counter, symbol in enumerate(queuelist):
    if needQueueLoad:  # set by another thread that's monitoring for new data (in the form of a new file that arrives a couple times a day)
        log.info('Shutting down analyticsRunner thread')
        break
    dfDay = generateAnalytics(symbol)  # slow running function (15s+)
    astore[analyticsTable(symbol)] = dfDay  # astore is a pandas store (HDF5). analyticsTable() returns the name of the appropriate table, which gets overwritten
    dfLatest.loc[symbol] = dfDay.iloc[-1]  # update with the latest results (dfLatest is the latest results for each symbol, which is loaded as a global at startup and periodically saved back to the store in another thread)

    log.info('Processed {}/{} securities in queue.'.format(counter+1, len(queuelist)))
    # do some stuff to update progress GUI 

I can't figure out how to make the last lines work with the results while the processing is still ongoing, and would appreciate suggestions.

I'm considering running it all in a Pool and having the processes add their results to a Queue (instead of returning them), then having a while-loop in the main process pull results off the queue as they come in. Would that be a reasonable way to do it? Something like:

from functools import partial

# A plain multiprocessing.Queue() can't be passed as an argument to Pool
# workers, so a Manager queue is used here instead.
mpqueue = multiprocessing.Manager().Queue()
pool = multiprocessing.Pool()
# map() would block until every symbol is finished; map_async() returns
# immediately, so the loop below can consume results as they arrive.
pool.map_async(partial(generateAnalytics, mpqueue=mpqueue), queuelist)

processed = 0
while not needQueueLoad:  # set by another thread that's monitoring for new data (a new file arrives a couple of times a day)
    while not mpqueue.empty():
        symbol, dfDay = mpqueue.get()  # workers put (symbol, result) tuples on the queue
        astore[analyticsTable(symbol)] = dfDay  # astore is a pandas store (HDF5); analyticsTable() returns the name of the table to overwrite
        dfLatest.loc[symbol] = dfDay.iloc[-1]  # dfLatest holds the latest result per symbol (loaded as a global at startup, saved back to the store periodically in another thread)
        processed += 1
        log.info('Processed {}/{} securities in queue.'.format(processed, len(queuelist)))
        # do some stuff to update GUI that shows progress
    sleep(0.1)
    # do some bookkeeping to see if the queue has finished
pool.close()
pool.join()
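
For that sketch to work, generateAnalytics() would also have to change to accept the queue and put its result on it instead of returning it. Roughly (the (symbol, dfDay) tuple convention is an assumption, chosen to match the consumer loop above):

def generateAnalytics(symbol, mpqueue):
    dfDay = ...  # the existing slow-running calculation (15s+ per symbol), unchanged
    # Put a (symbol, result) tuple on the queue rather than returning the
    # DataFrame, so the consumer loop knows which security it belongs to.
    mpqueue.put((symbol, dfDay))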

Using a Queue looks like a reasonable way to do it, with two remarks.

  1. Since it looks from the code that you're using a GUI, checking for results is probably better done in a timeout or idle function rather than in a while-loop: a while-loop that polls for results would block the GUI's event loop (see the first sketch after this list).

  2. If the worker processes need to return a lot of data to the main process via the Queue, this adds significant overhead, since everything on the queue has to be pickled and copied between processes. You might want to consider using shared memory or even an intermediate file, and passing only a reference through the queue (see the second sketch below).
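
On remark 1, here is a minimal sketch of polling driven by the GUI's own event loop rather than a blocking while-loop, using Tkinter's after() purely as a stand-in for whatever toolkit is actually in use (mpqueue is the queue from the question's sketch; handleResult is a hypothetical callback):

import queue
import tkinter as tk

root = tk.Tk()

def handleResult(symbol, dfDay):
    pass  # placeholder: store the result and update the progress GUI here

def poll_queue():
    # Drain whatever has arrived without ever blocking: get_nowait()
    # raises queue.Empty as soon as the queue is empty.
    try:
        while True:
            symbol, dfDay = mpqueue.get_nowait()
            handleResult(symbol, dfDay)
    except queue.Empty:
        pass
    root.after(100, poll_queue)  # check again in 100 ms; the event loop stays responsive

root.after(100, poll_queue)
root.mainloop()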
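
On remark 2, one way to keep large DataFrames off the queue is to pass only a reference: each worker writes its result to a file and sends back the path, which is cheap to pickle. A sketch under that assumption (the helper names and the temp-file naming scheme are made up for illustration):

import os
import tempfile

import pandas as pd

def putResultAsFile(symbol, dfDay, mpqueue):
    # Worker side: write the DataFrame to a temporary pickle file and
    # send only (symbol, path) through the queue.
    path = os.path.join(tempfile.gettempdir(), '{}.pkl'.format(symbol))
    dfDay.to_pickle(path)
    mpqueue.put((symbol, path))

def getResultFromFile(mpqueue):
    # Consumer side: read the DataFrame back and remove the temp file.
    symbol, path = mpqueue.get()
    dfDay = pd.read_pickle(path)
    os.remove(path)
    return symbol, dfDay

Shared memory would avoid the disk round-trip, but marshalling a DataFrame in and out of it is considerably more work than a pickle file.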
