Multiprocess Python/Numpy code for processing data faster

Question

I am reading in hundreds of HDF files and processing the data of each HDF file separately. However, this takes an awfully long time, since it works on one HDF file at a time. I just stumbled upon http://docs.python.org/library/multiprocessing.html and am now wondering how I can speed things up using multiprocessing.

So far, I came up with this:

import numpy as np
from multiprocessing import Pool

def myhdf(date):
    ii      = dates.index(date)
    year    = date[0:4]
    month   = date[4:6]
    day     = date[6:8]
    rootdir = 'data/mydata/'
    filename = 'no2track'+year+month+day
    records = read_my_hdf(rootdir,filename)
    if records.size:
        results[ii] = np.mean(records)

dates = ['20080105','20080106','20080107','20080108','20080109']
results = np.zeros(len(dates))

pool = Pool(len(dates))
pool.map(myhdf,dates)

However, this is obviously not correct. Can you follow my train of thought about what I want to do? What do I need to change?

Answer 1

Try joblib for a friendlier multiprocessing wrapper:

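import numpy as np
from joblib import Parallel, delayed

def myhdf(date):
    # do work: read the file for this date into `records`
    return np.mean(records)

# n_jobs=-1 uses all available CPU cores; results come back in the order of dates
results = Parallel(n_jobs=-1)(delayed(myhdf)(d) for d in dates)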

Answer 2

The Pool class's map function works like the built-in map function in the Python standard library: you are guaranteed to get your results back in the order you put them in. Knowing that, the only other trick is that you need to return results in a consistent manner and then filter them afterwards.

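import numpy as np
from multiprocessing import Pool

def myhdf(date):
    # date is a 'YYYYMMDD' string
    year    = date[0:4]
    month   = date[4:6]
    day     = date[6:8]
    rootdir = 'data/mydata/'
    filename = 'no2track'+year+month+day
    records = read_my_hdf(rootdir,filename)
    if records.size:
        return np.mean(records)
    return None  # explicit marker for an empty file

dates = ['20080105','20080106','20080107','20080108','20080109']

# the guard keeps worker processes from re-running this part on import
if __name__ == '__main__':
    # one worker per date; with hundreds of files, Pool() (one worker
    # per CPU) is usually the better choice
    pool = Pool(len(dates))
    results = pool.map(myhdf,dates)
    pool.close()
    pool.join()
    # compare against None so that a legitimate mean of 0.0 is kept
    results = [ result for result in results if result is not None ]
    results = np.array(results)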
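
One side effect of filtering is that the positions in results no longer line up with dates. Because map returns results in input order, the two lists can be zipped before filtering. A minimal sketch, assuming empty files come back as None; means_by_date is a hypothetical helper name and the values are made up for illustration:

```python
def means_by_date(dates, results):
    """Pair each date with its result, skipping dates whose file was empty (None)."""
    return {date: result
            for date, result in zip(dates, results)
            if result is not None}

# Example values standing in for the output of pool.map(myhdf, dates)
dates = ['20080105', '20080106', '20080107']
results = [2.5, None, 0.0]
print(means_by_date(dates, results))
```

Keying by date also keeps a mean of exactly 0.0, which a plain truthiness test would throw away.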

If you really do want results as soon as they are available, you can use imap_unordered.
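
With imap_unordered the order of the results is no longer tied to the order of dates, so each worker has to say which date its result belongs to. A rough sketch of that idea, assuming a stand-in dictionary in place of the real read_my_hdf call and statistics.mean in place of np.mean so it runs without any HDF files; FAKE_RECORDS, myhdf_tagged and collect_unordered are hypothetical names:

```python
from multiprocessing import Pool
from statistics import mean

# Hypothetical stand-in data so the sketch runs without any HDF files;
# in the real script this lookup would be the read_my_hdf call.
FAKE_RECORDS = {
    '20080105': [1.0, 3.0],
    '20080106': [],            # an empty file
    '20080107': [0.0, 0.0],    # a legitimate mean of 0.0
}

def myhdf_tagged(date):
    """Return (date, mean), or (date, None) for an empty file."""
    records = FAKE_RECORDS[date]
    if records:
        return date, mean(records)
    return date, None

def collect_unordered(pool, dates):
    """Consume results as workers finish and key them by date."""
    means = {}
    for date, value in pool.imap_unordered(myhdf_tagged, dates):
        if value is not None:
            means[date] = value
    return means

if __name__ == '__main__':
    with Pool() as pool:
        print(collect_unordered(pool, sorted(FAKE_RECORDS)))
```

The loop body in collect_unordered runs as each result arrives, so that is the place to write partial output or update a progress counter.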

2 answers

solution1
4 2012-10-25 11:38:03

solution2
2 ACCEPTED 2012-10-25 09:52:15
