Multiprocess Python/Numpy code for processing data faster

I am reading in hundreds of HDF files and processing the data of each HDF separately. However, this takes an awful lot of time, since it is working on one HDF file at a time. I just stumbled upon http://docs.python.org/library/multiprocessing.html and am now wondering how I can speed things up using multiprocessing.

So far, I have come up with this:

import numpy as np
from multiprocessing import Pool

def myhdf(date):
    ii      = dates.index(date)
    year    = date[0:4]
    month   = date[4:6]
    day     = date[6:8]
    rootdir = 'data/mydata/'
    filename = 'no2track'+year+month+day
    records = read_my_hdf(rootdir,filename)
    if records.size:
        results[ii] = np.mean(records)

dates = ['20080105','20080106','20080107','20080108','20080109']
results = np.zeros(len(dates))

pool = Pool(len(dates))
pool.map(myhdf,dates)

However, this is obviously not correct. Can you follow my chain of thought about what I want to do? What do I need to change?

Try joblib for a friendlier multiprocessing wrapper:

import numpy as np
from joblib import Parallel, delayed

def myhdf(date):
    # do the per-file work from the question and return the result
    records = read_my_hdf('data/mydata/', 'no2track' + date[0:4] + date[4:6] + date[6:8])
    if records.size:
        return np.mean(records)

results = Parallel(n_jobs=-1)(delayed(myhdf)(d) for d in dates)

The Pool class's map function is like the standard Python library's map function: you're guaranteed to get your results back in the order that you put them in. Knowing that, the only other trick is that you need to return results in a consistent manner, and then filter them afterwards.

import numpy as np
from multiprocessing import Pool

def myhdf(date):
    year    = date[0:4]
    month   = date[4:6]
    day     = date[6:8]
    rootdir = 'data/mydata/'
    filename = 'no2track' + year + month + day
    records = read_my_hdf(rootdir, filename)
    if records.size:
        return np.mean(records)
    # implicitly returns None for empty files; filtered out below

dates = ['20080105', '20080106', '20080107', '20080108', '20080109']

pool = Pool(len(dates))
results = pool.map(myhdf, dates)
results = [result for result in results if result is not None]
results = np.array(results)

If you really do want results as soon as they are available, you can use imap_unordered.
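For example, here is a minimal sketch (reusing the question's read_my_hdf, plus a hypothetical myhdf_keyed helper that tags each result with its date, since unordered results need a key to match them back to their inputs):

import numpy as np
from multiprocessing import Pool

def myhdf_keyed(date):
    # hypothetical variant of myhdf that returns (date, mean), so each
    # result can be matched to its input even when it arrives out of order
    year, month, day = date[0:4], date[4:6], date[6:8]
    records = read_my_hdf('data/mydata/', 'no2track' + year + month + day)
    return date, np.mean(records) if records.size else None

dates = ['20080105', '20080106', '20080107', '20080108', '20080109']

pool = Pool(len(dates))
for date, mean in pool.imap_unordered(myhdf_keyed, dates):
    # results are yielded as soon as each worker finishes, not in input order
    print(date, mean)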
