How to pass parameters other than data through the pool.imap() function for multiprocessing in Python?

I am working on hyperspectral images. To reduce noise in an image I am applying a wavelet transform using the pywt package. When I do this serially it works smoothly, but when I try to run the wavelet transform in parallel across multiple cores, I have to pass certain parameters:

  1. wavelet family
  2. thresholding value
  3. threshold technique (hard/soft)

But I am not able to pass these parameters through the pool object: with pool.imap() I can pass only the data as an argument. When I use pool.apply_async() instead, it takes much more time and the order of the output is not the same. Here is the code for reference:

import matplotlib.pyplot as plt
import numpy as np
import multiprocessing as mp
import os
import time
from math import log10, sqrt
import pywt
import tifffile

def spec_trans(d, wav_fam, threshold_val, thresh_type):

    # decomposition, thresholding and reconstruction are helper functions
    # defined elsewhere in the full script (not shown in this snippet)
    data = np.array(d, dtype=np.float64)
    data_dec = decomposition(data, wav_fam)
    data_t = thresholding(data_dec, threshold_val, thresh_type)
    data_rec = reconstruction(data_t, wav_fam)

    return data_rec

if __name__ == '__main__':
    
    # input
    X = tifffile.imread('data/Classification/university.tif')
    # take parameters
    threshold_val = float(input("Enter the value for image thresholding: "))
    print("The available wavelet functions:", pywt.wavelist())
    wav_fam = input("Choose a wavelet function for transformation: ")
    threshold_type = ['hard', 'soft']
    print("The available thresholding techniques:", threshold_type)
    thresh_type = input("Choose a type for thresholding technique: ")

    start = time.time()
    p = mp.Pool(4)
    jobs = []
    # NOTE: xmp is not defined in this snippet; it is presumably an
    # iterable over the bands of X
    for dataBand in xmp:
        jobs.append(p.apply_async(spec_trans, args=(dataBand, wav_fam, threshold_val, thresh_type)))
    transformedX = []
    for jobBit in jobs:
        transformedX.append(jobBit.get())
    end = time.time()
    p.close()
    p.join()


Also, when I use the 'soft' thresholding technique I get the following runtime warning:

C:\Users\Sawon\anaconda3\lib\site-packages\pywt\_thresholding.py:25: RuntimeWarning: invalid value encountered in multiply
  thresholded = data * thresholded

The results of serial and parallel execution should be more or less the same, but here I am getting slightly different results. Any suggestion to modify the code would be helpful. Thanks.

[This isn't a direct answer to the question, but a clearer follow-up query than trying to ask it via the small comment box]

As a quick check, pass an iteration counter into spec_trans and return it back out (along with your result), push it into a separate list (transformedXseq, say), and then compare it to your input sequence, i.e.:

def spec_trans(d, wav_fam, threshold_val, thresh_type, iCount):

    data = np.array(d, dtype=np.float64)
    data_dec = decomposition(data, wav_fam)
    data_t = thresholding(data_dec, threshold_val, thresh_type)
    data_rec = reconstruction(data_t, wav_fam)

    return data_rec, iCount

and then within main

jobs = []
iJobs = 0
for dataBand in xmp:
    jobs.append(p.apply_async(spec_trans, args=(dataBand, wav_fam, threshold_val, thresh_type, iJobs)))
    iJobs += 1

transformedX = []
transformedXseq = []
for jobBit in jobs:
    res = jobBit.get()
    transformedX.append(res[0])
    transformedXseq.append(res[1])

... and check the list transformedXseq to see whether you've gathered the jobs back in the sequence you submitted them. It should match! A quick way to verify is sketched below.
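For instance, a one-line check (a minimal sketch, assuming the jobs were numbered from 0 as above):

assert transformedXseq == list(range(len(transformedXseq))), \
    'results were gathered out of submission order'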

Assuming wav_fam, threshold_val and thresh_type do not vary from call to call, first arrange for these arguments to be the first arguments to the worker function spec_trans:

def spec_trans(wav_fam, threshold_val, thresh_type, d):

Now, I don't see where in your pool-creation block you have defined xmp, but presumably it is an iterable. You need to modify your code as follows:

from functools import partial

def compute_chunksize(pool_size, iterable_size):
    # same heuristic Pool.map itself uses to pick a default chunksize
    chunksize, remainder = divmod(iterable_size, 4 * pool_size)
    if remainder:
        chunksize += 1
    return chunksize

if __name__ == '__main__':

    X = tifffile.imread('data/Classification/university.tif')
    # take parameters
    threshold_val = float(input("Enter the value for image thresholding: "))
    print("The available wavelet functions:", pywt.wavelist())
    wav_fam = input("Choose a wavelet function for transformation: ")
    threshold_type = ['hard', 'soft']
    print("The available thresholding techniques:", threshold_type)
    thresh_type = input("Choose a type for thresholding technique: ")

    start = time.time()
    p = mp.Pool(4)
    # the first 3 arguments to spec_trans will be wav_fam, threshold_val and thresh_type
    worker = partial(spec_trans, wav_fam, threshold_val, thresh_type)
    suitable_chunksize = compute_chunksize(4, len(xmp))
    transformedX = list(p.imap(worker, xmp, chunksize=suitable_chunksize))
    end = time.time()
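If it helps to see what partial is doing here, a tiny standalone sketch (toy function body and illustrative values such as 'db2'; not the real transform):

from functools import partial

def spec_trans(wav_fam, threshold_val, thresh_type, d):
    # toy body, just to show how the arguments are bound
    return (wav_fam, threshold_val, thresh_type, d)

worker = partial(spec_trans, 'db2', 0.5, 'hard')
# worker(d) is now equivalent to spec_trans('db2', 0.5, 'hard', d)
print(worker([1, 2, 3]))  # ('db2', 0.5, 'hard', [1, 2, 3])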

To obtain improved performance over apply_async, you must use a "suitable chunksize" value with imap. The function compute_chunksize can compute such a value based on the size of your pool, which is 4, and the size of the iterable being passed to imap, which would be len(xmp). If xmp is small enough that the computed chunksize is 1, I don't really see how imap would be significantly more performant than apply_async.
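For instance, if xmp held 100 bands (an illustrative number, since the real size isn't given), the computation would go:

chunksize, remainder = divmod(100, 4 * 4)   # divmod(iterable_size, 4 * pool_size) -> (6, 4)
print(compute_chunksize(4, 100))            # remainder is non-zero, so 6 + 1 = 7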

Of course, you might as well just use:

    transformedX = p.map(worker, xmp)

And let the pool compute its own suitable chunksize. imap has an advantage over map when the iterable is very large and not already a list: for map to compute a suitable chunksize it would first have to convert the iterable to a list just to get its length, which could be memory inefficient. But if you know the length (or approximate length) of the iterable, then with imap you can set an explicit chunksize without converting the iterable to a list. The other advantage of imap (and imap_unordered) over map is that you can process the results of individual tasks as they become available, whereas with map you only get results when all the submitted tasks are complete.
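To make the large-iterable point concrete, here is a minimal sketch (synthetic bands and a stand-in worker, purely for illustration) that feeds imap from a generator, so the full dataset is never materialized as a list:

import multiprocessing as mp

import numpy as np

def worker(band):
    # stand-in for the real spec_trans pipeline
    return band.mean()

def band_generator(n_bands, shape=(64, 64)):
    # yields synthetic bands one at a time; no list is ever built
    rng = np.random.default_rng(0)
    for _ in range(n_bands):
        yield rng.random(shape)

if __name__ == '__main__':
    n_bands = 100  # length known in advance, even though the iterable is lazy
    with mp.Pool(4) as p:
        results = list(p.imap(worker, band_generator(n_bands), chunksize=7))
    print(len(results))  # 100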

Update

If you want to catch possible exceptions thrown by the individual tasks submitted to your worker function, then stick with imap and use the following code to iterate over the results it returns:

    #transformedX = list(p.imap(worker, xmp, chunksize=suitable_chunksize))
    import traceback

    transformedX = []
    results = p.imap(worker, xmp, chunksize=suitable_chunksize)
    while True:
        try:
            result = next(results)
        except StopIteration:  # no more results
            break
        except Exception as e:
            print('Exception occurred:', e)
            traceback.print_exc()  # print the stack trace
        else:
            transformedX.append(result)
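As a self-contained illustration of this pattern (with a hypothetical fragile_worker, not the real transform), note that the imap iterator keeps yielding the remaining results after one task raises:

import multiprocessing as mp

def fragile_worker(x):
    # hypothetical worker that fails on one particular input
    if x == 3:
        raise ValueError(f'bad input: {x}')
    return x * x

if __name__ == '__main__':
    with mp.Pool(2) as p:
        results = p.imap(fragile_worker, range(6))
        gathered = []
        while True:
            try:
                gathered.append(next(results))
            except StopIteration:  # no more results
                break
            except ValueError as e:
                print('skipped task:', e)
    print(gathered)  # [0, 1, 4, 16, 25] -- the task for x == 3 was skipped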
