
Python multiprocessing with shared RawArray

I want to have multiple processes each read a different row of a numpy array in parallel to speed things up. However, when I run the following code, the first process to reach func throws an error as if var were no longer in scope. Why is this happening?

import numpy as np
import multiprocessing as mp

num_procs = 16
num_points = 2500000

def init_worker(X):
    global var
    var = X

def func(proc):
    X_np = np.frombuffer(var).reshape((num_procs, num_points))
    for y in range(num_points):
        z = X_np[proc][y]

if __name__ == '__main__':
    data = np.random.randn(num_procs, num_points)
    X = mp.RawArray('d', num_procs*num_points)
    X_np = np.frombuffer(X).reshape((num_procs, num_points))
    np.copyto(X_np, data)
    pool = mp.Pool(processes=4, initializer=init_worker, initargs=(X,))
    for proc in range(num_procs):
        pool.apply_async(func(proc))
    pool.close()
    pool.join()
Traceback (most recent call last):
  File "parallel_test.py", line 26, in <module>
    pool.apply_async(func(proc))
  File "parallel_test.py", line 13, in func
    X_np = np.frombuffer(var).reshape((num_procs, num_points))
NameError: global name 'var' is not defined

Update: For some reason, if I use Pool.map instead of the for loop with Pool.apply_async, it seems to work, but I don't understand why.
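
The difference comes down to how the task is submitted. pool.apply_async(func(proc)) evaluates func(proc) immediately, in the parent process, and only then passes its return value to apply_async; the traceback confirms this, showing func being entered from <module>. Since init_worker only ever runs in the pool's workers, var is undefined in the parent, hence the NameError. Pool.map(func, range(num_procs)) passes the function object itself, so func executes in the workers, where var exists. A minimal sketch of a corrected submission loop, keeping the init_worker setup from the question:

# Pass the function and its argument tuple separately, so that func runs
# in the worker processes, where init_worker has defined the global var.
results = [pool.apply_async(func, (proc,)) for proc in range(num_procs)]
for r in results:
    r.get()  # get() re-raises any exception that occurred inside a worker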

Is there any reason not to declare X as global in the top-level scope? This eliminates the NameError.

import numpy as np
import multiprocessing as mp

num_procs = 16
num_points = 25000000


def func(proc):
    X_np = np.frombuffer(X).reshape((num_procs, num_points))
    for y in range(num_points):
        z = X_np[proc][y]

if __name__ == '__main__':
    data = np.random.randn(num_procs, num_points)
    global X
    X = mp.RawArray('d', num_procs*num_points)
    X_np = np.frombuffer(X).reshape((num_procs, num_points))
    np.copyto(X_np, data)
    pool = mp.Pool(processes=4)
    for proc in range(num_procs):
        pool.apply_async(func(proc))
    pool.close()
    pool.join()
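
Two caveats with this version. First, pool.apply_async(func(proc)) still calls func in the parent process (see the note above), so the workers never actually run anything and the loop executes serially; it only stops raising because X is now a module-level global in the parent. Second, once the submission is fixed to pool.apply_async(func, (proc,)), this version depends on the workers inheriting the module-level X, which happens on fork-based platforms (Linux) but not where multiprocessing spawns fresh interpreters (Windows, and macOS on recent Python versions), because the if __name__ == '__main__' block never runs in a spawned child. The initializer pattern from the question is the portable way to hand the buffer to the workers; a sketch reusing the names above:

def init_worker(shared):
    # Runs once in each worker process; publishes the RawArray as a global
    # so that func can build its numpy view on top of it.
    global X
    X = shared

pool = mp.Pool(processes=4, initializer=init_worker, initargs=(X,))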

When I run a reduced instance of this problem (num_procs = 4, num_points = 5, i.e. 20 values in total):

import numpy as np
import multiprocessing as mp

num_procs = 4
num_points = 5


def func(proc):
    X_np = np.frombuffer(X).reshape((num_procs, num_points))
    for y in range(num_points):
        z = X_np[proc][y]

if __name__ == '__main__':
    data = np.random.randn(num_procs, num_points)
    global X
    X = mp.RawArray('d', num_procs*num_points)
    X_np = np.frombuffer(X).reshape((num_procs, num_points))
    np.copyto(X_np, data)
    pool = mp.Pool(processes=4)
    for proc in range(num_procs):
        pool.apply_async(func(proc))
    pool.close()
    pool.join()
    print("\n".join(map(str, X)))

I get the following output:

-0.6346037804619162
1.1005724710066107
0.33458763357165255
0.6409345714971889
0.7124888766851982
0.36760459213332963
0.23593304931386933
-0.8668969562941349
-0.8842756219923469
0.005979036105620422
1.386422154089567
-0.8770988782214508
0.25187448339771057
-0.2473967968471952
-0.4909708883978521
0.5423521489750244
0.018749603867333802
0.035304792504378055
1.3263872668956616
1.0199839603892742

You haven't provided a sample of the expected output. Does this look similar to what you expect?
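
Rather than eyeballing the printed floats, the round trip can be checked programmatically; a quick test using the variables from the listing above:

# func only reads from the buffer, so after the pool finishes X should
# still hold exactly the values that were copied in from data.
print(np.array_equal(np.frombuffer(X).reshape((num_procs, num_points)), data))

If this prints True, the shared array is intact and the printed numbers are simply data flattened in row-major order.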
