I have found many similar questions but no answer. For a simple array there is multiprocessing.Array. For a sparse matrix, or any other arbitrary object, I found manager.Namespace. So I tried the code below:
from scipy import sparse
from multiprocessing import Pool
import multiprocessing
import functools

def myfunc(x, ns):
    return ns.A[x, :] * ns.A * ns.A[:, x]

manager = multiprocessing.Manager()
Global = manager.Namespace()
pool = Pool()
Global.A = sparse.rand(10000, 10000, 0.5, 'csr')
myfunc2 = functools.partial(myfunc, ns=Global)
r = pool.map(myfunc2, range(100))
The code works, but it is not efficient: only 4 of my 16 workers are actually doing work. My guess is that the manager allows only one worker to access the data at a time. Since the data is read-only, I don't really need a lock. Is there a more efficient way to do this?
P.S. I have seen people talking about copy-on-write fork(). I don't really understand what it is, but it does not work here: if I generate A first and then call Pool(), each process ends up with a copy of A.
Thank you in advance.
A property of a Namespace object is only updated when it is explicitly assigned to; good explanations are given here.
Edit: Looking at the implementation (in multiprocessing/managers.py), it does not seem to use shared memory. It just pickles objects and sends them to the child when requested. That is probably why it is taking so long.
Are you by any chance creating a pool with more workers than your CPU has cores (i.e. using the processes argument of the Pool constructor)? That is generally not a good idea.
There are a couple of other things you can try: