Why does this small snippet hang when using multiprocessing with maxtasksperchild, numpy.random.randint and numpy.random.seed?
I have a Python script that concurrently processes numpy arrays and images in a random way. To have proper randomness inside the spawned processes, I pass a random seed from the main process to the workers so they can be seeded.

When I use maxtasksperchild for the Pool, my script hangs after running Pool.map a number of times.

The following is a minimal snippet that reproduces the problem:
# This code stops after multiprocessing.Pool workers are replaced one single time.
# They are replaced due to maxtasksperchild parameter to Pool
from multiprocessing import Pool
import numpy as np

def worker(n):
    # Removing np.random.seed solves the issue
    np.random.seed(1)  # any seed value
    return 1234  # trivial return value

# Removing maxtasksperchild solves the issue
ppool = Pool(20, maxtasksperchild=5)

i = 0
while True:
    i += 1
    # Removing np.random.randint(10) or taking it out of the loop solves the issue
    rand = np.random.randint(10)
    l = [3]  # trivial input to ppool.map
    result = ppool.map(worker, l)
    print i, result[0]
This is the output:

1 1234
2 1234
3 1234
.
.
.
99 1234
100 1234  # at this point workers should've reached maxtasksperchild tasks
101 1234
102 1234
103 1234
104 1234
105 1234
106 1234
107 1234
108 1234
109 1234
110 1234

then it hangs indefinitely.
I could potentially replace numpy.random with Python's built-in random and get away from the problem. However, in my actual application the worker will execute user code (given as an argument to the worker) which I have no control over, and I would like to allow the use of numpy.random functions in that user code. So I intentionally want to seed the global random generator (for each process independently).

This was tested with Python 2.7.10; numpy 1.11.0, 1.12.0 and 1.13.0; on Ubuntu and OSX.
It turns out this comes from a buggy Python interaction between threading.Lock and multiprocessing.
np.random.seed and most np.random.* functions use a threading.Lock to ensure thread-safety. An np.random.* function generates a random number and then updates the seed (shared across threads), which is why a lock is needed. See np.random.seed and cont0_array (used by np.random.random() and others).
Now, how does this cause a problem in the above snippet?
In a nutshell, the snippet hangs because the threading.Lock state is inherited when forking. So when a child is forked at the same moment the lock is held in the parent (by np.random.randint(10)), the child deadlocks (at np.random.seed).
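The inherited-lock behavior can be shown directly with a few lines (a POSIX-only sketch using os.fork; the acquire timeout is only there so the demo reports the deadlock instead of hanging):

```python
import os
import threading

lock = threading.Lock()
lock.acquire()  # parent thread holds the lock, like np.random.randint would

pid = os.fork()  # POSIX-only; mimics what Pool's fork does under the hood
if pid == 0:
    # The child inherits the lock in its *acquired* state, but the thread
    # that held it does not exist here, so the acquire can never succeed.
    acquired = lock.acquire(timeout=1.0)
    os._exit(0 if acquired else 42)

_, status = os.waitpid(pid, 0)
child_reacquired = os.WEXITSTATUS(status) == 0
print("child re-acquired inherited lock:", child_reacquired)
```

Without the timeout, the child's acquire call would block forever, which is precisely the deadlock seen in the snippet.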
@njsmith explains it in this GitHub issue: https://github.com/numpy/numpy/issues/9248#issuecomment-308054786
multiprocessing.Pool spawns a background thread to manage workers: https://github.com/python/cpython/blob/aefa7ebf0ff0f73feee7ab24f4cdcb2014d83ee5/Lib/multiprocessing/pool.py#L170-L173

It loops in the background calling _maintain_pool: https://github.com/python/cpython/blob/aefa7ebf0ff0f73feee7ab24f4cdcb2014d83ee5/Lib/multiprocessing/pool.py#L366

If a worker exits, for example due to a maxtasksperchild limit, then _maintain_pool calls _repopulate_pool: https://github.com/python/cpython/blob/aefa7ebf0ff0f73feee7ab24f4cdcb2014d83ee5/Lib/multiprocessing/pool.py#L240

And then _repopulate_pool forks some new workers, still in this background thread: https://github.com/python/cpython/blob/aefa7ebf0ff0f73feee7ab24f4cdcb2014d83ee5/Lib/multiprocessing/pool.py#L224
So what's happening is that eventually you get unlucky, and at the same moment that your main thread is calling some np.random function and holding the lock, multiprocessing decides to fork a child, which starts out with the np.random lock already held, but the thread that was holding it is gone. Then the child tries to call into np.random, which requires taking the lock, and so the child deadlocks.
The simple workaround here is to not use fork with multiprocessing. If you use the spawn or forkserver start methods then this should go away.
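A minimal sketch of that workaround (assuming Python 3, where multiprocessing.get_context is available; pool sizes and the worker are illustrative):

```python
import numpy as np
from multiprocessing import get_context

def worker(n):
    np.random.seed(n)  # safe here: no lock state was inherited from the parent
    return int(np.random.randint(10))

def run(rounds=12):
    # 'spawn' starts every worker from a fresh interpreter instead of fork(),
    # so a lock held in the parent can never be inherited in acquired state.
    ctx = get_context("spawn")
    with ctx.Pool(4, maxtasksperchild=5) as pool:
        return [pool.map(worker, [3])[0] for _ in range(rounds)]

if __name__ == "__main__":
    print(run())
```

The trade-off is that spawn is slower to start workers than fork, since each one is a fresh interpreter that re-imports the main module.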
For a proper fix.... ughhh. I guess we need to register a pthread_atfork pre-fork handler that takes the np.random lock before fork and then releases it afterwards? And really I guess we need to do this for every lock in numpy, which requires something like keeping a weakset of every RandomState object, and _FFTCache also appears to have a lock...

(On the plus side, this would also give us an opportunity to reinitialize the global random state in the child, which we really should be doing in cases where the user hasn't explicitly seeded it.)
Using numpy.random.seed is not thread safe. numpy.random.seed changes the value of the seed globally, while, as far as I understand, you are trying to change the seed locally.
If indeed what you are trying to achieve is having the generator seeded at the start of each worker, the following is a solution:
def worker(n):
    # Removing np.random.seed solves the problem
    randgen = np.random.RandomState(45678)  # RandomState, not seed!
    # ...Do something with randgen...
    return 1234  # trivial return value
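A runnable sketch of the same idea (pool parameters and seeds are illustrative): because each task draws from its own private RandomState, the worker never touches the global generator's lock, so it does not matter in what state that lock was inherited.

```python
import numpy as np
from multiprocessing import Pool

def worker(seed):
    # A private generator per task: no call into the shared global
    # np.random state, hence no dependence on its inherited lock.
    rng = np.random.RandomState(seed)
    return int(rng.randint(10))

if __name__ == "__main__":
    with Pool(4, maxtasksperchild=5) as pool:
        print(pool.map(worker, range(8)))
```

Passing a distinct seed per task, as above, also gives each task reproducible yet independent randomness.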
Making this a full-fledged answer since it doesn't fit in a comment.

After playing around a bit, something here smells like a numpy.random bug. I was able to reproduce the freezing bug, and in addition there were some other weird things that shouldn't happen, like manually seeding the generator not working.
def rand_seed(rand, i):
    print(i)
    np.random.seed(i)
    print(i)
    print(rand())

def test1():
    with multiprocessing.Pool() as pool:
        [pool.apply_async(rand_seed, (np.random.random_sample, i)).get()
         for i in range(5)]

test1()
has output:
0
0
0.3205032737431185
1
1
0.3205032737431185
2
2
0.3205032737431185
3
3
0.3205032737431185
4
4
0.3205032737431185
On the other hand, not passing np.random.random_sample as an argument works just fine.
def rand_seed2(i):
    print(i)
    np.random.seed(i)
    print(i)
    print(np.random.random_sample())

def test2():
    with multiprocessing.Pool() as pool:
        [pool.apply_async(rand_seed2, (i,)).get()
         for i in range(5)]

test2()
has output:
0
0
0.5488135039273248
1
1
0.417022004702574
2
2
0.43599490214200376
3
3
0.5507979025745755
4
4
0.9670298390136767
This suggests some serious tomfoolery is going on behind the curtains. Not sure what else to say about it though....

Basically it seems like numpy.random.seed modifies not only the "seed state" variable, but the random_sample function itself.
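One plausible mechanism (an assumption on my part, not something verified in the answer above) is that pickling the bound method np.random.random_sample serializes the global RandomState it is bound to by value, so each worker ends up calling a detached copy of the generator that np.random.seed cannot touch. This can be checked in a single process:

```python
import pickle

import numpy as np

# Pickling the bound method np.random.random_sample serializes the hidden
# global RandomState it is bound to *by value*, so the unpickled callable
# draws from a private, frozen copy of the generator state.
detached = pickle.loads(pickle.dumps(np.random.random_sample))

np.random.seed(0)
a = np.random.random_sample()  # reseeding affects the real global generator...
np.random.seed(0)
b = detached()                 # ...but not the detached copy

print(a, b)
```

If the two draws differ, the detached copy is ignoring np.random.seed, which matches the identical outputs seen across workers in test1 above.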