
Why does this small snippet hang when using multiprocessing with maxtasksperchild, numpy.random.randint and numpy.random.seed?

I have a Python script that concurrently processes numpy arrays and images in a random way. To have proper randomness inside the spawned processes, I pass a random seed from the main process to the workers so that they can seed themselves.
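Roughly, the seeding pattern looks like this (a simplified sketch, not the actual application code):

# Simplified sketch of the pattern described above (not the real script):
# the parent draws one seed per task and passes it to the worker, which
# seeds its own process before doing the actual work.
from multiprocessing import Pool

import numpy as np

def seeded_worker(seed):
    np.random.seed(seed)              # seed this worker process
    return np.random.random_sample()  # stand-in for the real processing

if __name__ == '__main__':
    pool = Pool(4)
    seeds = [np.random.randint(2 ** 31) for _ in range(8)]
    print(pool.map(seeded_worker, seeds))  # each task gets its own seed
    pool.close()
    pool.join()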

When I use maxtasksperchild for the Pool, my script hangs after running Pool.map a number of times.

The following is a minimal snippet that reproduces the problem:

# This code stops after multiprocessing.Pool workers are replaced one single time.
# They are replaced due to maxtasksperchild parameter to Pool
from multiprocessing import Pool
import numpy as np

def worker(n):
    # Removing np.random.seed solves the issue
    np.random.seed(1) #any seed value
    return 1234 # trivial return value

# Removing maxtasksperchild solves the issue
# Removing maxtasksperchild solves the issue
ppool = Pool(20, maxtasksperchild=5)
i = 0
while True:
    i += 1
    # Removing np.random.randint(10) or taking it out of the loop solves the issue
    rand = np.random.randint(10)
    l = [3]  # trivial input to ppool.map
    result = ppool.map(worker, l)
    print i, result[0]

This is the output:

1 1234
2 1234
3 1234
.
.
.
99 1234
100 1234 # at this point workers should've reached maxtasksperchild tasks
101 1234
102 1234
103 1234
104 1234
105 1234
106 1234
107 1234
108 1234
109 1234
110 1234

Then it hangs indefinitely.

I could potentially replace numpy.random with Python's random and sidestep the problem. However, in my actual application the worker executes user code (given as an argument to the worker) which I have no control over, and I would like to allow that user code to use numpy.random functions. So I intentionally want to seed the global random generator (for each process independently).

This was tested with Python 2.7.10 and numpy 1.11.0, 1.12.0 and 1.13.0, on Ubuntu and OSX.

It turns out this comes from a buggy interaction between Python's threading.Lock and multiprocessing.

np.random.seed and most np.random.* functions use a threading.Lock to ensure thread-safety. A np.random.* function generates a random number and then updates the seed state (shared across threads), which is why a lock is needed. See np.random.seed and cont0_array (used by np.random.random() and others).
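Schematically, the pattern being protected looks like this (a toy illustration only, not numpy's actual implementation):

# Toy illustration, not numpy's code: each draw is a read-modify-write
# on shared generator state, so the whole step must hold a lock to be
# thread-safe; seeding touches the same state, so it locks too.
import threading

_lock = threading.Lock()
_state = 12345  # stand-in for the generator's internal state

def toy_random():
    global _state
    with _lock:
        _state = (1103515245 * _state + 12345) % (2 ** 31)  # toy LCG step
        return _state / float(2 ** 31)

def toy_seed(seed):
    global _state
    with _lock:
        _state = seed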

Now how does this cause a problem in the above snippet?

In a nutshell, the snippet hangs because the threading.Lock state is inherited when forking. So when a child is forked at the same moment the lock is held in the parent (acquired by np.random.randint(10)), the child deadlocks (at np.random.seed).
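The same deadlock can be reproduced with a plain threading.Lock and os.fork, no numpy involved (a minimal sketch; Unix-only, since it calls os.fork directly):

# Minimal sketch of the failure mode (Unix-only): fork while another
# thread holds a lock; the child inherits the lock in its locked state,
# but the owning thread does not exist in the child, so nobody can
# ever release it there.
import os
import sys
import threading
import time

lock = threading.Lock()

def hold_lock():
    with lock:
        time.sleep(5)  # keep the lock held while the main thread forks

threading.Thread(target=hold_lock).start()
time.sleep(0.1)        # make sure the lock is acquired before forking

pid = os.fork()
if pid == 0:
    lock.acquire()     # deadlock: the owner thread was not copied by fork
    print("child: never reached")
    sys.exit(0)
os.waitpid(pid, 0)     # the parent now waits forever for the stuck child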

@njsmith explains it in this GitHub issue: https://github.com/numpy/numpy/issues/9248#issuecomment-308054786

multiprocessing.Pool spawns a background thread to manage workers: https://github.com/python/cpython/blob/aefa7ebf0ff0f73feee7ab24f4cdcb2014d83ee5/Lib/multiprocessing/pool.py#L170-L173

It loops in the background calling _maintain_pool: https://github.com/python/cpython/blob/aefa7ebf0ff0f73feee7ab24f4cdcb2014d83ee5/Lib/multiprocessing/pool.py#L366

If a worker exits, for example due to a maxtasksperchild limit, then _maintain_pool calls _repopulate_pool: https://github.com/python/cpython/blob/aefa7ebf0ff0f73feee7ab24f4cdcb2014d83ee5/Lib/multiprocessing/pool.py#L240

And then _repopulate_pool forks some new workers, still in this background thread: https://github.com/python/cpython/blob/aefa7ebf0ff0f73feee7ab24f4cdcb2014d83ee5/Lib/multiprocessing/pool.py#L224

So what's happening is that eventually you get unlucky, and at the same moment that your main thread is calling some np.random function and holding the lock, multiprocessing decides to fork a child, which starts out with the np.random lock already held, but the thread that was holding it is gone. Then the child tries to call into np.random, which requires taking the lock, and so the child deadlocks.

The simple workaround here is to not use fork with multiprocessing. If you use the spawn or forkserver start methods then this should go away.
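On Python 3.4+ the start method can be selected per-pool with multiprocessing.get_context (a sketch of the workaround applied to the snippet from the question; on Python 2.7 there is no such option on Unix):

# Sketch of the workaround (Python 3.4+): 'spawn' or 'forkserver' starts
# each worker as a fresh interpreter, so no lock state is inherited and
# the snippet from the question no longer hangs.
import multiprocessing as mp

import numpy as np

def worker(n):
    np.random.seed(1)  # safe now: the child did not inherit a held lock
    return 1234

if __name__ == '__main__':
    ctx = mp.get_context('spawn')  # or 'forkserver'
    ppool = ctx.Pool(20, maxtasksperchild=5)
    for i in range(200):
        rand = np.random.randint(10)  # same global np.random call as in the question
        print(i, ppool.map(worker, [3])[0])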

For a proper fix.... ughhh. I guess we need to register a pthread_atfork pre-fork handler that takes the np.random lock before fork and then releases it afterwards? And really I guess we need to do this for every lock in numpy, which requires something like keeping a weakset of every RandomState object, and _FFTCache also appears to have a lock...

(On the plus side, this would also give us an opportunity to reinitialize the global random state in the child, which we really should be doing in cases where the user hasn't explicitly seeded it.)
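The handler described in that quote can be sketched in Python with os.register_at_fork (added in Python 3.7), here for a single stand-in lock:

# Sketch of the proposed fix for one stand-in lock, using Python 3.7+'s
# os.register_at_fork: take the lock just before fork and release it on
# both sides afterwards, so a child can never inherit it in a locked state.
import os
import threading

some_lock = threading.Lock()  # stand-in for numpy's internal lock

os.register_at_fork(
    before=some_lock.acquire,           # runs in the parent, pre-fork
    after_in_parent=some_lock.release,  # the parent releases its lock
    after_in_child=some_lock.release,   # the child releases its inherited copy
)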

Using numpy.random.seed is not thread safe. numpy.random.seed changes the value of the seed globally, while - as far as I understand - you are trying to change the seed locally.

See the docs.

If indeed what you are trying to achieve is having the generator seeded at the start of each worker, the following is a solution:

def worker(n):
    # use a local RandomState instead of re-seeding the global generator
    randgen = np.random.RandomState(45678)  # RandomState, not seed!
    # ...do something with randgen...
    return 1234  # trivial return value
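If the user code inside the worker insists on calling the global np.random functions (as in the question), each worker process can instead be seeded once at startup through the Pool's initializer argument. A sketch follows (the per-process seed derivation via the pid is an arbitrary choice of mine); note that under the fork start method this still calls np.random.seed inside a forked child, so it addresses the seeding pattern but not, by itself, the fork/lock race described in the other answer:

# Sketch: seed each worker once when the pool starts it, using the Pool
# initializer. The pid offset is an arbitrary way to keep the workers'
# streams from being identical.
import os
from multiprocessing import Pool

import numpy as np

def init_worker(base_seed):
    np.random.seed((base_seed + os.getpid()) % (2 ** 32))

def worker(n):
    return np.random.randint(10)  # user code may call np.random.* freely

if __name__ == '__main__':
    pool = Pool(4, initializer=init_worker, initargs=(12345,))
    print(pool.map(worker, range(8)))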

Making this a full-fledged answer since it doesn't fit in a comment.

After playing around a bit, something here smells like a numpy.random bug. I was able to reproduce the freezing bug, and in addition there were some other weird things that shouldn't happen, like manually seeding the generator not working.

import multiprocessing

import numpy as np

def rand_seed(rand, i):
    print(i)
    np.random.seed(i)
    print(i)
    print(rand())

def test1():
    with multiprocessing.Pool() as pool:
        [pool.apply_async(rand_seed, (np.random.random_sample, i)).get()
         for i in range(5)]

test1()

has output:

0
0
0.3205032737431185
1
1
0.3205032737431185
2
2
0.3205032737431185
3
3
0.3205032737431185
4
4
0.3205032737431185

On the other hand, not passing np.random.random_sample as an argument works just fine.

def rand_seed2(i):
    print(i)
    np.random.seed(i)
    print(i)
    print(np.random.random_sample())

def test2():
    with multiprocessing.Pool() as pool:
        [pool.apply_async(rand_seed2, (i,)).get()
         for i in range(5)]

test2()

has output:

0
0
0.5488135039273248
1
1
0.417022004702574
2
2
0.43599490214200376
3
3
0.5507979025745755
4
4
0.9670298390136767

This suggests some serious tomfoolery is going on behind the curtains. Not sure what else to say about it though...

Basically it seems like numpy.random.seed modifies not only the "seed state" variable, but the random_sample function itself.
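One plausible reading (my assumption, not verified against numpy's internals): np.random.random_sample is a bound method of the hidden global RandomState, and pickling it - which multiprocessing does to every argument - copies that RandomState's state, so the unpickled function in the worker draws from a detached copy that np.random.seed can no longer reach. This can be checked without multiprocessing:

# Sketch: pickle the bound method the way multiprocessing would. If the
# unpickled copy ignores np.random.seed, the "modified function" is
# really a detached copy of the global RandomState.
import pickle

import numpy as np

f = pickle.loads(pickle.dumps(np.random.random_sample))
np.random.seed(0)
print(np.random.random_sample())  # affected by the seed
print(f())                        # drawn from the detached copy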
