
numpy.random Seed in multiprocessing

I am distributing a random process across several worker processes, so I use numpy.random.RandomState to seed the random numbers. The problem is that I have to call another numpy.random function inside my wrapper. Now I lose the reproducibility the seed should give me, because I can't control the order of the function calls.

A short version of this problem would be:

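import multiprocessing

import numpy as np

# One module-level generator; every worker ends up with its own copy of it.
RDS = np.random.RandomState(0)

def function(N):
    # Draws from whichever copy of RDS lives in the current worker process.
    return RDS.choice(range(N))

def wrapper(ic):
    return ic, function(ic)

if __name__ == "__main__":
    # Build the 30 input tuples from the seeded generator in the parent.
    inputlist = []
    for i in range(30):
        inputlist.append((RDS.randint(1, 100),))

    pool = multiprocessing.Pool(4)

    # Which worker handles which task is not fixed by the seed.
    solutions_list = pool.starmap(wrapper, inputlist)

    pool.close()
    pool.join()

    print(solutions_list)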

I cannot run function(ic) outside of wrapper, because in my real code it depends on results computed inside the wrapper.

Is there another way to set the seed properly?

Setting the seed differently isn't going to solve your reproducibility problem. (It would solve another problem we'll get to later, but not the reproducibility problem.) Your reproducibility issue comes from the nondeterministic assignment of tasks to workers, which is not controlled by any random seed.

To solve the reproducibility issue, you need to assign tasks deterministically. One way to do that would be to abandon the process pool and assign jobs to processes manually.
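As a rough illustration of manual assignment, the sketch below replaces the pool with explicit multiprocessing.Process objects. The helper names (run_chunk, worker, split_round_robin), the round-robin split, and the seeding rule BASE_SEED + 1 + worker_id are all assumptions made for this example, not something taken from the question. Each worker receives a fixed list of tasks and a fixed seed, so its draws no longer depend on scheduling.

```python
import multiprocessing as mp

import numpy as np

BASE_SEED = 0    # assumption: any fixed integer works
N_WORKERS = 4


def function(rng, n):
    # Same draw as in the question, but on an explicitly passed generator.
    return int(rng.choice(n))


def run_chunk(worker_id, tasks):
    # Hypothetical helper: process a fixed task list with a generator whose
    # seed depends only on the worker id (illustrative seeding rule).
    rng = np.random.RandomState(BASE_SEED + 1 + worker_id)
    return [(ic, function(rng, ic)) for ic in tasks]


def worker(worker_id, tasks, queue):
    # Tag the result with the worker id so the parent can restore the order.
    queue.put((worker_id, run_chunk(worker_id, tasks)))


def split_round_robin(items, n):
    # Deterministic assignment: worker i gets items i, i + n, i + 2n, ...
    return [items[i::n] for i in range(n)]


if __name__ == "__main__":
    rds = np.random.RandomState(BASE_SEED)
    inputs = [int(rds.randint(1, 100)) for _ in range(30)]
    chunks = split_round_robin(inputs, N_WORKERS)

    queue = mp.Queue()
    procs = [mp.Process(target=worker, args=(i, chunk, queue))
             for i, chunk in enumerate(chunks)]
    for p in procs:
        p.start()
    # Collect one result per worker before joining; arrival order is arbitrary.
    results = dict(queue.get() for _ in procs)
    for p in procs:
        p.join()

    # Reassemble by worker id so the printed order is the same on every run.
    print([results[i] for i in range(N_WORKERS)])
```

Results come back grouped by worker rather than in input order; if the original order matters, the input index can be carried along with each task.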

The other problem is that your workers all inherit the same random seed. (They don't share the same RDS object, since this isn't threading, but their copies of RDS are initialized identically.) This can lead to them producing identical or highly correlated output, ruining your results. To fix this, each worker should reseed RDS with a distinct seed on startup.
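One possible way to reseed on startup while keeping the pool is the initializer argument of multiprocessing.Pool, sketched below. The init_worker function, the seed queue, and the seed values 1 to 4 are assumptions chosen for illustration. Each worker takes one unused seed from the queue when it starts and builds its own RDS from it.

```python
import multiprocessing as mp

import numpy as np

RDS = None  # per-process generator, created by init_worker


def init_worker(seed_queue):
    # Runs once in every worker process: take one unused seed from the
    # queue and build a private generator from it.
    global RDS
    RDS = np.random.RandomState(seed_queue.get())


def function(n):
    return int(RDS.choice(n))


def wrapper(ic):
    return ic, function(ic)


if __name__ == "__main__":
    n_workers = 4

    # Hypothetical seeds: one distinct integer per worker.
    seed_queue = mp.Queue()
    for seed in range(1, n_workers + 1):
        seed_queue.put(seed)

    parent_rng = np.random.RandomState(0)
    inputlist = [(int(parent_rng.randint(1, 100)),) for _ in range(30)]

    with mp.Pool(n_workers, initializer=init_worker,
                 initargs=(seed_queue,)) as pool:
        solutions_list = pool.starmap(wrapper, inputlist)

    print(solutions_list)
```

This only removes the identical-state problem. Which task lands on which worker is still decided by the pool, so the output of this sketch is not expected to be reproducible from run to run.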
