
Does scipy.stats produce different random numbers for different computer hardware?

I'm having a problem where I'm getting different random numbers across different computers despite

  • scipy.__version__ == '1.2.1' on all computers
  • numpy.__version__ == '1.15.4' on all computers
  • the random_state seed being fixed to the same number (42) in every function call that generates random numbers, for reproducible results

The code is a bit too complex to post in full here, but I noticed results start to diverge specifically when sampling from a multivariate normal:

import numpy as np
from scipy import stats
seed = 42
n_sim = 1000000
d = corr_mat.shape[0] # corr_mat is a 15x15 correlation matrix, numpy.ndarray
# results diverge from here across different hardware
z = stats.multivariate_normal(mean=np.zeros(d), cov=corr_mat).rvs(n_sim, random_state=seed)

corr_mat is a correlation matrix (see Appendix below) and is the same across all computers.
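
To check where the outputs diverge, we compare a hash of the raw sample bytes on each machine; here is a minimal sketch of that check, using a small stand-in covariance for brevity (substitute corr_mat in practice):

import hashlib
import numpy as np
from scipy import stats

# Stand-in 2x2 covariance for brevity; substitute corr_mat in practice.
cov = np.array([[1.0, 0.25], [0.25, 1.0]])
z = stats.multivariate_normal(mean=np.zeros(2), cov=cov).rvs(1000000, random_state=42)

# Identical bits on both machines give identical digests,
# so any divergence shows up immediately.
print(hashlib.sha256(np.ascontiguousarray(z).tobytes()).hexdigest())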

The two computers we are testing on are:

Computer 1

  • OS: Windows 7
  • Processor: Intel(R) Xeon(R) CPU E5-2623 v4 @ 2.60 GHz (2 processors)
  • RAM: 64 GB
  • System type: 64-bit

Computer 2

  • OS: Windows 7
  • Processor: Intel(R) Xeon(R) CPU E5-2660 v3 @ 2.10 GHz (2 processors)
  • RAM: 64 GB
  • System type: 64-bit

Appendix

>>> corr_mat
array([[1.  , 0.15, 0.25, 0.25, 0.25, 0.25, 0.1 , 0.1 , 0.1 , 0.25, 0.25,
        0.25, 0.1 , 0.1 , 0.1 ],
       [0.15, 1.  , 0.  , 0.  , 0.  , 0.  , 0.15, 0.05, 0.15, 0.15, 0.15,
        0.  , 0.15, 0.15, 0.15],
       [0.25, 0.  , 1.  , 0.25, 0.25, 0.25, 0.2 , 0.  , 0.2 , 0.2 , 0.2 ,
        0.25, 0.2 , 0.2 , 0.2 ],
       [0.25, 0.  , 0.25, 1.  , 0.25, 0.25, 0.2 , 0.  , 0.2 , 0.2 , 0.2 ,
        0.25, 0.2 , 0.2 , 0.2 ],
       [0.25, 0.  , 0.25, 0.25, 1.  , 0.25, 0.2 , 0.  , 0.2 , 0.2 , 0.2 ,
        0.25, 0.2 , 0.2 , 0.2 ],
       [0.25, 0.  , 0.25, 0.25, 0.25, 1.  , 0.2 , 0.  , 0.2 , 0.2 , 0.2 ,
        0.25, 0.2 , 0.2 , 0.2 ],
       [0.1 , 0.15, 0.2 , 0.2 , 0.2 , 0.2 , 1.  , 0.15, 0.25, 0.25, 0.25,
        0.2 , 0.25, 0.25, 0.25],
       [0.1 , 0.05, 0.  , 0.  , 0.  , 0.  , 0.15, 1.  , 0.15, 0.15, 0.15,
        0.  , 0.15, 0.15, 0.15],
       [0.1 , 0.15, 0.2 , 0.2 , 0.2 , 0.2 , 0.25, 0.15, 1.  , 0.25, 0.25,
        0.2 , 0.25, 0.25, 0.25],
       [0.25, 0.15, 0.2 , 0.2 , 0.2 , 0.2 , 0.25, 0.15, 0.25, 1.  , 0.25,
        0.2 , 0.25, 0.25, 0.25],
       [0.25, 0.15, 0.2 , 0.2 , 0.2 , 0.2 , 0.25, 0.15, 0.25, 0.25, 1.  ,
        0.2 , 0.25, 0.25, 0.25],
       [0.25, 0.  , 0.25, 0.25, 0.25, 0.25, 0.2 , 0.  , 0.2 , 0.2 , 0.2 ,
        1.  , 0.2 , 0.2 , 0.2 ],
       [0.1 , 0.15, 0.2 , 0.2 , 0.2 , 0.2 , 0.25, 0.15, 0.25, 0.25, 0.25,
        0.2 , 1.  , 0.25, 0.25],
       [0.1 , 0.15, 0.2 , 0.2 , 0.2 , 0.2 , 0.25, 0.15, 0.25, 0.25, 0.25,
        0.2 , 0.25, 1.  , 0.25],
       [0.1 , 0.15, 0.2 , 0.2 , 0.2 , 0.2 , 0.25, 0.15, 0.25, 0.25, 0.25,
        0.2 , 0.25, 0.25, 1.  ]])

The following is an educated guess which I cannot validate since I don't have multiple machines.

Sampling from a correlated multivariate normal is typically done by sampling from an uncorrelated standard normal and then multiplying by a "square root" of the covariance matrix. With the seed set at 42 and your covariance matrix, I get a sample fairly similar to the one scipy produces if I instead use identity(15) for the covariance and then multiply by l*sqrt(d), where l,d,r = np.linalg.svd(covariance).

SVD is, I suppose, complex enough to explain small differences between platforms.

How can this snowball into something significant?

I think your choice of covariance matrix is to blame, since it has non-unique eigenvalues. As a consequence, the SVD is not unique: the eigenspace of a repeated eigenvalue can be rotated. This has the potential to hugely amplify a small numerical difference.
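
A quick way to check that claim, assuming corr_mat is defined as in the Appendix above (this snippet is an illustration, not part of the original tests):

import numpy as np

# corr_mat: the 15x15 correlation matrix from the Appendix.
eigvals = np.linalg.eigvalsh(corr_mat)
# Round away floating-point noise, then count multiplicities.
vals, counts = np.unique(np.round(eigvals, decimals=8), return_counts=True)
print(vals[counts > 1])  # non-empty output means repeated eigenvalues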

It would be interesting to see whether the differences you see persist if you test with a different covariance matrix that has unique eigenvalues.
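
One way to build such a test matrix, guaranteed to have distinct eigenvalues, is to rotate a diagonal of distinct values by a fixed orthogonal matrix; this construction is a suggestion of mine, not something from the original answer:

import numpy as np
from scipy import stats

# Build Q diag(e) Q^T with 15 distinct eigenvalues and a fixed orthogonal Q.
rng = np.random.RandomState(0)
q, _ = np.linalg.qr(rng.standard_normal((15, 15)))
eigvals = np.linspace(0.5, 2.0, 15)  # strictly increasing, hence all distinct
cov = (q * eigvals) @ q.T

# Ideally compute cov once and share it (e.g. via np.save) so both machines
# sample from bit-identical inputs, then compare the outputs.
z = stats.multivariate_normal(mean=np.zeros(15), cov=cov).rvs(1000, random_state=42)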

Edit:

For reference, here is what I tried for your smaller (6D) example:

>>> import numpy as np
>>> from scipy import stats
>>> cm6 = np.array([[1,.5,.15,.15,0,0], [.5,1,.15,.15,0,0],[.15,.15,1,.25,0,0],[.15,.15,.25,1,0,0],[0,0,0,0,1,.1],[0,0,0,0,.1,1]])
>>> ls6,ds6,rs6 = np.linalg.svd(cm6)
>>> np.random.seed(42)
>>> cs6 = stats.multivariate_normal(cov=cm6).rvs()
>>> np.random.seed(42)
>>> is6 = stats.multivariate_normal(cov=np.identity(6)).rvs()
>>> LS6 = ls6*np.sqrt(ds6)
>>> np.allclose(cs6, LS6@is6)
True

As you report that the problem persists even with unique eigenvalues, here is one more possibility. Above I used svd to compute the eigenvectors/values, which is OK since cov is symmetric. What happens if we use eigh instead?

>>> de6,le6 = np.linalg.eigh(cm6)
>>> LE6 = le6*np.sqrt(de6)
>>> cs6
array([-0.00364915, -0.23778611, -0.50111166, -0.7878898 , -0.91913994,
        1.12421904])
>>> LE6@is6
array([ 0.54338614,  1.04010029, -0.71379193, -0.88313042, -0.60813547,
        0.26082989])

These are different. Why? First, eigh orders the eigenspaces the other way round:

>>> ds6
array([1.7 , 1.1 , 1.05, 0.9 , 0.75, 0.5 ])
>>> de6
array([0.5 , 0.75, 0.9 , 1.05, 1.1 , 1.7 ])

Does that fix it? Almost.

>>> LE6[:, ::-1]@is6
array([-0.00364915, -0.23778611, -0.50111166, -0.7878898 , -1.12421904,
        0.91913994])

We see that the last two components are swapped and their signs flipped. It turns out this is due to the sign of one eigenvector being inverted.

So even with unique eigenvalues we can get large differences, because of ambiguities in (1) the ordering of eigenspaces and (2) the signs of eigenvectors.
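
If reproducibility across machines matters more than matching scipy's internals, one option is to build the "square root" yourself with a deterministic convention. A sketch of that idea (my illustration, with the caveat that exactly repeated eigenvalues remain ambiguous) pins down both the ordering and the signs:

import numpy as np

def stable_factor(cov):
    # eigh returns eigenvalues in ascending order, removing ambiguity (1).
    d, l = np.linalg.eigh(cov)
    # Remove ambiguity (2): flip each eigenvector so that its
    # largest-magnitude entry is positive (a common sign convention).
    idx = np.argmax(np.abs(l), axis=0)
    l = l * np.sign(l[idx, np.arange(l.shape[1])])
    # Clip tiny negative eigenvalues from rounding before the square root.
    return l * np.sqrt(np.clip(d, 0.0, None))

# Usage sketch:
# z = stable_factor(cov) @ np.random.RandomState(42).standard_normal(cov.shape[0])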
