
How better perform Pearson R from 2 arrays of dimensions (m, n) and (n), returning an array of (m) size? [Python, NumPy, SciPy]

I'm trying to improve a simple algorithm for obtaining the Pearson correlation coefficient from two arrays, X(m, n) and Y(n), returning another array R of dimension (m).
In this case, I want to know how each row of X behaves with respect to the values of Y. A sample (working) code is presented below:

import numpy as np
from scipy.stats import pearsonr

np.random.seed(1)
m, n = 10, 5

x = 100*np.random.rand(m, n)
y = 2 + 2*x.mean(0)
r = np.empty(m)

for i in range(m):
    r[i] = pearsonr(x[i], y)[0]

For this particular case, I get: r = array([0.95272843, -0.69134753, 0.36419159, 0.27467137, 0.76887201, 0.08823868, -0.72608421, -0.01224453, 0.58375626, 0.87442889])

For small values of m (near 10k) this runs pretty fast, but I'm starting to work with m ~ 30k, so this is taking much longer than I expected. I'm aware I could implement multiprocessing/multithreading, but I believe there's a better, more pythonic way of doing this.

I tried to use pearsonr(x, np.ones((m, n))*y), but it returns only (nan, nan).

pearsonr only supports 1D arrays internally. Moreover, it computes the p-value, which is not used here; it would be more efficient not to compute it if possible. Additionally, the code recomputes the statistics of the y vector on every iteration and does not make efficient use of vectorized NumPy operations. This is why the computation is a bit slow. You can check this in the Scipy source code.
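Before writing a custom routine, one off-the-shelf vectorized option worth noting (a sketch, not part of the original answer) is np.corrcoef, which computes all pairwise correlations in a single call:

```python
import numpy as np
from scipy.stats import pearsonr

# Same data as in the question
np.random.seed(1)
m, n = 10, 5
x = 100 * np.random.rand(m, n)
y = 2 + 2 * x.mean(0)

# np.corrcoef treats each row as a variable; passing y as the second
# argument appends it as the last row, so the last row of the resulting
# (m+1, m+1) matrix holds the correlation of y with every row of x.
r = np.corrcoef(x, y)[-1, :-1]

# Matches the per-row pearsonr loop
r_ref = np.array([pearsonr(x[i], y)[0] for i in range(m)])
assert np.allclose(r, r_ref)
```

Note, however, that np.corrcoef builds the full (m+1) x (m+1) correlation matrix, so its memory cost grows quadratically in m and it becomes impractical around m ~ 30k; a dedicated implementation avoids that.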

One way to speed this up is to write your own custom vectorized implementation based on the one in Scipy:

def multi_pearsonr(x, y):
    # Center each row of x and the vector y
    xmean = x.mean(axis=1)
    ymean = y.mean()
    xm = x - xmean[:, None]
    ym = y - ymean
    # Euclidean norms of the centered vectors
    normxm = np.linalg.norm(xm, axis=1)
    normym = np.linalg.norm(ym)
    # Dot product of the unit-normalized vectors is the Pearson r;
    # clip to guard against rounding slightly outside [-1, 1]
    return np.clip(np.dot(xm / normxm[:, None], ym / normym), -1.0, 1.0)

It is about 450 times faster on my machine for m = 10_000.
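As a quick sanity check (a sketch on the question's data, not part of the original answer), the vectorized version can be compared against the pearsonr loop from the question:

```python
import numpy as np
from scipy.stats import pearsonr

def multi_pearsonr(x, y):
    # Vectorized Pearson r of each row of x against y (from the answer above)
    xm = x - x.mean(axis=1)[:, None]
    ym = y - y.mean()
    normxm = np.linalg.norm(xm, axis=1)
    normym = np.linalg.norm(ym)
    return np.clip(np.dot(xm / normxm[:, None], ym / normym), -1.0, 1.0)

# Same data as in the question
np.random.seed(1)
m, n = 10, 5
x = 100 * np.random.rand(m, n)
y = 2 + 2 * x.mean(0)

r_loop = np.array([pearsonr(x[i], y)[0] for i in range(m)])
r_vec = multi_pearsonr(x, y)
assert np.allclose(r_loop, r_vec)
```

Both approaches should agree to floating-point precision on any well-conditioned input (i.e. no constant rows, where the norm would be zero).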

Note that I did not keep the checks from the Scipy code, but it may be a good idea to keep them if your input is not guaranteed to be statistically safe (i.e. well formed for the computation of the Pearson test).
