
How better perform Pearson R from 2 arrays of dimensions (m, n) and (n), returning an array of (m) size? [Python, NumPy, SciPy]

I'm trying to improve a simple algorithm for obtaining the Pearson correlation coefficient from two arrays, X(m, n) and Y(n), returning another array R of dimension (m).
In this case, I want to know how each row of X behaves with respect to the values of Y. A sample (working) code is presented below:

import numpy as np
from scipy.stats import pearsonr

np.random.seed(1)
m, n = 10, 5

x = 100*np.random.rand(m, n)
y = 2 + 2*x.mean(0)
r = np.empty(m)

for i in range(m):
    r[i] = pearsonr(x[i], y)[0]

For this particular case, I get: r = array([0.95272843, -0.69134753, 0.36419159, 0.27467137, 0.76887201, 0.08823868, -0.72608421, -0.01224453, 0.58375626, 0.87442889])

For small values of m (near 10k) this runs pretty fast, but I'm starting to work with m ~ 30k, so this is taking much longer than I expected. I'm aware I could implement multiprocessing/multithreading, but I believe there's a better, more pythonic way of doing this.

I tried to use pearsonr(x, np.ones((m, n))*y), but it returns only (nan, nan).

pearsonr only supports 1D arrays internally. Moreover, it computes the p-value, which is not used here; it would be more efficient not to compute it if possible. Additionally, the code recomputes the statistics of the y vector on every iteration and does not make efficient use of vectorized NumPy operations. This is why the computation is a bit slow. You can check this in the Scipy source code.
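Before writing a custom routine, one off-the-shelf vectorized option worth noting (a sketch, not part of the original answer) is np.corrcoef, which computes all pairwise correlations in a single call:

```python
import numpy as np
from scipy.stats import pearsonr

# Same data as in the question
np.random.seed(1)
m, n = 10, 5
x = 100 * np.random.rand(m, n)
y = 2 + 2 * x.mean(0)

# np.corrcoef treats each row as a variable; passing y as the second
# argument appends it as the last row, so the last row of the resulting
# (m+1, m+1) matrix holds the correlation of y with every row of x.
r = np.corrcoef(x, y)[-1, :-1]

# Matches the per-row pearsonr loop
r_ref = np.array([pearsonr(x[i], y)[0] for i in range(m)])
assert np.allclose(r, r_ref)
```

Note, however, that np.corrcoef builds the full (m+1) x (m+1) correlation matrix, so its memory cost grows quadratically in m and it becomes impractical around m ~ 30k; a dedicated implementation avoids that.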

One way to speed this up is to write your own custom vectorized implementation based on the one in Scipy:

def multi_pearsonr(x, y):
    # Center each row of x and the vector y
    xmean = x.mean(axis=1)
    ymean = y.mean()
    xm = x - xmean[:, None]
    ym = y - ymean
    # Euclidean norms of the centered vectors
    normxm = np.linalg.norm(xm, axis=1)
    normym = np.linalg.norm(ym)
    # Dot product of the unit-normalized vectors is the Pearson r;
    # clip to guard against rounding slightly outside [-1, 1]
    return np.clip(np.dot(xm / normxm[:, None], ym / normym), -1.0, 1.0)

It is about 450 times faster on my machine for m = 10_000.
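As a quick sanity check (a sketch on the question's data, not part of the original answer), the vectorized version can be compared against the pearsonr loop from the question:

```python
import numpy as np
from scipy.stats import pearsonr

def multi_pearsonr(x, y):
    # Vectorized Pearson r of each row of x against y (from the answer above)
    xm = x - x.mean(axis=1)[:, None]
    ym = y - y.mean()
    normxm = np.linalg.norm(xm, axis=1)
    normym = np.linalg.norm(ym)
    return np.clip(np.dot(xm / normxm[:, None], ym / normym), -1.0, 1.0)

# Same data as in the question
np.random.seed(1)
m, n = 10, 5
x = 100 * np.random.rand(m, n)
y = 2 + 2 * x.mean(0)

r_loop = np.array([pearsonr(x[i], y)[0] for i in range(m)])
r_vec = multi_pearsonr(x, y)
assert np.allclose(r_loop, r_vec)
```

Both approaches should agree to floating-point precision on any well-conditioned input (i.e. no constant rows, where the norm would be zero).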

Note that I did not keep the checks from the Scipy code, but it may be a good idea to keep them if your input is not guaranteed to be statistically safe (i.e. well formed for the computation of the Pearson test).
