Fast spearman correlation between two pandas dataframes

I want to apply spearman correlation to two pandas dataframes with the same number of columns (correlation of each pair of rows).

My objective is to compute the distribution of spearman correlations between each pair of rows (r, s) where r is a row from the first dataframe and s is a row from the second dataframe.

I am aware that similar questions have been answered before (see this). However, this question differs because I want to compare each row of the first dataframe with ALL the rows in the second. Additionally, this is computationally intensive and takes hours due to the size of my data. I want to parallelize it and possibly rewrite it in order to speed it up.

I tried with numba but unfortunately it fails (similar issue to this) because it seems not to recognize scipy's spearmanr. My code is the following:

from scipy.stats import spearmanr

def corr(a, b):
    ''' Brute force: Spearman correlation of every row of a against every row of b '''
    dist = []
    for i in range(a.shape[0]):
        for j in range(b.shape[0]):
            dist += [spearmanr(a.iloc[i, :], b.iloc[j, :])[0]]
    return dist
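For scale, each call to spearmanr handles a single (row, row) pair, so the loop above runs a.shape[0] * b.shape[0] times. A minimal way to exercise it, with df1 and df2 as small stand-ins for the real frames:

import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.random.randn(4, 10))   # stand-in for the first dataframe
df2 = pd.DataFrame(np.random.randn(3, 10))   # stand-in for the second dataframe

dist = corr(df1, df2)   # 4 * 3 = 12 Spearman coefficients
print(len(dist))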

NEW ANSWER

from numba import njit
import pandas as pd
import numpy as np

@njit
def mean1(a):
  ''' Row-wise means (numba-friendly stand-in for a.mean(1)) '''
  n = len(a)
  b = np.empty(n)
  for i in range(n):
    b[i] = a[i].mean()
  return b

@njit
def std1(a):
  ''' Row-wise standard deviations (numba-friendly stand-in for a.std(1)) '''
  n = len(a)
  b = np.empty(n)
  for i in range(n):
    b[i] = a[i].std()
  return b

@njit
def c(a, b):
    ''' Correlation '''
    n, k = a.shape
    m, k = b.shape

    mu_a = mean1(a)
    mu_b = mean1(b)
    sig_a = std1(a)
    sig_b = std1(b)

    out = np.empty((n, m))

    for i in range(n):
        for j in range(m):
            out[i, j] = (a[i] - mu_a[i]) @ (b[j] - mu_b[j]) / k / sig_a[i] / sig_b[j]

    return out

# df_test is the input DataFrame: rank each row, then correlate the ranks
r = df_test.rank(1).values
df_test.T.corr('spearman') == c(r, r)
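The same routine covers the cross-dataframe case from the question: rank each frame row-wise, then hand both rank arrays to c. A minimal sketch, assuming df1 and df2 are the two input frames with the same number of columns:

r1 = df1.rank(1).values   # df1, df2: the two input frames (same column count)
r2 = df2.rank(1).values

# out[i, j] is the Spearman correlation between row i of df1 and row j of df2
out = pd.DataFrame(c(r1, r2), index=df1.index, columns=df2.index)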

OLD ANSWER

Doing a Spearman Rank correlation is simply doing a correlation of the ranks.

Rank

We can leverage argsort to get ranks. Though the argsort of the argsort does get us the ranks, we can limit ourselves to a single sort by assigning into the sorted positions.

def rank(a):
  i, j = np.meshgrid(*map(np.arange, a.shape), indexing='ij')

  s = a.argsort(1)
  out = np.empty_like(s)
  out[i, s] = j  # scatter 0..k-1 into the sorted positions: the inverse of the argsort

  return out
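As a quick check of the "argsort of the argsort" claim, the single-sort version above matches the double argsort (a small sketch):

import numpy as np

a = np.random.randn(3, 8)
assert (rank(a) == a.argsort(1).argsort(1)).all()   # ordinal 0-based ranks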

Correlation

In the case of correlating ranks, the means and standard deviations are all predetermined by the size of the second dimension of the array.
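Concretely, for 0-based ranks 0, 1, ..., k - 1 the mean is (k - 1) / 2 and the standard deviation is ((k - 1) * (k + 1) / 12) ** .5; a quick numeric check:

import numpy as np

k = 10
r = np.arange(k)                                   # ranks of a tie-free row of length k
print(r.mean(), (k - 1) / 2)                       # 4.5 4.5
print(r.std(), ((k - 1) * (k + 1) / 12) ** .5)     # 2.8722... 2.8722...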

You can accomplish this same thing without numba, but I'm assuming you want it.

from numba import njit

@njit
def c(a, b):
  n, k = a.shape
  m, k = b.shape

  mu = (k - 1) / 2
  sig = ((k - 1) * (k + 1) / 12) ** .5

  out = np.empty((n, m))

  a = a - mu
  b = b - mu

  for i in range(n):
    for j in range(m):
      out[i, j] = a[i] @ b[j] / k / sig ** 2

  return out

For posterity, we could avoid the internal loop altogether but this might have memory issues.

@njit
def c1(a, b):
  n, k = a.shape
  m, k = b.shape

  mu = (k - 1) / 2
  sig = ((k - 1) * (k + 1) / 12) ** .5

  a = a - mu
  b = b - mu

  return a @ b.T / k / sig ** 2
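Since the question ultimately asks for the distribution of the correlations rather than the full n x m matrix, one way around the memory issue (not part of the original answer) is to process the first rank array in row blocks and accumulate a histogram; a sketch assuming 0-based ranks as produced by rank() above:

import numpy as np

def corr_histogram(a, b, bins=200, block=1024):
    ''' Histogram of all pairwise Spearman correlations, never holding the full n x m matrix '''
    n, k = a.shape
    mu = (k - 1) / 2
    sig2 = (k - 1) * (k + 1) / 12

    b_centered = (b - mu).T                    # k x m, reused for every block of a
    edges = np.linspace(-1, 1, bins + 1)
    counts = np.zeros(bins, dtype=np.int64)

    for start in range(0, n, block):
        chunk = (a[start:start + block] - mu) @ b_centered / k / sig2   # block x m correlations
        counts += np.histogram(chunk, bins=edges)[0]

    return counts, edges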

Demonstration

np.random.seed([3, 1415])

a = np.random.randn(2, 10)
b = np.random.randn(2, 10)

rank_a = rank(a)
rank_b = rank(b)

c(rank_a, rank_b)

array([[0.32121212, 0.01818182],
       [0.13939394, 0.55151515]])

If you were working with DataFrames:

da = pd.DataFrame(a)
db = pd.DataFrame(b)

pd.DataFrame(c(rank(da.values), rank(db.values)), da.index, db.index)


          0         1
0  0.321212  0.018182
1  0.139394  0.551515

Validation

We can do a quick validation using pandas.DataFrame.corr

pd.DataFrame(a.T).corr('spearman') == c(rank_a, rank_a)

      0     1
0  True  True
1  True  True

Here is a row-based, uncompiled version of scipy.stats.spearmanr that takes roughly 5% of the time on large datasets, with an example showing it produces the same result:

import numpy as np

import pandas as pd


def spearman_row(x, y):

    x = np.asarray(x)
    y = np.asarray(y)

    rx = rankdata_average(x)
    ry = rankdata_average(y)

    # print(rx)
    # print(ry)

    return compute_corr(rx, ry)

def compute_corr(x, y):

    # Thanks to https://github.com/dengemann

    def ss(a, axis):
        return np.sum(a * a, axis=axis)

    x = np.asarray(x)
    y = np.asarray(y)

    mx = x.mean(axis=-1)
    my = y.mean(axis=-1)

    xm, ym = x - mx[..., None], y - my[..., None]

    r_num = np.add.reduce(xm * ym, axis=-1)
    r_den = np.sqrt(ss(xm, axis=-1) * ss(ym, axis=-1))

    with np.errstate(divide='ignore', invalid="ignore"):

        r = r_num / r_den

    return r


def rankdata_average(data):

    """Row-based rankdata using method=mean"""

    dc = np.asarray(data).copy()
    sorter = np.apply_along_axis(np.argsort, 1, data)

    inv = np.empty(data.shape, np.intp)

    ranks = np.tile(np.arange(data.shape[1]), (len(data), 1))

    np.put_along_axis(inv, sorter, ranks, axis=1)

    dc = np.take_along_axis(dc, sorter, 1)

    res = np.apply_along_axis(lambda r: r[1:] != r[:-1], 1, dc)

    obs = np.column_stack([np.ones(len(res), dtype=bool), res])

    dense = np.take_along_axis(np.apply_along_axis(np.cumsum, 1, obs), inv, 1)

    len_r = obs.shape[1]

    nonzero = np.count_nonzero(obs, axis=1)
    obs = pd.DataFrame(obs)
    nonzero = pd.Series(nonzero)
    dense = pd.DataFrame(dense)

    ranks = []
    for _nonzero, nzdf in obs.groupby(nonzero, sort=False):

        nz = np.apply_along_axis(lambda r: np.nonzero(r)[0], 1, nzdf)

        _count = np.column_stack([nz, np.ones(len(nz)) * len_r])
        _dense = dense.reindex(nzdf.index).values

        _result = 0.5 * (np.take_along_axis(_count, _dense, 1) + np.take_along_axis(_count, _dense - 1, 1) + 1)

        result = pd.DataFrame(_result, index=nzdf.index)
        ranks.append(result)

    final = pd.concat(ranks).sort_index()

    return final


if __name__ == "__main__":

    from scipy.stats import rankdata, spearmanr
    from time import time

    np.random.seed(0)

    size = int(1e5), 5
    d1 = np.random.randint(5, size=size)
    d2 = np.random.randint(5, size=size)

    start = time()
    actual = spearman_row(d1, d2)
    end = time()
    print("actual", actual)
    print("rowbased took", end - start)

    start = time()
    expected = []
    for i in range(len(d1)):
        expected.append(spearmanr(d1[i], d2[i]).correlation)
    end = time()
    print("scipy took", end - start)

    expected = np.array(expected)

    print("largest diff", pd.Series(expected - actual).abs().max())

It prints:

rowbased took 3.6308434009552
scipy took 53.552557945251465
largest diff 2.220446049250313e-16 
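One reason the row-based version goes to the trouble of average ranks: with ties (as in the random integer data above), ordinal ranks and scipy's method='average' ranks differ, and so do the resulting coefficients. A quick illustration using scipy's reference rankdata:

import numpy as np
from scipy.stats import rankdata

row = np.array([3, 1, 3, 2])
print(rankdata(row))              # [3.5 1.  3.5 2. ] -> tied values share the average rank
print(row.argsort().argsort())    # [2 0 3 1]         -> ordinal ranks, ties broken arbitrarily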

Pandas has a corr function with support for spearman. It works on columns, so we can just transpose the DataFrame.

We will append df1 to df2 and calculate the correlation by iterating over each row

len_df1 = df1.shape[0]
len_df2 = df2.shape[0]

# Stack df2's rows followed by df1's rows, then transpose so each original row becomes a column
# (DataFrame.append was removed in pandas 2.0; use pd.concat([df2, df1]) there)
df = df2.append(df1).reset_index(drop=True).T

# Row i of df1 is column len_df2 + i after the append; correlate it against all df2 rows
values = {i: [df.iloc[:, list(range(len_df2)) + [len_df2 + i]].corr(method='spearman').values]
          for i in range(len_df1)}
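Equivalently (a sketch, not part of the original snippet), you can let pandas compute the full Spearman matrix once and slice out the df1-versus-df2 block; this avoids the per-row loop at the cost of also computing the within-frame correlations:

import pandas as pd

# Stack the two frames, transpose, and run one big column-wise Spearman pass
combined = pd.concat([df2, df1], ignore_index=True).T
full = combined.corr(method='spearman')

# Rows of the slice correspond to df1's rows, columns to df2's rows
n2 = df2.shape[0]
cross = full.iloc[n2:, :n2]
cross.index, cross.columns = df1.index, df2.index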
