简体   繁体   中英

Fast spearman correlation between two pandas dataframes

I want to apply spearman correlation to two pandas dataframes with the same number of columns (correlation of each pair of rows).

My objective is to compute the distribution of spearman correlations between each pair of rows (r, s) where r is a row from the first dataframe and s is a row from the second dataframe.

I am aware that similar questions have been answered before (see this ). However, this question differs because I want to compare each row of first dataframe with ALL the rows in the second. Additionally, this is computationally intensive and it takes hours due to the size of my data. I want to parallelize it and possibly to rewrite it in order to speed it up.

I tried with numba but unfortunately it fails (similar issue to this ) because it seems to not recognize scipy spearmanr . My code is the following:

def corr(a, b):
    dist = []
    for i in range(a.shape[0]):
        for j in range(b.shape[0]):
            dist += [spearmanr(a.iloc[i, :], b.iloc[j, :])[0]]
    return dist

NEW ANSWER

from numba import njit
import pandas as pd
import numpy as np

@njit
def mean1(a):
  n = len(a)
  b = np.empty(n)
  for i in range(n):
    b[i] = a[i].mean()
  return b

@njit
def std1(a):
  n = len(a)
  b = np.empty(n)
  for i in range(n):
    b[i] = a[i].std()
  return b

@njit
def c(a, b):
    ''' Correlation '''
    n, k = a.shape
    m, k = b.shape

    mu_a = mean1(a)
    mu_b = mean1(b)
    sig_a = std1(a)
    sig_b = std1(b)

    out = np.empty((n, m))

    for i in range(n):
        for j in range(m):
            out[i, j] = (a[i] - mu_a[i]) @ (b[j] - mu_b[j]) / k / sig_a[i] / sig_b[j]

    return out

r = df_test.rank(1).values
df_test.T.corr('spearman') == c(r, r)

OLD ANSWER

Doing a Spearman Rank correlation is simply doing a correlation of the ranks.

Rank

We can leverage argsort to get ranks. Though the argsort of the argsort does get us the ranks, we can limit ourselves to one sort by slice assigning.

def rank(a):
  i, j = np.meshgrid(*map(np.arange, a.shape), indexing='ij')

  s = a.argsort(1)
  out = np.empty_like(s)
  out[i, s] = j

  return out

Correlation

In the case of correlating ranks, the means and standard deviations are all predetermined by the size of the second dimension of the array.

You can accomplish this same thing without numba, but I'm assuming you want it.

from numba import njit

@njit
def c(a, b):
  n, k = a.shape
  m, k = b.shape

  mu = (k - 1) / 2
  sig = ((k - 1) * (k + 1) / 12) ** .5

  out = np.empty((n, m))

  a = a - mu
  b = b - mu

  for i in range(n):
    for j in range(m):
      out[i, j] = a[i] @ b[j] / k / sig ** 2

  return out

For posterity, we could avoid the internal loop altogether but this might have memory issues.

@njit
def c1(a, b):
  n, k = a.shape
  m, k = b.shape

  mu = (k - 1) / 2
  sig = ((k - 1) * (k + 1) / 12) ** .5

  a = a - mu
  b = b - mu

  return a @ b.T / k / sig ** 2

Demonstration

np.random.seed([3, 1415])

a = np.random.randn(2, 10)
b = np.random.randn(2, 10)

rank_a = rank(a)
rank_b = rank(b)

c(rank_a, rank_b)

array([[0.32121212, 0.01818182],
       [0.13939394, 0.55151515]])

If you were working with DataFrame

da = pd.DataFrame(a)
db = pd.DataFrame(b)

pd.DataFrame(c(rank(da.values), rank(db.values)), da.index, db.index)


          0         1
0  0.321212  0.018182
1  0.139394  0.551515

Validation

We can do a quick validation using pandas.DataFrame.corr

pd.DataFrame(a.T).corr('spearman') == c(rank_a, rank_a)

      0     1
0  True  True
1  True  True

Here is a rowbased, uncompiled version of scipy.stats.spearmanr that uses ~ 5% of the time on large datasets with an example showing it produces the same result:

import numpy as np

import pandas as pd


def spearman_row(x, y):

    x = np.asarray(x)
    y = np.asarray(y)

    rx = rankdata_average(x)
    ry = rankdata_average(y)

    # print(rx)
    # print(ry)

    return compute_corr(rx, ry)

def compute_corr(x, y):

    # Thanks to https://github.com/dengemann

    def ss(a, axis):
        return np.sum(a * a, axis=axis)

    x = np.asarray(x)
    y = np.asarray(y)

    mx = x.mean(axis=-1)
    my = y.mean(axis=-1)

    xm, ym = x - mx[..., None], y - my[..., None]

    r_num = np.add.reduce(xm * ym, axis=-1)
    r_den = np.sqrt(ss(xm, axis=-1) * ss(ym, axis=-1))

    with np.errstate(divide='ignore', invalid="ignore"):

        r = r_num / r_den

    return r


def rankdata_average(data):

    """Row-based rankdata using method=mean"""

    dc = np.asarray(data).copy()
    sorter = np.apply_along_axis(np.argsort, 1, data)

    inv = np.empty(data.shape, np.intp)

    ranks = np.tile(np.arange(data.shape[1]), (len(data), 1))

    np.put_along_axis(inv, sorter, ranks, axis=1)

    dc = np.take_along_axis(dc, sorter, 1)

    res = np.apply_along_axis(lambda r: r[1:] != r[:-1], 1, dc)

    obs = np.column_stack([np.ones(len(res), dtype=bool), res])

    dense = np.take_along_axis(np.apply_along_axis(np.cumsum, 1, obs), inv, 1)

    len_r = obs.shape[1]

    nonzero = np.count_nonzero(obs, axis=1)
    obs = pd.DataFrame(obs)
    nonzero = pd.Series(nonzero)
    dense = pd.DataFrame(dense)

    ranks = []
    for _nonzero, nzdf in obs.groupby(nonzero, sort=False):

        nz = np.apply_along_axis(lambda r: np.nonzero(r)[0], 1, nzdf)

        _count = np.column_stack([nz, np.ones(len(nz)) * len_r])
        _dense = dense.reindex(nzdf.index).values

        _result = 0.5 * (np.take_along_axis(_count, _dense, 1) + np.take_along_axis(_count, _dense - 1, 1) + 1)

        result = pd.DataFrame(_result, index=nzdf.index)
        ranks.append(result)

    final = pd.concat(ranks).sort_index()

    return final


if __name__ == "__main__":

    from scipy.stats import rankdata, spearmanr
    from time import time

    np.random.seed(0)

    size = int(1e5), 5
    d1 = np.random.randint(5, size=size)
    d2 = np.random.randint(5, size=size)

    start = time()
    actual = spearman_row(d1, d2)
    end = time()
    print("actual", actual)
    print("rowbased took", end - start)

    start = time()
    expected = []
    for i in range(len(d1)):
        expected.append(spearmanr(d1[i], d2[i]).correlation)
    end = time()
    print("scipy took", end - start)

    expected = np.array(expected)

    print("largest diff", pd.Series(expected - actual).abs().max())

It prints:

rowbased took 3.6308434009552
scipy took 53.552557945251465
largest diff 2.220446049250313e-16 

Pandas has a corr function with the support of spearman . It works on columns, so we can just transpose the dataFrame.

We will append df1 to df2 and calculate the correlation by iterating over each row

len_df1 = df1.shape[0]
df2_index = df2.index.values.tolist()


df = df2.append(df1).reset_index(drop=True).T
values = {i: [df.iloc[:,df2_index+[i]].corr(method='spearman').values] for i in range(len_df1)}

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM