
Pandas expanding/rolling window correlation calculation with p-value

Suppose I have a DataFrame on which I want to calculate rolling or expanding Pearson correlations between two columns:

import numpy as np
import pandas as pd
import scipy.stats as st


df = pd.DataFrame({'x': np.random.rand(10000), 'y': np.random.rand(10000)})

With the built-in pandas functionality this is very fast to calculate:

expanding_corr = df['x'].expanding(50).corr(df['y'])
rolling_corr = df['x'].rolling(50).corr(df['y'])

However, if I wish to get the p-values associated with these correlations, the best I could do was to define a custom rolling function and call apply on a groupby object:

def custom_roll(df, w, **kwargs):
    # Build a strided view containing every length-w window of the values
    v = df.values
    d0, d1 = v.shape
    s0, s1 = v.strides
    a = np.lib.stride_tricks.as_strided(v, (d0 - (w - 1), w, d1), (s0, s0, s1))
    # Stack the windows into a MultiIndex frame keyed by each window's end label
    rolled_df = pd.concat({
        row: pd.DataFrame(values, columns=df.columns)
        for row, values in zip(df.index[(w-1):], a)
    })
    return rolled_df.groupby(level=0, **kwargs)

c_df = custom_roll(df, 50).apply(lambda df: st.pearsonr(df['x'], df['y']))

c_df now contains the appropriate correlations and, importantly, their associated p-values.

However, this method is extremely slow compared to the built-in pandas method, which makes it unsuitable, as in practice I am calculating these correlations thousands of times during an optimisation process. Furthermore, I am unsure how to extend the custom_roll function to work for expanding windows.

Can anyone point me in the direction of leveraging numpy to get the p-values over expanding windows at vectorised speeds?

I could not think of a clever way to do this in pandas using rolling directly, but note that you can calculate the p-value given the correlation coefficient.

Under the null hypothesis, the test statistic derived from Pearson's correlation coefficient, t = corr * sqrt((n - 2) / (1 - corr^2)), follows a Student's t-distribution, and you can get the p-value by plugging it into the CDF defined by the incomplete beta function, scipy.special.betainc. It sounds complicated but can be done in a few lines of code. Below is a function that computes the p-value given the correlation coefficient corr and the sample size n. It is based on the implementation used by scipy.stats.pearsonr, which you have been using.

import pandas as pd
from scipy.special import betainc

def pvalue(corr, n=50):
    # Degrees of freedom for the t-statistic of a Pearson correlation
    df = n - 2
    # t^2 = corr^2 * df / (1 - corr^2), with the denominator factored to limit round-off
    t_squared = corr**2 * (df / ((1.0 - corr) * (1.0 + corr)))
    # Two-sided p-value via the regularized incomplete beta function
    prob = betainc(0.5*df, 0.5, df/(df+t_squared))
    return prob
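
As a quick sanity check (my own sketch, not part of the original answer), the result matches the p-value returned by scipy.stats.pearsonr:

import numpy as np
import scipy.stats as st

x, y = np.random.rand(50), np.random.rand(50)
r, p = st.pearsonr(x, y)
print(np.isclose(pvalue(r, n=50), p))  # expected: True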

You can then apply this function to the correlation values you already have.

rolling_corr = df['x'].rolling(50).corr(df['y'])
pvalue(rolling_corr)

It might not be the perfect vectorised numpy solution but should be tens of times faster than calculating the correlations over and over again.
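
The question also asks about expanding windows. There the sample size grows with each row, so n can be passed as an array instead of a fixed window length. A minimal sketch of my own, assuming x and y contain no missing values so the expanding sample size is simply the row count:

import numpy as np

expanding_corr = df['x'].expanding(50).corr(df['y'])
n_obs = pd.Series(np.arange(1, len(df) + 1), index=df.index)  # expanding sample size per row
expanding_pvalues = pvalue(expanding_corr, n=n_obs)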

Approach #1

corr2_coeff_rowwise shows how to do element-wise correlation between rows. We can strip it down to the case of element-wise correlation between two columns, which gives us a loop that uses corr2_coeff_rowwise. Then we try to vectorize it and see that there are pieces that can be vectorized:

  1. Getting the average values with mean. This can be vectorized with a uniform filter (see the sketch after this list).
  2. Getting the differences between those averages and the sliding elements of the input arrays. To vectorize this, we make use of broadcasting.
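
As a small illustration of step 1 (my own sketch, not from the original answer), the centred slice of uniform_filter reproduces the length-W sliding means that pandas' rolling mean would give:

import numpy as np
import pandas as pd
from scipy.ndimage import uniform_filter

a = np.random.rand(200)
W = 50
sliding_means = uniform_filter(a, W)[W//2:-W//2+1]              # one mean per full window
pandas_means = pd.Series(a).rolling(W).mean().to_numpy()[W-1:]
print(np.allclose(sliding_means, pandas_means))                  # expected: True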

The rest stays the same to get the first of the two outputs from pearsonr.

To get the second output, we go back to its source code. This should be straightforward given the first output, the coefficient.

So, with those in mind, we would end up with something like this -

import numpy as np
import scipy.special as special
from scipy.ndimage import uniform_filter

def sliding_corr1(a, b, W):
    # a, b are input arrays; W is the window length

    # Sliding means via a uniform filter; the centred slice keeps one mean per full window
    am = uniform_filter(a.astype(float), W)
    bm = uniform_filter(b.astype(float), W)

    amc = am[W//2:-W//2+1]
    bmc = bm[W//2:-W//2+1]

    # Broadcast the difference of every element against every window mean
    da = a[:, None] - amc
    db = b[:, None] - bmc

    # Get sliding mask of valid windows: column j keeps rows j..j+W-1
    m, n = da.shape
    mask1 = np.arange(m)[:, None] >= np.arange(n)
    mask2 = np.arange(m)[:, None] < np.arange(n) + W
    mask = mask1 & mask2
    dam = da * mask
    dbm = db * mask

    # Per-window sums of squares and cross products
    ssAs = np.einsum('ij,ij->j', dam, dam)
    ssBs = np.einsum('ij,ij->j', dbm, dbm)
    D = np.einsum('ij,ij->j', dam, dbm)
    coeff = D / np.sqrt(ssAs * ssBs)

    # Two-sided p-value, as in scipy.stats.pearsonr
    ab = W/2 - 1
    pval = 2 * special.btdtr(ab, ab, 0.5 * (1 - np.abs(coeff)))
    return coeff, pval

Thus, to get the final output from the pandas Series inputs -

out = sliding_corr1(df['x'].to_numpy(copy=False),df['y'].to_numpy(copy=False),50)
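
As a quick check (my own sketch, not part of the original answer), the first window's coefficient and p-value agree with a direct scipy.stats.pearsonr call:

import scipy.stats as st

r, p = st.pearsonr(df['x'][:50], df['y'][:50])
print(np.isclose(out[0][0], r), np.isclose(out[1][0], p))  # expected: True True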

Approach #2

Very similar to Approach #1, but we will use numba to improve the memory efficiency by replacing step 2 from the previous approach.

from numba import njit, prange
import math

@njit(parallel=True)
def sliding_corr2_coeff(a, b, W, amc, bmc):
    # One correlation coefficient per full-length window
    L = len(a) - W + 1
    out00 = np.empty(L)
    for i in prange(L):
        out_a = 0.0
        out_b = 0.0
        out_D = 0.0
        for j in range(W):
            d_a = a[i+j] - amc[i]
            d_b = b[i+j] - bmc[i]
            out_D += d_a*d_b
            out_a += d_a**2
            out_b += d_b**2
        out00[i] = out_D/math.sqrt(out_a*out_b)
    return out00

def sliding_corr2(a, b, W):
    # Sliding means via a uniform filter, as in Approach #1
    am = uniform_filter(a.astype(float), W)
    bm = uniform_filter(b.astype(float), W)

    amc = am[W//2:-W//2+1]
    bmc = bm[W//2:-W//2+1]

    coeff = sliding_corr2_coeff(a, b, W, amc, bmc)

    # Two-sided p-value, as in scipy.stats.pearsonr
    ab = W/2 - 1
    pval = 2 * special.btdtr(ab, ab, 0.5 * (1 - np.abs(coeff)))
    return coeff, pval
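
A quick consistency check against pandas' built-in rolling correlation (my own sketch, not part of the original answer):

coeff2, pval2 = sliding_corr2(df['x'].to_numpy(), df['y'].to_numpy(), 50)
pandas_corr = df['x'].rolling(50).corr(df['y']).to_numpy()[49:]
print(np.allclose(coeff2, pandas_corr))  # expected: True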

Approach #3

Very similar to the previous one, except that we push all of the coefficient work into numba -

@njit(parallel=True)
def sliding_corr3_coeff(a, b, W):
    L = len(a) - W + 1
    out00 = np.empty(L)
    for i in prange(L):
        # Window means
        a_mean = 0.0
        b_mean = 0.0
        for j in range(W):
            a_mean += a[i+j]
            b_mean += b[i+j]
        a_mean /= W
        b_mean /= W

        # Sums of squares and cross products for the window
        out_a = 0.0
        out_b = 0.0
        out_D = 0.0
        for j in range(W):
            d_a = a[i+j] - a_mean
            d_b = b[i+j] - b_mean
            out_D += d_a*d_b
            out_a += d_a*d_a
            out_b += d_b*d_b
        out00[i] = out_D/math.sqrt(out_a*out_b)
    return out00

def sliding_corr3(a, b, W):
    coeff = sliding_corr3_coeff(a, b, W)
    # Two-sided p-value, as in scipy.stats.pearsonr
    ab = W/2 - 1
    pval = 2 * special.btdtr(ab, ab, 0.5 * (1 - np.abs(coeff)))
    return coeff, pval

Timings -

In [181]: df = pd.DataFrame({'x': np.random.rand(10000), 'y': np.random.rand(10000)})

In [182]: %timeit sliding_corr2(df['x'].to_numpy(copy=False),df['y'].to_numpy(copy=False),50)
100 loops, best of 3: 5.05 ms per loop

In [183]: %timeit sliding_corr3(df['x'].to_numpy(copy=False),df['y'].to_numpy(copy=False),50)
100 loops, best of 3: 5.51 ms per loop

Notes:

  • sliding_corr1 seems to take long on this dataset, most probably because of the memory requirement of its step 2.

  • After using the numba functions, the bottleneck shifts to the p-value computation with special.btdtr (see the sketch below).
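
Since special.btdtr(a, b, x) is the regularized incomplete beta function, the same p-values can also be obtained from special.betainc, which newer SciPy releases favour over btdtr. A small sketch of my own, not part of the original answer:

coeff, _ = sliding_corr3(df['x'].to_numpy(), df['y'].to_numpy(), 50)
ab = 50/2 - 1
pval_alt = 2 * special.betainc(ab, ab, 0.5 * (1 - np.abs(coeff)))  # same values as with btdtr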
