Correlation matrix of two Pandas dataframes, with P values

I have been using this function (see bottom) to calculate both the Pearson coefficients and the p-values from two dataframes, but I am not confident in the p-value results: too many negative correlations seem to come out as significant.

Is there a more elegant way (a one-liner, for instance) to calculate the p-values along with the Pearson coefficients?

These two answers (pandas.DataFrame corrwith() method) and (correlation matrix of one dataframe with another) provide elegant solutions, but the p-value calculation is missing.

Here is the code:

import numpy as np
import pandas as pd
from scipy import stats

def pearson_cross_map(df1, df2):
    """Correlate each Mvar with each Nvar.

    df1 : dataframe1
    Shape Mobs X Mvar.

    df2 : dataframe2
    Shape Nobs X Nvar.

    DFcoeff, dataframe Mvar x Nvar in which each element is a Pearson
    correlation coefficient.
    DFpval, dataframe Mvar x Nvar in which each element is a P value (one-tailed).
    """
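    # Keep only the observations (rows) present in both dataframes
    intersection = df1.index.intersection(df2.index)
    # convert_objects() and DataFrame.sort() no longer exist in current pandas
    df1 = df1.apply(pd.to_numeric, errors='coerce')
    df1 = df1.loc[intersection]
    # Drop all-zero columns, then sort rows and columns by label
    df1 = df1.loc[:, (df1 != 0).any(axis=0)].sort_index().sort_index(axis=1)
    df2 = df2.apply(pd.to_numeric, errors='coerce')
    df2 = df2.loc[intersection]
    df2 = df2.loc[:, (df2 != 0).any(axis=0)].sort_index().sort_index(axis=1)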
    x = df1.T.values
    y = df2.T.values
    mu_x = x.mean(1)
    mu_y = y.mean(1)
    n = x.shape[1]
    s_x = x.std(1, ddof=n - 1)
    s_y = y.std(1, ddof=n - 1)
    cov = np.dot(x,y.T) - n * np.dot(mu_x[:, np.newaxis], mu_y[np.newaxis, :])
    DFcoeff = pd.DataFrame(cov / np.dot(s_x[:, np.newaxis], s_y[np.newaxis, :]))
    DFcoeff.index = df1.columns.tolist()
    DFcoeff.columns = df2.columns.tolist()
    n = len(intersection)
    r = DFcoeff
    t = r*np.sqrt((n-2)/(1-r*r))
    DFpval = pd.DataFrame(stats.t.cdf(t, n-2))
    DFpval.index = df1.columns.tolist()
    DFpval.columns = df2.columns.tolist()
    return DFcoeff, DFpval

Thank you!

You require Pearson correlation testing, not just correlation calculation. Hence, use the scipy.stats.pearsonr function, which returns the estimated Pearson coefficient and a two-tailed p-value.
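To make the one-tailed versus two-tailed difference concrete, here is a small sketch. The data, the seed, and the variable names are made up for illustration. It builds a negatively correlated pair, computes the same t statistic as the question's function, and compares the lower-tail CDF with the two-tailed value from pearsonr.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)      # fixed seed, illustration only
n = 50
x = rng.normal(size=n)
y = -x + rng.normal(size=n)         # constructed to be negatively correlated

r, p_two_sided = stats.pearsonr(x, y)

# Same t statistic as in the question's function
t = r * np.sqrt((n - 2) / (1 - r * r))
p_lower = stats.t.cdf(t, n - 2)             # P(T <= t): small whenever r is clearly negative
p_from_t = 2 * stats.t.sf(abs(t), n - 2)    # two-tailed: both tails count

print(r, p_lower, p_two_sided, p_from_t)
```

Because stats.t.cdf is the lower-tail probability, it shrinks towards 0 as r becomes more negative and grows towards 1 as r becomes more positive, which would be consistent with negative correlations looking significant more often than expected.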

Since pearsonr takes a pair of one-dimensional inputs, iterate over each column of both dataframes to fill pre-allocated matrices, then cast the results to dataframes with the needed columns and index:

import numpy as np
import pandas as pd
from scipy.stats import pearsonr

df1 = pd.DataFrame(np.random.rand(10, 5), columns=['Col1', 'Col2', 'Col3', 'Col4', 'Col5'])
df2 = pd.DataFrame(np.random.rand(10, 5), columns=['Col1', 'Col2', 'Col3', 'Col4', 'Col5'])

coeffmat = np.zeros((df1.shape[1], df2.shape[1]))
pvalmat = np.zeros((df1.shape[1], df2.shape[1]))

for i in range(df1.shape[1]):    
    for j in range(df2.shape[1]):        
        corrtest = pearsonr(df1[df1.columns[i]], df2[df2.columns[j]])  

        coeffmat[i,j] = corrtest[0]
        pvalmat[i,j] = corrtest[1]

dfcoeff = pd.DataFrame(coeffmat, columns=df2.columns, index=df1.columns)
#           Col1      Col2      Col3      Col4      Col5
# Col1 -0.791083  0.459101 -0.488463 -0.289265  0.494897
# Col2  0.059446 -0.395072  0.310900  0.297532  0.201669
# Col3 -0.062592  0.391469 -0.450600 -0.136554  0.299579
# Col4 -0.470203  0.797971 -0.193561 -0.338896 -0.244132
# Col5 -0.057848 -0.037053  0.042798  0.176966 -0.157344

dfpvals = pd.DataFrame(pvalmat, columns=df2.columns, index=df1.columns)
#           Col1      Col2      Col3      Col4      Col5
# Col1  0.006421  0.181967  0.152007  0.417574  0.145871
# Col2  0.870421  0.258506  0.381919  0.403770  0.576357
# Col3  0.863615  0.263268  0.191245  0.706796  0.400385
# Col4  0.170260  0.005666  0.592096  0.338101  0.496668
# Col5  0.873881  0.919058  0.906551  0.624783  0.664206
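If the double loop becomes slow for wide dataframes, the same matrices can be computed in one pass. The following is only a sketch: corr_with_pvalues is a hypothetical helper name, and it assumes both dataframes are numeric, have the same number of rows in the same order, and contain no NaN. It derives the two-tailed p-values from the t distribution rather than calling pearsonr.

```python
import numpy as np
import pandas as pd
from scipy import stats


def corr_with_pvalues(df1, df2):
    """Pearson r and two-tailed p-values for every column pair (hypothetical helper)."""
    n = len(df1)
    x = df1.to_numpy(dtype=float)
    y = df2.to_numpy(dtype=float)
    # Standardise each column (population std), so x.T @ y / n is the Pearson r
    x = (x - x.mean(axis=0)) / x.std(axis=0)
    y = (y - y.mean(axis=0)) / y.std(axis=0)
    r = np.clip(x.T @ y / n, -1.0, 1.0)
    # t statistic with n - 2 degrees of freedom; |r| == 1 gives t = +/-inf and p = 0
    with np.errstate(divide='ignore'):
        t = r * np.sqrt((n - 2) / (1 - r ** 2))
    p = 2 * stats.t.sf(np.abs(t), n - 2)
    coeff = pd.DataFrame(r, index=df1.columns, columns=df2.columns)
    pvals = pd.DataFrame(p, index=df1.columns, columns=df2.columns)
    return coeff, pvals


# Usage with random data shaped like the example above
a = pd.DataFrame(np.random.rand(10, 5), columns=['Col1', 'Col2', 'Col3', 'Col4', 'Col5'])
b = pd.DataFrame(np.random.rand(10, 5), columns=['Col1', 'Col2', 'Col3', 'Col4', 'Col5'])
coeff, pvals = corr_with_pvalues(a, b)
print(coeff)
print(pvals)
```

The results should agree with the loop over pearsonr up to floating-point error, since both rely on the same t distribution with n - 2 degrees of freedom.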

You could compare this with bootstrap significance (i.e., if you randomly shuffle one series, what is the probability of getting the same or a greater correlation?). This is not the same thing as Pearson's p-value, since the latter is derived under the assumption that your data are normally distributed, so you may get a somewhat different result if that is not the case.

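import numpy as np

bootstrapLen = 1000
leng = 10000
X, Y = [np.random.randn(leng) for _ in [1, 2]]
# For equal-length inputs np.correlate returns a length-1 array holding sum(X*Y);
# dividing by leng approximates the correlation only because X and Y are standard normal
correlation = np.correlate(X, Y) / leng

# Absolute correlation after shuffling Y, repeated bootstrapLen times
bootstrap = [abs(np.correlate(X, Y[np.random.permutation(leng)]) / leng) for _ in range(bootstrapLen)]
bootstrap = np.sort(np.ravel(bootstrap))
# Fraction of shuffles with a smaller absolute correlation than the observed one
significance = np.searchsorted(bootstrap, abs(correlation)) / bootstrapLen

print("correlation is {} with significance {}".format(correlation, significance))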
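For data that are not already zero-mean and unit-variance, the shuffling idea (often called a permutation test) can be sketched with pearsonr itself. The sample sizes, the seed, and the variable names here are assumptions for illustration. The code reports an empirical p-value, meaning the share of shuffles at least as extreme as the observed correlation, next to the parametric one.

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(42)     # fixed seed, illustration only
n_obs, n_shuffles = 200, 2000
x = rng.normal(size=n_obs)
y = 0.3 * x + rng.normal(size=n_obs)

r_obs, p_parametric = pearsonr(x, y)

# Null distribution: |r| after breaking the pairing by shuffling y
r_null = np.array([abs(pearsonr(x, rng.permutation(y))[0]) for _ in range(n_shuffles)])

# Share of shuffles at least as extreme as observed; the +1 avoids reporting exactly zero
p_perm = (np.sum(r_null >= abs(r_obs)) + 1) / (n_shuffles + 1)

print("r = {:.3f}, parametric p = {:.4g}, permutation p = {:.4g}".format(r_obs, p_parametric, p_perm))
```

When the normality assumption roughly holds, the two p-values should be of similar magnitude; a large gap would suggest that the parametric p-value is not reliable for the data at hand.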
