Correlation matrix of two pandas dataframes with P values

Question

I was using this function (see bottom) to calculate both the Pearson coefficients and the P values for two dataframes, but I am not confident in the P value results: it seems that too many negative correlations are significant.

Is there a more elegant way (ideally a one-liner) to calculate the P values along with the Pearson coefficients?

These two answers (pandas.DataFrame corrwith() method) and (correlation matrix of one dataframe with another) provide elegant solutions, but the P value calculation is missing.

Here is the code:

import numpy as np
import pandas as pd
from scipy import stats

def pearson_cross_map(df1, df2):
    """Correlate each Mvar with each Nvar.

    Parameters
    ----------
    df1 : dataframe1
    Shape Mobs X Mvar.

    df2 : dataframe2
    Shape Nobs X Nvar.

    Returns
    -------
    DFcoeff, dataframe Mvar x Nvar in which each element is a Pearson
    correlation coefficient.
    DFpval, dataframe Mvar x Nvar in which each element is a P value (one-tailed).

    """

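    # Keep only the observations (rows) present in both dataframes
    intersection = df1.index.intersection(df2.index).tolist()
    # Coerce to numeric; values that cannot be converted become NaN
    df1 = df1.apply(pd.to_numeric, errors='coerce')
    df1 = df1.loc[intersection]
    # Drop all-zero columns, then sort rows and columns by label
    df1 = df1.loc[:, (df1 != 0).any(axis=0)].sort_index().sort_index(axis=1)
    df2 = df2.apply(pd.to_numeric, errors='coerce')
    df2 = df2.loc[intersection]
    df2 = df2.loc[:, (df2 != 0).any(axis=0)].sort_index().sort_index(axis=1)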
    x = df1.T.values
    y = df2.T.values
    mu_x = x.mean(1)
    mu_y = y.mean(1)
    n = x.shape[1]
    s_x = x.std(1, ddof=n - 1)
    s_y = y.std(1, ddof=n - 1)
    cov = np.dot(x,y.T) - n * np.dot(mu_x[:, np.newaxis], mu_y[np.newaxis, :])
    DFcoeff = pd.DataFrame(cov / np.dot(s_x[:, np.newaxis], s_y[np.newaxis, :]))
    DFcoeff.index = df1.columns.tolist()
    DFcoeff.columns = df2.columns.tolist()
    n = len(intersection)
    r = DFcoeff
    t = r*np.sqrt((n-2)/(1-r*r))
    DFpval = pd.DataFrame(stats.t.cdf(t, n-2))
    DFpval.index = df1.columns.tolist()
    DFpval.columns = df2.columns.tolist()
    return DFcoeff, DFpval

Thank you!

Answer 1

You require Pearson correlation testing and not just correlation calculation. Hence, use the scipy.stats.pearsonr method, which returns the estimated Pearson coefficient and the 2-tailed p-value.
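One possible reason the function in the question flags so many negative correlations: `stats.t.cdf(t, n-2)` is the lower-tail probability, which is small whenever r is strongly negative and close to 1 whenever r is strongly positive. A minimal sketch on synthetic data (the arrays, the seed and the helper name `tail_probabilities` are illustrative assumptions, not part of the question) compares that value with the two-sided p-value reported by pearsonr:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)               # illustrative seed
n = 30
x = rng.normal(size=n)
y_neg = -x + rng.normal(scale=0.5, size=n)   # negatively correlated with x
y_pos = x + rng.normal(scale=0.5, size=n)    # positively correlated with x

def tail_probabilities(a, b):
    """Return r, lower-tail P(T <= t), two-sided p from t, and pearsonr's p."""
    r, p_ref = stats.pearsonr(a, b)
    dof = len(a) - 2
    t = r * np.sqrt(dof / (1 - r * r))
    lower_tail = stats.t.cdf(t, dof)          # what the question's function computes
    two_sided = 2 * stats.t.sf(abs(t), dof)   # does not depend on the sign of r
    return r, lower_tail, two_sided, p_ref

for label, y in [("negative", y_neg), ("positive", y_pos)]:
    print(label, tail_probabilities(x, y))
```

Under these assumptions the lower-tail value is tiny for the negative pair and close to 1 for the positive pair, while the two-sided value treats both signs alike.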

Since the method requires series (1-D) input, consider iterating through each column of both dataframes to fill pre-allocated matrices, then cast them to dataframes with the needed columns and index:

import numpy as np
import pandas as pd
from scipy.stats import pearsonr

df1 = pd.DataFrame(np.random.rand(10, 5), columns=['Col1', 'Col2', 'Col3', 'Col4', 'Col5'])
df2 = pd.DataFrame(np.random.rand(10, 5), columns=['Col1', 'Col2', 'Col3', 'Col4', 'Col5'])

coeffmat = np.zeros((df1.shape[1], df2.shape[1]))
pvalmat = np.zeros((df1.shape[1], df2.shape[1]))

for i in range(df1.shape[1]):    
    for j in range(df2.shape[1]):        
        corrtest = pearsonr(df1[df1.columns[i]], df2[df2.columns[j]])  

        coeffmat[i,j] = corrtest[0]
        pvalmat[i,j] = corrtest[1]

dfcoeff = pd.DataFrame(coeffmat, columns=df2.columns, index=df1.columns)
print(dfcoeff)
#           Col1      Col2      Col3      Col4      Col5
# Col1 -0.791083  0.459101 -0.488463 -0.289265  0.494897
# Col2  0.059446 -0.395072  0.310900  0.297532  0.201669
# Col3 -0.062592  0.391469 -0.450600 -0.136554  0.299579
# Col4 -0.470203  0.797971 -0.193561 -0.338896 -0.244132
# Col5 -0.057848 -0.037053  0.042798  0.176966 -0.157344

dfpvals = pd.DataFrame(pvalmat, columns=df2.columns, index=df1.columns)
print(dfpvals)
#           Col1      Col2      Col3      Col4      Col5
# Col1  0.006421  0.181967  0.152007  0.417574  0.145871
# Col2  0.870421  0.258506  0.381919  0.403770  0.576357
# Col3  0.863615  0.263268  0.191245  0.706796  0.400385
# Col4  0.170260  0.005666  0.592096  0.338101  0.496668
# Col5  0.873881  0.919058  0.906551  0.624783  0.664206
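If the double loop becomes slow for wide dataframes, the same two-sided p-values can be derived from the whole correlation matrix at once, using the t distribution with n - 2 degrees of freedom. The following is only a sketch under assumptions: `corr_with_pvalues` is a hypothetical helper name, and both dataframes are assumed to be numeric, free of NaN, and to share index labels.

```python
import numpy as np
import pandas as pd
from scipy import stats

def corr_with_pvalues(df1, df2):
    """Pearson r and two-sided p-values for every column pair (hypothetical helper)."""
    common = df1.index.intersection(df2.index)
    x = df1.loc[common].to_numpy(dtype=float)
    y = df2.loc[common].to_numpy(dtype=float)
    n = len(common)

    # Standardise each column so the cross product divided by n is Pearson r.
    xz = (x - x.mean(axis=0)) / x.std(axis=0)
    yz = (y - y.mean(axis=0)) / y.std(axis=0)
    r = np.clip(xz.T @ yz / n, -1.0, 1.0)

    # t statistic with n - 2 degrees of freedom, then the two-sided tail probability.
    with np.errstate(divide="ignore"):
        t = r * np.sqrt((n - 2) / (1.0 - r * r))
    p = 2 * stats.t.sf(np.abs(t), n - 2)

    coeff = pd.DataFrame(r, index=df1.columns, columns=df2.columns)
    pvals = pd.DataFrame(p, index=df1.columns, columns=df2.columns)
    return coeff, pvals

# Usage on random data (illustrative column names and seed)
rng = np.random.default_rng(1)
a = pd.DataFrame(rng.random((10, 3)), columns=["A1", "A2", "A3"])
b = pd.DataFrame(rng.random((10, 2)), columns=["B1", "B2"])
coeff, pvals = corr_with_pvalues(a, b)
print(coeff)
print(pvals)
```

The results should agree with the pearsonr loop above up to floating-point error; the loop remains the simpler choice when the number of columns is small.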

Answer 2

You could compare this with bootstrap significance (i.e. if you randomly shuffle one series, what is the probability of getting the same or a greater correlation). This is not the same thing as Pearson's p-value, because the latter is derived under the assumption that your data are normally distributed, so you may get a somewhat different result if that is not the case.

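import numpy as np

bootstrapLen = 1000
leng = 10000
X, Y = [np.random.randn(leng) for _ in [1, 2]]
# Mean product of X and Y; it approximates the correlation here only
# because both series are standard normal
correlation = np.correlate(X, Y) / leng

# Null distribution: the same statistic after randomly shuffling Y
bootstrap = [abs(np.correlate(X, Y[np.random.permutation(leng)]) / leng) for _ in range(bootstrapLen)]
bootstrap = np.sort(np.ravel(bootstrap))
# Fraction of shuffled values below the observed one
significance = np.searchsorted(bootstrap, abs(correlation)) / bootstrapLen

print("correlation is {} with significance {}".format(correlation, significance))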
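A variant of the shuffle test that works with the normalised Pearson coefficient and reports a p-value directly might look like the sketch below. `permutation_pvalue` and its parameters are hypothetical names, the data are synthetic, and the add-one correction is a common convention rather than something taken from this answer.

```python
import numpy as np

def permutation_pvalue(x, y, n_perm=1000, seed=None):
    """Two-sided permutation p-value for Pearson r (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    observed = np.corrcoef(x, y)[0, 1]

    # Shuffling y breaks its pairing with x, which simulates the null hypothesis.
    shuffled = np.array([
        np.corrcoef(x, rng.permutation(y))[0, 1] for _ in range(n_perm)
    ])

    # Share of shuffles at least as extreme as the observed |r|;
    # the +1 terms keep the estimate away from an exact zero.
    exceed = np.sum(np.abs(shuffled) >= abs(observed))
    return observed, (exceed + 1) / (n_perm + 1)

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 0.5 * x + rng.normal(size=200)   # correlated by construction
r, p = permutation_pvalue(x, y, n_perm=500, seed=1)
print("r = {:.3f}, permutation p = {:.4f}".format(r, p))
```

Because no distributional assumption is involved, this can serve as a cross-check on the pearsonr p-value when the data are clearly non-normal, at the cost of `n_perm` extra correlation computations per pair.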
