Column-wise correlation between all pairs of columns of two data frames

Hi, I have written a loop that checks the correlation between pairs of variables. Does anyone know how I can create a new data frame from the results?

In [1]:
from scipy.stats import pearsonr

alpha = 0.05
for colY in Y.columns:
    for colX in X.columns:
        # Pearson correlation coefficient for this pair of columns
        corr, _ = pearsonr(numerical_cols_target[colX], numerical_cols_target[colY])
        print('Pearson Correlation', (alpha, corr))
        if corr <= alpha:
            print(colX + ' and ' + colY + ' are not correlated ')
        else:
            print(colX + ' and ' + colY + ' are highly correlated ')
        print('\n')
    print('\n')

Here is a sample output from the loop:

Out [1]: 
Pearson Correlation (0.05, -0.1620045985125294)
banana and orange are not correlated 

Pearson Correlation (0.05, 0.2267582070839226)
apple and orange are highly correlated

I would avoid using two for loops. Depending on the size of your dataset, this can be very slow.

pandas provides a correlation function which might come in handy here:

import pandas as pd

df = pd.DataFrame({'A': range(4), 'B': [2*i for i in range(4)]})

Calling corr() then gives you the pairwise correlations and returns them as a new dataframe:

df.corr()

For more information, see the documentation: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.corr.html
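Note that corr() only compares columns within a single dataframe, so the correlations between the columns of two dataframes need one extra step. A minimal sketch of one loop-free way to do this, assuming X and Y have the same number of rows, only numeric columns, and no missing values. The toy data mirrors the sample tables further down this thread, and cross_corr is a hypothetical name:

```python
import numpy as np
import pandas as pd

# Toy data (assumption: same shape as the sample tables in this thread)
X = pd.DataFrame({'A': [2, 4, 8, 0], 'B': [2, 0, 0, 0], 'C': [10, 2, 1, 8]})
Y = pd.DataFrame({'B': [2, 0, 0, 0], 'C': [10, 2, 1, 8]})

# With rowvar=False each column is a variable; X's and Y's columns are
# stacked into one set, giving a (3 + 2) x (3 + 2) correlation matrix.
full = np.corrcoef(X.to_numpy(), Y.to_numpy(), rowvar=False)

# The upper-right block holds the X-column vs Y-column correlations.
m = X.shape[1]
cross_corr = pd.DataFrame(full[:m, m:], index=X.columns, columns=Y.columns)
print(cross_corr)
```

The result has one row per column of X and one column per column of Y, so cross_corr.loc['A', 'C'] is the correlation between X['A'] and Y['C'].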

You can just do the following.

df = pd.DataFrame(index=X.columns, columns=Y.columns)

# In your loop (.loc avoids chained assignment, which may not write to df)
df.loc[colX, colY] = corr

Your loop would then be:

alpha = 0.05
for colY in Y.columns:
    for colX in X.columns:
        corr, _ = pearsonr(numerical_cols_target[colX], numerical_cols_target[colY])
        print('Pearson Correlation', (alpha, corr))
        # Store the coefficient; .loc avoids chained assignment
        df.loc[colX, colY] = corr
        if corr <= alpha:
            print(colX + ' and ' + colY + ' are not correlated ')
        else:
            print(colX + ' and ' + colY + ' are highly correlated ')
        print('\n')
    print('\n')
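One caveat about the loop: it compares the correlation coefficient itself with alpha, whereas a significance level such as 0.05 is usually compared with the p-value, which pearsonr returns as its second value. A hedged sketch that stores both, assuming toy data shaped like the sample tables in this thread (r_df, p_df and significant are hypothetical names):

```python
import pandas as pd
from scipy.stats import pearsonr

# Toy data (assumption: mirrors the sample tables in this thread)
X = pd.DataFrame({'A': [2, 4, 8, 0], 'B': [2, 0, 0, 0], 'C': [10, 2, 1, 8]})
Y = pd.DataFrame({'B': [2, 0, 0, 0], 'C': [10, 2, 1, 8]})
alpha = 0.05

# One table for the coefficients, one for the p-values
r_df = pd.DataFrame(index=X.columns, columns=Y.columns, dtype=float)
p_df = pd.DataFrame(index=X.columns, columns=Y.columns, dtype=float)

for colX in X.columns:
    for colY in Y.columns:
        r, p = pearsonr(X[colX], Y[colY])
        r_df.loc[colX, colY] = r
        p_df.loc[colX, colY] = p

# True where the correlation is statistically significant at level alpha
significant = p_df < alpha
print(r_df)
print(significant)
```

Whether a p-value test or a cut-off on the size of the coefficient is the right criterion depends on what "correlated" should mean for your data.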

I think you are looking for this. It computes the column-wise correlation for every pair of columns between the X and Y dataframes and creates another dataframe that holds all the correlations and whether each one exceeds a threshold alpha. It assumes Y has at most as many columns as X; if not, simply swap X and Y:

import collections
import pandas as pd

alpha = 0.05
d = collections.deque(X.columns)
n = len(Y.columns)
frames = []
for _ in range(len(d)):
    x_cols = list(d)[:n]
    # corrwith aligns on column labels, so relabel the selected X columns
    # with Y's labels to pair the columns by position instead
    X_part = X[x_cols].set_axis(Y.columns, axis=1)
    corr = Y.corrwith(X_part, axis=0).to_numpy()
    frames.append(pd.DataFrame({'col_X': x_cols, 'col_Y': list(Y.columns), 'corr': corr, 'is_correlated': corr > alpha}))
    d.rotate(1)  # shift by one so every X column meets every Y column
# DataFrame.append was removed in pandas 2.0, so collect and concatenate
corr_df = pd.concat(frames, ignore_index=True)
print(corr_df.round(3))

Sample input and output:

X:
   A  B   C
0  2  2  10
1  4  0   2
2  8  0   1
3  0  0   8

Y:
   B   C
0  2  10
1  0   2
2  0   1
3  0   8


correlation(X, Y), rounded to three decimals:

  col_X col_Y    corr  is_correlated
0     A     B  -0.293          False
1     B     C   0.716           True
2     C     B   0.716           True
3     A     C  -0.827          False
4     B     B   1.000           True
5     C     C   1.000           True