List the highest correlation pairs from a large correlation matrix in Pandas?

Question

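How do you find the top correlations in a correlation matrix with Pandas? There are many answers on how to do this with R (Show correlations as an ordered list, not as a large matrix or Efficient way to get highly correlated pairs from large data set in Python or R), but I am wondering how to do it with pandas. In my case the matrix is 4460x4460, so I can't do it visually.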

Answer 1

You can use DataFrame.values to get a numpy array of the data and then use NumPy functions such as argsort() to get the most correlated pairs.
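A minimal sketch of that NumPy route, assuming absolute correlations are wanted and each pair should be listed only once. The function name top_pairs_numpy and the toy data are illustrative and not part of the original answer.

```python
import numpy as np
import pandas as pd

def top_pairs_numpy(df, n=5):
    """Return the n most correlated column pairs as (label_a, label_b, |r|) tuples."""
    corr = df.corr().abs()
    labels = corr.columns
    values = corr.to_numpy()
    # Indices of the strict upper triangle: every pair once, no diagonal.
    rows, cols = np.triu_indices_from(values, k=1)
    pair_values = values[rows, cols]
    # argsort on the negated values sorts descending; NaN entries end up last.
    order = np.argsort(-pair_values)[:n]
    return [(labels[rows[k]], labels[cols[k]], float(pair_values[k])) for k in order]

# Toy data (hypothetical): x3 is exactly 2 * x1, so that pair should rank first.
demo = pd.DataFrame({'x1': [1, 4, 4, 5, 6],
                     'x2': [0, 0, 8, 2, 4],
                     'x3': [2, 8, 8, 10, 12],
                     'x4': [-1, -4, -4, -4, -5]})
top = top_pairs_numpy(demo, n=3)
print(top)
```

Working on the flat upper-triangle values avoids building a MultiIndex Series with all n*n entries, which may matter for a matrix as large as 4460x4460.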

But if you want to do this in pandas, you can unstack and sort the DataFrame:

import pandas as pd
import numpy as np

shape = (50, 4460)

data = np.random.normal(size=shape)

data[:, 1000] += data[:, 2000]

df = pd.DataFrame(data)

c = df.corr().abs()

s = c.unstack()
so = s.sort_values(kind="quicksort")

print(so[-4470:-4460])

Here is the output:

2192  1522    0.636198
1522  2192    0.636198
3677  2027    0.641817
2027  3677    0.641817
242   130     0.646760
130   242     0.646760
1171  2733    0.670048
2733  1171    0.670048
1000  2000    0.742340
2000  1000    0.742340
dtype: float64

Answer 2

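@HYRY's answer is perfect. This just builds on that answer by adding a bit more logic to avoid duplicates and self-correlations and to sort properly: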

import pandas as pd
d = {'x1': [1, 4, 4, 5, 6], 
     'x2': [0, 0, 8, 2, 4], 
     'x3': [2, 8, 8, 10, 12], 
     'x4': [-1, -4, -4, -4, -5]}
df = pd.DataFrame(data = d)
print("Data Frame")
print(df)
print()

print("Correlation Matrix")
print(df.corr())
print()

def get_redundant_pairs(df):
    '''Get diagonal and lower triangular pairs of correlation matrix'''
    pairs_to_drop = set()
    cols = df.columns
    for i in range(0, df.shape[1]):
        for j in range(0, i+1):
            pairs_to_drop.add((cols[i], cols[j]))
    return pairs_to_drop

def get_top_abs_correlations(df, n=5):
    au_corr = df.corr().abs().unstack()
    labels_to_drop = get_redundant_pairs(df)
    au_corr = au_corr.drop(labels=labels_to_drop).sort_values(ascending=False)
    return au_corr[0:n]

print("Top Absolute Correlations")
print(get_top_abs_correlations(df, 3))

That gives the following output:

Data Frame
   x1  x2  x3  x4
0   1   0   2  -1
1   4   0   8  -4
2   4   8   8  -4
3   5   2  10  -4
4   6   4  12  -5

Correlation Matrix
          x1        x2        x3        x4
x1  1.000000  0.399298  1.000000 -0.969248
x2  0.399298  1.000000  0.399298 -0.472866
x3  1.000000  0.399298  1.000000 -0.969248
x4 -0.969248 -0.472866 -0.969248  1.000000

Top Absolute Correlations
x1  x3    1.000000
x3  x4    0.969248
x1  x4    0.969248
dtype: float64

Answer 3

A few-line solution without redundant pairs of variables:

corr_matrix = df.corr().abs()

#the matrix is symmetric so we need to extract upper triangle matrix without diagonal (k = 1)

sol = (corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
                  .stack()
                  .sort_values(ascending=False))

#first element of sol series is the pair with the biggest correlation

You can then iterate over the names of the variable pairs (which are pandas.Series multi-indexes) and their values like this:

for index, value in sol.items():
    # do some stuff with each pair; index is a (row label, column label) tuple
    print(index, value)
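As an illustration of that loop, here is a small sketch that unpacks each multi-index entry into its two labels and stops at a cutoff. The toy data, the 0.9 cutoff, and the extra .dropna() (added so the result does not depend on how a given pandas version treats NaN in stack()) are assumptions of this sketch.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical), used only to have something to iterate over.
demo = pd.DataFrame({'x1': [1, 4, 4, 5, 6],
                     'x2': [0, 0, 8, 2, 4],
                     'x3': [2, 8, 8, 10, 12],
                     'x4': [-1, -4, -4, -4, -5]})

corr_matrix = demo.corr().abs()
upper = np.triu(np.ones(corr_matrix.shape, dtype=bool), k=1)
sol = corr_matrix.where(upper).stack().dropna().sort_values(ascending=False)

strong_pairs = []
for (var_a, var_b), value in sol.items():
    if value < 0.9:
        break  # sol is sorted descending, so no later pair can qualify
    strong_pairs.append((var_a, var_b, round(float(value), 3)))

print(strong_pairs)
```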

Answer 4

Combining some features of @HYRY's and @arun's answers, you can print the top correlations for the DataFrame df in a single line using:

df.corr().unstack().sort_values().drop_duplicates()

Note: one downside is that if you have correlations of 1.0 that are not a variable with itself, the drop_duplicates() addition would remove them.
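To make that downside concrete, here is a small sketch on a hand-built correlation matrix (the values are made up for illustration). It compares value-based de-duplication with a position-based mask over the upper triangle, which is a swapped-in alternative rather than part of this answer.

```python
import numpy as np
import pandas as pd

# Hand-built symmetric correlation matrix (hypothetical values):
# a and b are perfectly correlated, and both pairs involving c share the value 0.5.
labels = ['a', 'b', 'c']
corr = pd.DataFrame([[1.0, 1.0, 0.5],
                     [1.0, 1.0, 0.5],
                     [0.5, 0.5, 1.0]], index=labels, columns=labels)

# Value-based de-duplication keeps one row per distinct value, not per pair.
by_value = corr.unstack().sort_values().drop_duplicates()

# Position-based de-duplication keeps each off-diagonal pair exactly once.
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
by_position = corr.where(upper).unstack().dropna().sort_values()

print(len(by_value))     # 2: only one 0.5 entry and one 1.0 entry survive
print(len(by_position))  # 3: the three distinct off-diagonal pairs
```

Besides the 1.0 case, any two different pairs that happen to share exactly the same correlation value are also collapsed by drop_duplicates().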

Answer 5

Use the code below to view the correlations in descending order.

# See the correlations in descending order

corr = df.corr() # df is the pandas dataframe
c1 = corr.abs().unstack()
c1.sort_values(ascending = False)

Answer 6

You can do it graphically with this simple code by substituting your own data.

import matplotlib.pyplot as plt
import seaborn as sns

corr = df.corr()

kot = corr[corr>=.9]
plt.figure(figsize=(12,8))
sns.heatmap(kot, cmap="Greens")

Answer 7

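I liked Addison Klinke's post the most, as it is the simplest, but I used Wojciech Moszczyńsk's suggestion for filtering and charting and extended the filter to avoid absolute values. So, given a large correlation matrix: filter it, chart it, and then flatten it.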

Create, filter and chart

import matplotlib.pyplot as plt
import seaborn as sn

dfCorr = df.corr()
filteredDf = dfCorr[((dfCorr >= .5) | (dfCorr <= -.5)) & (dfCorr !=1.000)]
plt.figure(figsize=(30,10))
sn.heatmap(filteredDf, annot=True, cmap="Reds")
plt.show()

Function

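In the end, I wrote a small function that creates the correlation matrix, filters it, and then flattens it. As an idea, it could easily be extended, e.g. with asymmetric upper and lower bounds.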

def corrFilter(x: pd.DataFrame, bound: float):
    xCorr = x.corr()
    xFiltered = xCorr[((xCorr >= bound) | (xCorr <= -bound)) & (xCorr !=1.000)]
    xFlattened = xFiltered.unstack().sort_values().drop_duplicates()
    return xFlattened

corrFilter(df, .7)
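The asymmetric-bounds extension mentioned above might look like the following sketch. The function name corr_filter_asymmetric, its parameters lower and upper, and the toy data are hypothetical; it also swaps the value-based drop_duplicates() for an upper-triangle mask so that each pair appears once.

```python
import numpy as np
import pandas as pd

def corr_filter_asymmetric(x, lower, upper):
    """Flattened correlations that are <= lower or >= upper, each pair once."""
    corr = x.corr()
    # Strict upper triangle: drops the diagonal and the mirrored duplicates.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    flat = corr.where(mask).unstack().dropna()
    keep = (flat <= lower) | (flat >= upper)
    return flat[keep].sort_values()

# Toy data (hypothetical) to exercise the function.
demo = pd.DataFrame({'x1': [1, 4, 4, 5, 6],
                     'x2': [0, 0, 8, 2, 4],
                     'x3': [2, 8, 8, 10, 12],
                     'x4': [-1, -4, -4, -4, -5]})

symmetric = corr_filter_asymmetric(demo, lower=-0.9, upper=0.9)
asymmetric = corr_filter_asymmetric(demo, lower=-0.4, upper=0.95)
print(symmetric)
print(asymmetric)
```

With lower=-0.4 the weaker negative correlation between x2 and x4 is kept as well, while the positive side still requires at least 0.95.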

Follow-up

Eventually, I refined the functions:

# Returns correlation matrix
def corrFilter(x: pd.DataFrame, bound: float):
    xCorr = x.corr()
    xFiltered = xCorr[((xCorr >= bound) | (xCorr <= -bound)) & (xCorr !=1.000)]
    return xFiltered

# flattens correlation matrix with bounds
def corrFilterFlattened(x: pd.DataFrame, bound: float):
    xFiltered = corrFilter(x, bound)
    xFlattened = xFiltered.unstack().sort_values().drop_duplicates()
    return xFlattened

# Returns correlation for a variable from flattened correlation matrix
def filterForLabels(df: pd.DataFrame, label):  
    try:
        sideLeft = df[label,]
    except:
        sideLeft = pd.DataFrame()

    try:
        sideRight = df[:,label]
    except:
        sideRight = pd.DataFrame()

    if sideLeft.empty and sideRight.empty:
        return pd.DataFrame()
    elif sideLeft.empty:        
        concat = sideRight.to_frame()
        concat.rename(columns={0:'Corr'},inplace=True)
        return concat
    elif sideRight.empty:
        concat = sideLeft.to_frame()
        concat.rename(columns={0:'Corr'},inplace=True)
        return concat
    else:
        concat = pd.concat([sideLeft,sideRight], axis=1)
        concat["Corr"] = concat[0].fillna(0) + concat[1].fillna(0)
        concat.drop(columns=[0,1], inplace=True)
        return concat
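A shorter way to look up one variable in a flattened correlation Series is to test both index levels directly. This is only a sketch of an alternative, not the author's function; the name correlations_for and the sample Series are made up for illustration.

```python
import pandas as pd

def correlations_for(flat, label):
    """Select the entries of a flattened correlation Series that involve `label`."""
    first = flat.index.get_level_values(0)
    second = flat.index.get_level_values(1)
    # Keep a row if the label appears on either side of the pair.
    return flat[(first == label) | (second == label)]

# Hypothetical flattened result, shaped like the output of corrFilterFlattened.
flat = pd.Series([0.9, -0.8, 0.75],
                 index=pd.MultiIndex.from_tuples([('a', 'b'), ('c', 'a'), ('b', 'c')]))

picked = correlations_for(flat, 'a')
print(picked)
```

Unlike filterForLabels, this keeps the original two-level index, so the partner of the label has to be read from whichever level does not match.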

Answer 8

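Use itertools.combinations to get all unique correlations from pandas' own correlation matrix .corr(), generate a list of lists, and feed it back into a DataFrame in order to use '.sort_values'. Set ascending = True to display the lowest correlations on top.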

corrank takes a DataFrame as its argument because it requires .corr().

import itertools
import pandas as pd

def corrank(X: pd.DataFrame):
    corr = X.corr()  # compute the correlation matrix once instead of once per pair
    df = pd.DataFrame([[(i, j), corr.loc[i, j]] for i, j in itertools.combinations(corr, 2)],
                      columns=['pairs', 'corr'])
    print(df.sort_values(by='corr', ascending=False))

corrank(X) # prints a descending list of correlation pairs (max on top)

Answer 9

Lots of good answers here. The easiest way I found was a combination of some of the answers above.

# corr is assumed to be a correlation matrix, e.g. corr = df.corr()
corr = corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool))
corr = corr.unstack()\
    .sort_values(ascending=False)\
    .dropna()

Answer 10

Combining most of the answers above into a short snippet:

def top_entries(df):
    mat = df.corr().abs()
    
    # Remove duplicate and identity entries
    mat.loc[:,:] = np.tril(mat.values, k=-1)
    mat = mat[mat>0]

    # Unstack, sort descending, and reset the index, so features are in columns
    # instead of indexes (allowing e.g. a pretty print in Jupyter).
    # Also rename the columns for good measure.
    return (mat.unstack()
             .sort_values(ascending=False)
             .reset_index()
             .rename(columns={
                 "level_0": "feature_a",
                 "level_1": "feature_b",
                 0: "correlation"
             }))

Answer 11

I didn't want to unstack or over-complicate this, since I just wanted to drop some highly correlated features as part of a feature selection phase.

So I ended up with the following simplified solution:

# map features to their absolute correlation values
corr = features.corr().abs()

# set equality (self correlation) as zero
corr[corr == 1] = 0

# for each feature, find the max correlation
# and sort the resulting Series in descending order
corr_cols = corr.max().sort_values(ascending=False)

# display the highly correlated features
display(corr_cols[corr_cols > 0.8])

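In this case, if you want to drop correlated features, you can go through the filtered corr_cols array and remove the odd-indexed (or even-indexed) ones. Note that this relies on the two members of a correlated pair ending up next to each other after sorting.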
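If relying on the odd/even positions feels fragile, a commonly used alternative is to look only at the upper triangle and drop every column that is highly correlated with an earlier column. The sketch below swaps in that technique; the function name drop_highly_correlated, the 0.8 threshold, and the toy data are assumptions for illustration.

```python
import numpy as np
import pandas as pd

def drop_highly_correlated(features, threshold=0.8):
    """Drop each column whose |correlation| with an earlier column exceeds threshold."""
    corr = features.corr().abs()
    # In the strict upper triangle, column j only holds correlations with columns before j.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return features.drop(columns=to_drop), to_drop

# Toy data (hypothetical): x3 duplicates x1 up to scale, x4 is strongly tied to both.
demo = pd.DataFrame({'x1': [1, 4, 4, 5, 6],
                     'x2': [0, 0, 8, 2, 4],
                     'x3': [2, 8, 8, 10, 12],
                     'x4': [-1, -4, -4, -4, -5]})

reduced, dropped = drop_highly_correlated(demo, threshold=0.8)
print(dropped)
print(list(reduced.columns))
```

Which member of a pair survives depends on column order, so it may be worth ordering the columns by importance before calling it.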

Answer 12

I tried some of the solutions here but then came up with my own. I hope it is useful to the next person, so I am sharing it here:

def sort_correlation_matrix(correlation_matrix):
    cor = correlation_matrix.abs()
    top_col = cor[cor.columns[0]][1:]
    top_col = top_col.sort_values(ascending=False)
    ordered_columns = [cor.columns[0]] + top_col.index.tolist()
    return correlation_matrix[ordered_columns].reindex(ordered_columns)

Answer 13

This is an improved version of @MiFi's code. This one orders by absolute value but does not exclude the negative values.

def top_correlation(df, n):
    corr_matrix = df.corr()
    correlation = (corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
                   .stack()
                   .sort_values(ascending=False))
    correlation = pd.DataFrame(correlation).reset_index()
    correlation.columns = ["Variable_1", "Variable_2", "Correlacion"]
    correlation = correlation.reindex(correlation.Correlacion.abs().sort_values(ascending=False).index).reset_index().drop(["index"], axis=1)
    return correlation.head(n)

top_correlation(ANYDATA,10)

Answer 14

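The following function should do the trick. This implementation

removes self-correlations
removes duplicates
enables selecting the top N most highly correlated features

and it is also configurable, so you can keep both the self-correlations and the duplicates. You can also report as many feature pairs as you wish.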

def get_feature_correlation(df, top_n=None, corr_method='spearman',
                            remove_duplicates=True, remove_self_correlations=True):
    """
    Compute the feature correlation and sort feature pairs based on their correlation

    :param df: The dataframe with the predictor variables
    :type df: pandas.core.frame.DataFrame
    :param top_n: Top N feature pairs to be reported (if None, all of the pairs will be returned)
    :param corr_method: Correlation computation method
    :type corr_method: str
    :param remove_duplicates: Indicates whether duplicate features must be removed
    :type remove_duplicates: bool
    :param remove_self_correlations: Indicates whether self correlations will be removed
    :type remove_self_correlations: bool

    :return: pandas.core.frame.DataFrame
    """
    corr_matrix_abs = df.corr(method=corr_method).abs()
    corr_matrix_abs_us = corr_matrix_abs.unstack()
    sorted_correlated_features = corr_matrix_abs_us \
        .sort_values(kind="quicksort", ascending=False) \
        .reset_index()

    # Remove comparisons of the same feature
    if remove_self_correlations:
        sorted_correlated_features = sorted_correlated_features[
            (sorted_correlated_features.level_0 != sorted_correlated_features.level_1)
        ]

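    # Remove duplicates: keep the first row of each unordered feature pair
    if remove_duplicates:
        pair_keys = sorted_correlated_features[['level_0', 'level_1']].apply(frozenset, axis=1)
        sorted_correlated_features = sorted_correlated_features[~pair_keys.duplicated()]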

    # Create meaningful names for the columns
    sorted_correlated_features.columns = ['Feature 1', 'Feature 2', 'Correlation (abs)']

    if top_n:
        return sorted_correlated_features[:top_n]

    return sorted_correlated_features

Answer 15

The simpler the better

from collections import defaultdict
res = defaultdict(dict)
corr = returns.corr().replace(1, -1)
names = list(corr)

for name in names:
    idx = corr[name].argmax()
    max_pairwise_name = names[idx]
    res[name][max_pairwise_name] = corr.loc[max_pairwise_name, name]

Now res contains the maximum pairwise correlation for each name.
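The same lookup can be done without the explicit loop, as in the sketch below. It masks the diagonal instead of replacing 1 with -1 (so genuine correlations of exactly 1 between different columns are not hidden); the toy returns frame and the names off_diag and best are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the `returns` DataFrame used above.
returns = pd.DataFrame({'x1': [1, 4, 4, 5, 6],
                        'x2': [0, 0, 8, 2, 4],
                        'x3': [2, 8, 8, 10, 12],
                        'x4': [-1, -4, -4, -4, -5]})

corr = returns.corr()
# Hide the diagonal so a column is never matched with itself.
off_diag = corr.mask(np.eye(len(corr), dtype=bool))

# For each column: the label of its most correlated partner and that correlation.
best = pd.DataFrame({'partner': off_diag.idxmax(), 'corr': off_diag.max()})
print(best)
```

This uses signed correlations, as the loop above does; applying .abs() to corr first would rank strong negative correlations as well.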

15 solutions

Solution 1 109 Accepted 2013-07-22 01:43:58
Solution 2 55 2017-01-03 23:15:39
Solution 3 47 2017-03-28 15:30:17
Solution 4 17 2018-06-27 21:24:54
Solution 5 11 2018-04-07 18:18:27
Solution 6 10 2020-03-27 08:16:33
Solution 7 9 2020-08-22 12:51:08
Solution 8 2 2017-09-22 10:46:29
Solution 9 2 2019-03-10 02:16:34
Solution 10 2 2021-02-08 20:56:16
Solution 11 1 2019-10-07 16:03:56
Solution 12 0 2019-10-16 13:25:31
Solution 13 0 2020-01-23 12:08:16
Solution 14 0 2020-04-01 17:12:52
Solution 15 0 2022-12-01 03:28:04
