How to print correlated features using python pandas?

I'm trying to get some information on the correlation of the independent variables.

My dataset has a lot of variables, so a heatmap is not a solution; it is unreadable.

Currently, I have a function that returns only those variables that are highly correlated. I would like to change it so that it reports the pairs of correlated features.

The rest of the explanation is below:

import numpy as np

def find_correlated_features(df, threshold, target_variable):

    # drop the target column (columns= is needed; a bare label drops a row)
    df_1 = df.drop(columns=target_variable)

    # corr_matrix has the variable names in both its index and its columns
    corr_matrix = df_1.corr().abs()

    # I'm taking only half of this matrix to prevent doubling results
    # (np.bool was removed from NumPy; the builtin bool does the same job)
    half_of_matrix = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k = 1).astype(bool))

    # Collect the columns that are correlated with at least one other column
    to_drop = [column for column in half_of_matrix.columns if any(half_of_matrix[column] > threshold)]
    
    return to_drop 

Ideally, this function would return a pandas DataFrame with the columns column_1, column_2, and corr_coef, containing only the pairs whose correlation is above the threshold.

Something like this:

output = {'feature name 1': column_name,
          'feature name 2': index,
          'corr_coef': corr_coef}

output_list.append(output)
return pd.DataFrame(output_list).sort_values('corr_coef', ascending=False)

This should match the output you're looking for:

import pandas as pd
import numpy as np

# Create fake correlation matrix
corr_matrix = np.random.random_sample((5, 5))


ii, jj = np.triu_indices(corr_matrix.shape[0], 1)

scores = []

for i, j in zip(ii, jj):
    scores.append((i,j,corr_matrix[i,j]))

df_out = pd.DataFrame(data=scores,columns=['feature name 1','feature name 2','corr_coef'])\
           .sort_values('corr_coef', ascending=False)\
           .reset_index(drop=True)

threshold = 0.1

df_out[df_out['corr_coef'] > threshold]

#  feature name 1 feature name 2    corr_coef
# 0     0              2            0.990691
# 1     2              4            0.990444
# 2     0              1            0.830640
# 3     1              2            0.623895
# 4     1              4            0.433258
# 5     3              4            0.404395
# 6     0              4            0.291564
# 7     2              3            0.276799
# 8     1              3            0.177519

You can then map the feature indices (in the columns feature name 1 and feature name 2 above) to the columns of your df_1 to get the actual feature names.

So your complete function would look like this:

def find_correlated_features(df, threshold, target_variable):

    df_1 = df.drop(columns=target_variable)

    #corr_matrix has in index and columns names of variables
    corr_matrix = df_1.corr().abs().to_numpy()

    ii, jj = np.triu_indices(corr_matrix.shape[0], 1)

    scores = []

    for i, j in zip(ii, jj):
        scores.append((i,j,corr_matrix[i,j]))

    df_out = pd.DataFrame(data=scores,columns=['feature name 1','feature name 2','corr_coef'])\
               .sort_values('corr_coef', ascending=False)\
               .reset_index(drop=True)

    # triu_indices returns positions into the full correlation matrix,
    # so map each position to the column at that same position
    feature_name_map = {i:c for i,c in enumerate(df_1.columns)}

    df_out['feature name 1'] = df_out['feature name 1'].map(feature_name_map)
    df_out['feature name 2'] = df_out['feature name 2'].map(feature_name_map)

    return df_out[df_out['corr_coef'] > threshold] 
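
As a possible variation on the function above, the integer-to-name map can be skipped by indexing the column labels directly with the upper-triangle positions. This is only a sketch: the toy DataFrame, its column names, and the 0.8 threshold are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: "b" tracks "a" closely, "c" does not.
df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.1, 3.9, 6.2, 8.1, 9.8],
    "c": [5.0, 1.0, 4.0, 2.0, 3.0],
    "target": [0, 1, 0, 1, 0],
})

features = df.drop(columns="target")
corr = features.corr().abs()

# Row and column positions of the strict upper triangle (k=1 skips the diagonal).
ii, jj = np.triu_indices(len(corr.columns), k=1)

# Index the column labels with those positions instead of building a dict.
pairs = pd.DataFrame({
    "feature name 1": corr.columns[ii],
    "feature name 2": corr.columns[jj],
    "corr_coef": corr.to_numpy()[ii, jj],
})

threshold = 0.8  # assumed value for this example
result = (
    pairs[pairs["corr_coef"] > threshold]
    .sort_values("corr_coef", ascending=False)
    .reset_index(drop=True)
)
print(result)
```

With this data only the pair (a, b) clears the threshold, so the result has a single row.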

After edit:

After the OP's comment and @user6386471's answer, I reread the question, and I think simply restructuring the correlation matrix would work, with no need for loops: something like half_of_matrix.stack().reset_index() plus a filter. See:

def find_correlated_features(df, threshold, target_variable):
    # remove target column
    df = df.drop(columns=target_variable).copy()
    # Get correlation matrix
    corr_matrix = df.corr().abs()
    # Take half of the matrix to prevent doubling results
    corr_matrix = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k = 1).astype(bool))
    # Restructure correlation matrix to dataframe
    df = corr_matrix.stack().reset_index()
    df.columns = ['feature1', 'feature2', 'corr_coef']
    # Apply filter and sort coefficients
    df = df[df.corr_coef >= threshold].sort_values('corr_coef', ascending=False)
    return df
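
To make the reshaping step easier to follow, here is a small sketch of what masking and stack().reset_index() do to a correlation matrix. The 3x3 matrix is typed in by hand rather than computed from data, and the 0.4 threshold is an assumption. The explicit dropna() is there in case the pandas version in use keeps the masked NaN cells when stacking.

```python
import numpy as np
import pandas as pd

# Hypothetical absolute-correlation matrix, typed by hand for illustration.
corr = pd.DataFrame(
    [[1.0, 0.9, 0.2],
     [0.9, 1.0, 0.4],
     [0.2, 0.4, 1.0]],
    index=["a", "b", "c"],
    columns=["a", "b", "c"],
)

# Keep only the strict upper triangle; every other cell becomes NaN.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
upper = corr.where(mask)

# stack() turns the matrix into one row per (row label, column label) pair.
long_form = upper.stack().dropna().reset_index()
long_form.columns = ["feature1", "feature2", "corr_coef"]

threshold = 0.4  # assumed value for this example
result = (
    long_form[long_form["corr_coef"] >= threshold]
    .sort_values("corr_coef", ascending=False)
    .reset_index(drop=True)
)
print(result)
```

Each unordered pair appears exactly once because the lower triangle and the diagonal were masked out before stacking.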

Original answer:

You can easily create a Series with the coefficients above a threshold like this:

s = df.corr().loc[target_col]
s[s.abs() >= threshold]

where df is your DataFrame, target_col is your target column, and threshold is the threshold.

Example:

import pandas as pd
import seaborn as sns

df = sns.load_dataset('iris')

print(df.shape)
# -> (150, 5)

print(df.head())

   sepal_length  sepal_width  petal_length  petal_width species
0           5.1          3.5           1.4          0.2  setosa
1           4.9          3.0           1.4          0.2  setosa
2           4.7          3.2           1.3          0.2  setosa
3           4.6          3.1           1.5          0.2  setosa
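4           5.0          3.6           1.4          0.2  setosa

def find_correlated_features(df, threshold, target_variable):
    # numeric_only=True skips the non-numeric species column,
    # which pandas 2.0+ no longer drops silently
    s = df.corr(numeric_only=True).loc[target_variable].drop(target_variable)
    return s[s.abs() >= threshold]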

find_correlated_features(df, .7, 'sepal_length')

Output:

petal_length    0.871754
petal_width     0.817941
Name: sepal_length, dtype: float64

You can use .to_frame() followed by .T on the output to get a pandas DataFrame.
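
A short sketch of that conversion follows. The Series is rebuilt by hand from the output shown above so the snippet runs without seaborn. The second, long-format variant and its column names (feature, corr_coef) are an addition of mine, not part of the original suggestion.

```python
import pandas as pd

# Series rebuilt by hand from the output above (values copied, not recomputed).
s = pd.Series(
    {"petal_length": 0.871754, "petal_width": 0.817941},
    name="sepal_length",
)

# .to_frame() gives one column named sepal_length; .T flips it into one row.
wide = s.to_frame().T
print(wide)

# Alternative (assumed column names): one row per correlated feature.
long_form = s.rename_axis("feature").reset_index(name="corr_coef")
print(long_form)
```

The wide form has one row labelled with the target and one column per feature; the long form is usually easier to filter or sort further.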
