How to print correlated features using Python pandas?
I'm trying to get some information on the correlation between the independent variables.
My dataset has a lot of variables, so a heatmap is not a solution: it is unreadable.
Currently, I have a function that returns only those variables that are highly correlated. I would like to change it so that it reports pairs of correlated features.
The rest of the explanation is below:
import numpy as np

def find_correlated_features(df, threshold, target_variable):
    # drop the target column (columns= is needed; a bare drop() targets row labels)
    df_1 = df.drop(columns=target_variable)
    # corr_matrix has the variable names as both index and columns
    corr_matrix = df_1.corr().abs()
    # keep only the upper triangle so each pair is counted once
    half_of_matrix = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
    # list the columns that are correlated with at least one other column
    to_drop = [column for column in half_of_matrix.columns if any(half_of_matrix[column] > threshold)]
    return to_drop
Ideally, this function would return a pandas dataframe with the columns column_1, column_2, and corr_coef, containing only the pairs of variables whose correlation is above the threshold.
Something like this:
output = {'feature name 1': column_name,
          'feature name 2': index,
          'corr_coef': corr_coef}
output_list.append(output)
return pd.DataFrame(output_list).sort_values('corr_coef', ascending=False)
This should match the output you're looking for:
import pandas as pd
import numpy as np

# Create fake correlation matrix
corr_matrix = np.random.random_sample((5, 5))
# Row and column positions of the upper triangle, excluding the diagonal
ii, jj = np.triu_indices(corr_matrix.shape[0], 1)
scores = []
for i, j in zip(ii, jj):
    scores.append((i, j, corr_matrix[i, j]))
df_out = pd.DataFrame(data=scores, columns=['feature name 1', 'feature name 2', 'corr_coef'])\
    .sort_values('corr_coef', ascending=False)\
    .reset_index(drop=True)
threshold = 0.1
df_out[df_out['corr_coef'] > threshold]
#    feature name 1  feature name 2  corr_coef
# 0               0               2   0.990691
# 1               2               4   0.990444
# 2               0               1   0.830640
# 3               1               2   0.623895
# 4               1               4   0.433258
# 5               3               4   0.404395
# 6               0               4   0.291564
# 7               2               3   0.276799
# 8               1               3   0.177519
You can then map the feature indices (in the columns feature name 1 and feature name 2 above) to the columns of your df_1 to get the actual feature names.
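As a rough sketch of that mapping step: the positions returned by np.triu_indices index the full correlation matrix, so they can be looked up in df_1.columns directly. The small df_1 below and its column names a, b, c are made up for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df_1: three numeric feature columns.
df_1 = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [2.0, 4.0, 6.0, 8.5],
    "c": [4.0, 1.0, 3.0, 2.0],
})

corr = df_1.corr().abs().to_numpy()
# Upper-triangle positions, diagonal excluded: (0, 1), (0, 2), (1, 2)
ii, jj = np.triu_indices(corr.shape[0], 1)

# Look the positions up in the full column list (no offset needed).
pairs = pd.DataFrame({
    "feature name 1": df_1.columns[ii],
    "feature name 2": df_1.columns[jj],
    "corr_coef": corr[ii, jj],
})
print(pairs)
```

Indexing with the arrays ii and jj also avoids the explicit Python loop, which may matter when there are many features.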
So your complete function would look like this:
def find_correlated_features(df, threshold, target_variable):
    # drop the target column (columns= is needed; a bare drop() targets row labels)
    df_1 = df.drop(columns=target_variable)
    # absolute correlations as a plain NumPy array
    corr_matrix = df_1.corr().abs().to_numpy()
    # row and column positions of the upper triangle, excluding the diagonal
    ii, jj = np.triu_indices(corr_matrix.shape[0], 1)
    scores = []
    for i, j in zip(ii, jj):
        scores.append((i, j, corr_matrix[i, j]))
    df_out = pd.DataFrame(data=scores, columns=['feature name 1', 'feature name 2', 'corr_coef'])\
        .sort_values('corr_coef', ascending=False)\
        .reset_index(drop=True)
    # i and j are positions in the full correlation matrix, so map them
    # against all columns of df_1, without any offset
    feature_name_map = {i: c for i, c in enumerate(df_1.columns)}
    df_out['feature name 1'] = df_out['feature name 1'].map(feature_name_map)
    df_out['feature name 2'] = df_out['feature name 2'].map(feature_name_map)
    return df_out[df_out['corr_coef'] > threshold]
After edit:
After the OP's comment and @user6386471's answer, I reread the question, and I think simply restructuring the correlation matrix would work, with no need for loops: something like half_of_matrix.stack().reset_index() plus filters. See:
def find_correlated_features(df, threshold, target_variable):
    # remove target column
    df = df.drop(columns=target_variable).copy()
    # Get correlation matrix
    corr_matrix = df.corr().abs()
    # Take half of the matrix to prevent doubling results
    # (plain bool: the np.bool alias was removed from recent NumPy releases)
    corr_matrix = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
    # Restructure correlation matrix to dataframe
    df = corr_matrix.stack().reset_index()
    df.columns = ['feature1', 'feature2', 'corr_coef']
    # Apply filter and sort coefficients
    df = df[df.corr_coef >= threshold].sort_values('corr_coef', ascending=False)
    return df
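To see what the mask, stack and filter steps do, here is a minimal sketch on a hand-written 3x3 matrix. The matrix values, the labels a, b, c and the 0.5 threshold are all made up for illustration; they do not come from the question's data.

```python
import numpy as np
import pandas as pd

# Hypothetical 3x3 absolute-correlation matrix with made-up values.
corr_matrix = pd.DataFrame(
    [[1.0, 0.9, 0.2],
     [0.9, 1.0, 0.6],
     [0.2, 0.6, 1.0]],
    index=list("abc"), columns=list("abc"),
)

# Keep only the strict upper triangle; every other cell becomes NaN.
mask = np.triu(np.ones(corr_matrix.shape, dtype=bool), k=1)
upper = corr_matrix.where(mask)

# stack() turns the matrix into a Series indexed by (row, column) pairs.
pairs = upper.stack().reset_index()
pairs.columns = ["feature1", "feature2", "corr_coef"]

# NaN never satisfies >=, so masked cells are excluded by the filter
# whether or not stack() has already dropped them.
result = pairs[pairs["corr_coef"] >= 0.5].sort_values("corr_coef", ascending=False)
print(result)
```

Only the pairs (a, b) and (b, c) survive the filter; the pair (a, c) at 0.2 falls below the threshold.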
Original answer:
You can easily create a Series with the coefficients above a threshold like this:
s = df.corr().loc[target_col]
s[s.abs() >= threshold]
where df is your dataframe, target_col is your target column, and threshold is the threshold.
Example:
import pandas as pd
import seaborn as sns

df = sns.load_dataset('iris')
print(df.shape)
# -> (150, 5)
print(df.head())
#    sepal_length  sepal_width  petal_length  petal_width species
# 0           5.1          3.5           1.4          0.2  setosa
# 1           4.9          3.0           1.4          0.2  setosa
# 2           4.7          3.2           1.3          0.2  setosa
# 3           4.6          3.1           1.5          0.2  setosa
# 4           5.0          3.6           1.4          0.2  setosa
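def find_correlated_features(df, threshold, target_variable):
    # restrict to numeric columns first: recent pandas versions raise an error
    # when corr() meets a string column such as species
    s = df.select_dtypes('number').corr().loc[target_variable].drop(target_variable)
    return s[s.abs() >= threshold]

find_correlated_features(df, .7, 'sepal_length')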
Output:
petal_length    0.871754
petal_width     0.817941
Name: sepal_length, dtype: float64
You can call .to_frame() followed by .T on the output to get a pandas dataframe.
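A short sketch of that conversion, using a Series built by hand from the two values printed above. The names wide and long, and the column labels feature and corr_coef in the second variant, are illustrative choices and not part of the original answer.

```python
import pandas as pd

# Hand-built Series shaped like the function's return value above.
s = pd.Series(
    {"petal_length": 0.871754, "petal_width": 0.817941},
    name="sepal_length",
)

# .to_frame() gives one column named 'sepal_length'; .T flips it into
# a single row with one column per correlated feature.
wide = s.to_frame().T
print(wide)

# Alternative (an assumption about what may be more convenient):
# one row per feature, with the coefficient in its own column.
long = s.rename("corr_coef").rename_axis("feature").reset_index()
print(long)
```

The wide form keeps the target name as the row label, while the long form may be easier to sort or filter further.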