Python: How do I write a function to determine which variable in a dataframe has the highest absolute correlation with a specified column?
I would like to write a function that determines which variable in a dataframe has the highest absolute correlation with a specific column. However, I am having difficulty getting the column name from the correlation matrix.
Say that my data, `df`, is as follows:
address | size | rent_price | number_of_bathrooms | number_of_rooms |
---|---|---|---|---|
East | 12 | 3400 | 2 | 4 |
North East | 99 | 4200 | 4 | 4 |
South | 99 | 4000 | 5 | 5 |
I use `ab_col_matrix = abs(df.corr())` to generate the correlation matrix, which looks something like this, with column names along the top and the left-hand side:
1 value value value
value 1 value value
value value 1 value
value value value 1
Say that I am interested in the column most highly correlated with the `size` column. My idea is to sort by that column, take the first row, and return the column name with the highest value.
So I tried `sorted = ab_col_matrix.sort_values('size', ascending=False)` and then tried to pick the highest one with `sorted['size'][1]`, but that returns only the value itself, not the column name, and I am puzzled about how to access that. I used `[1]` because `[0]` would return 1, which is the column's correlation with itself.
I would very much appreciate any help or pointers on how to achieve this.
You can simply select the column for the variable you want and then sort its rows:
ab_col_matrix['size'].sort_values(ascending=False)
size 1.000000
rent_price 0.970725
number_of_bathrooms 0.944911
number_of_rooms 0.500000
Name: size, dtype: float64
You can then select the name of the most highly correlated variable as follows:
ab_col_matrix['size'].sort_values(ascending=False).index[1]
'rent_price'
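Since the question asks for a function, the steps above could be wrapped up roughly as follows. This is only a sketch: the name `highest_abs_corr` is made up for illustration, and two details differ from the snippets above. It keeps numeric columns only via `select_dtypes`, on the assumption that a text column such as `address` should be ignored (recent pandas versions may raise an error if `df.corr()` is called with such a column present). It also drops the target column and uses `idxmax()` instead of relying on `.index[1]`, so the result does not depend on the self-correlation sorting first.

```python
import pandas as pd


def highest_abs_corr(df: pd.DataFrame, target: str) -> str:
    """Return the numeric column with the highest absolute correlation to `target`.

    Hypothetical helper for illustration; not part of pandas.
    """
    # Keep numeric columns only, so a text column such as `address` is ignored.
    numeric = df.select_dtypes(include="number")
    # Absolute correlation of every numeric column with the target column.
    abs_corr = numeric.corr()[target].abs()
    # Drop the target itself (its self-correlation of 1 would otherwise win),
    # then return the index label of the largest remaining value.
    return abs_corr.drop(target).idxmax()


# The sample data from the question.
df = pd.DataFrame({
    "address": ["East", "North East", "South"],
    "size": [12, 99, 99],
    "rent_price": [3400, 4200, 4000],
    "number_of_bathrooms": [2, 4, 5],
    "number_of_rooms": [4, 4, 5],
})

print(highest_abs_corr(df, "size"))  # rent_price
```

If two columns tie for the highest value, `idxmax()` returns the first one in column order, which may or may not be what you want.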