I would like to write a function that determines which variable in a dataframe has the highest absolute correlation with a specific column. However, I am having difficulty getting the column name from the correlation matrix.
Say that my data, df, is as follows:
address | size | rent_price | number_of_bathrooms | number_of_rooms |
---|---|---|---|---|
East | 12 | 3400 | 2 | 4 |
North East | 99 | 4200 | 4 | 4 |
South | 99 | 4000 | 5 | 5 |
I use ab_col_matrix = abs(df.corr()) to generate the correlation matrix, which looks something like the following, with column names along the top and down the left-hand side:
1 value value value
value 1 value value
value value 1 value
value value value 1
Say that I am interested in the column most highly correlated with the size column. My idea is to sort the matrix by that column, take the first row, and return the name of the column with the highest value.
So I tried sorted = ab_col_matrix.sort_values('size', ascending = False) and then tried to pick the highest one with sorted['size'][1], but that returns only the value itself, not the column name, and I am puzzled about how to access it. I used [1] here because [0] would return 1, which is the correlation of the column with itself.
I would very much appreciate any help where I could gain more knowledge as to how to achieve this.
You can simply select the column for the variable you want and then sort the rows:
ab_col_matrix['size'].sort_values(ascending=False)
size 1.000000
rent_price 0.970725
number_of_bathrooms 0.944911
number_of_rooms 0.500000
Name: size, dtype: float64
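The column names are stored in the index of the sorted Series, so you can get the name of the most highly correlated column by taking the second index entry (the first is size itself):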
ab_col_matrix['size'].sort_values(ascending=False).index[1]
'rent_price'
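If you want this wrapped in a function, as the question asks, the following is a minimal sketch rather than a definitive implementation. The function name most_correlated is hypothetical, and the dataframe is rebuilt from the three sample rows in the question. It assumes the non-numeric address column should be excluded before computing correlations, and it drops the target column from the result instead of relying on position [1].

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "address": ["East", "North East", "South"],
    "size": [12, 99, 99],
    "rent_price": [3400, 4200, 4000],
    "number_of_bathrooms": [2, 4, 5],
    "number_of_rooms": [4, 4, 5],
})

def most_correlated(frame, target):
    """Return the name of the column with the highest absolute
    correlation to `target` (hypothetical helper, not a pandas API)."""
    # Assumption: only numeric columns take part in the correlation.
    numeric = frame.select_dtypes(include="number")
    abs_corr = numeric.corr()[target].abs()
    # Drop the target itself (its self-correlation is always 1),
    # then return the index label of the largest remaining value.
    return abs_corr.drop(target).idxmax()

print(most_correlated(df, "size"))
```

Dropping the target by name avoids depending on the sort order, which could matter if another column were perfectly correlated with the target and tied with it at 1.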