Python: How do I write a function to determine which variable in a dataframe has the highest absolute correlation with a specified column?

Question

I would like to write a function to determine which variable in a dataframe has the highest absolute correlation with a specific column. However, I am having difficulty getting the column name from the correlation matrix.

Say that my data, df, is as follows:

address	size	rent_price	number_of_bathrooms	number_of_rooms
East	12	3400	2	4
North East	99	4200	4	4
South	99	4000	5	5

I use ab_col_matrix = abs(df.corr()) to generate the absolute correlation matrix, which looks something like the following, with column names along the top and the left-hand side of the matrix.

1 value value value 
value 1 value value 
value value 1 value 
value value value 1

Say that I am interested in the column most highly correlated with the size column. My idea is to sort the matrix by that column, take the first row, and return the column name with the highest value.

So I tried sorted = ab_col_matrix.sort_values('size', ascending=False)

Then I tried to pick the highest one with sorted['size'][1], but it only returns the value itself, not the column name, and I am puzzled as to how I could access that. I used [1] because [0] would return 1, which is the correlation of the column with itself.

I would very much appreciate any help where I could gain more knowledge as to how to achieve this.

Answer 1

You can simply select the column for the variable you want and then sort the rows:

ab_col_matrix['size'].sort_values(ascending=False)

size                   1.000000
rent_price             0.970725
number_of_bathrooms    0.944911
number_of_rooms        0.500000
Name: size, dtype: float64

The index of the sorted result holds the column names, so you can get the name of the most highly correlated column with the following (position 0 is size itself):

ab_col_matrix['size'].sort_values(ascending=False).index[1]

'rent_price'
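Since the question asks for a function, here is a minimal sketch that wraps the same idea. The name highest_abs_corr is made up for illustration. Instead of relying on index[1], it drops the target column and uses idxmax, which avoids picking the target itself if another column happens to tie at 1.0. It also assumes non-numeric columns such as address should be excluded before calling corr(), since recent pandas versions may raise an error on text columns.

```python
import pandas as pd


def highest_abs_corr(df, target):
    """Return (column name, absolute correlation) for the column
    most strongly correlated with `target`.

    `highest_abs_corr` is a hypothetical helper name, not a pandas API.
    """
    # Keep only numeric columns (assumption: text columns like 'address' are ignored)
    numeric = df.select_dtypes(include="number")
    # Absolute correlation of every numeric column with the target
    abs_corr = numeric.corr()[target].abs()
    # Remove the target itself, whose self-correlation is always 1
    abs_corr = abs_corr.drop(target)
    # idxmax returns the index label (the column name) of the largest value
    best = abs_corr.idxmax()
    return best, abs_corr[best]


# Sample data from the question
df = pd.DataFrame({
    "address": ["East", "North East", "South"],
    "size": [12, 99, 99],
    "rent_price": [3400, 4200, 4000],
    "number_of_bathrooms": [2, 4, 5],
    "number_of_rooms": [4, 4, 5],
})

name, value = highest_abs_corr(df, "size")
print(name, value)
```

On the sample data this should return 'rent_price', in line with the sorted output above.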
