Let's say I have a correlation matrix computed from a dataframe.
Among all the variables, I want to select X of them such that the total sum of the pairwise correlations within the selected set is minimal.
How can I do this?
Here is a not-so-efficient solution that works. It selects 3 out of 4 features, and can easily be extended to 6 out of 10 by changing n_features from 3 to 6:
import itertools

import numpy as np
import pandas as pd

# Correlation matrix stored with the variable names in the 'vars' column
foo = pd.DataFrame({'vars': ['col_a', 'col_b', 'col_c', 'col_d'],
                    'col_a': [1, 0.9, 0.04, 0.03],
                    'col_b': [0.9, 1, 0.05, 0.03],
                    'col_c': [0.04, 0.05, 1, -0.04],
                    'col_d': [0.03, 0.03, -0.04, 1]})

n_features = 3
test_cols = ['col_a', 'col_b', 'col_c', 'col_d']

sum_l = {}
# Try every combination of n_features columns
for l in itertools.combinations(test_cols, n_features):
    sum_l2 = 0
    # Add up the absolute correlation of every pair inside the combination
    for l2 in itertools.combinations(l, 2):
        sum_l2 += np.abs(foo.query('vars == @l2[0]')[l2[1]].values[0])
    sum_l[l] = sum_l2

print(sum_l)
# Combination with the smallest total absolute correlation
print(min(sum_l, key=sum_l.get))
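The same exhaustive search can probably be sped up by replacing the per-pair query calls with NumPy indexing on the matrix. The following is only a sketch: the helper name least_correlated_subset is made up for illustration, and it assumes the matrix is symmetric with rows and columns in the same order. It is still brute force, so the number of combinations grows quickly with the number of columns.

```python
import itertools

import numpy as np
import pandas as pd


def least_correlated_subset(corr, n_features):
    """Hypothetical helper: exhaustive search for the n_features columns
    with the smallest sum of absolute pairwise correlations.

    Assumes `corr` is a square, symmetric DataFrame whose rows and
    columns are in the same order.
    """
    abs_corr = corr.abs().to_numpy()
    names = list(corr.columns)
    best_sum, best_idx = np.inf, None
    for idx in itertools.combinations(range(len(names)), n_features):
        # Submatrix restricted to the candidate columns
        sub = abs_corr[np.ix_(idx, idx)]
        # Each pair appears twice in the symmetric submatrix;
        # drop the diagonal and halve the remainder
        total = (sub.sum() - np.trace(sub)) / 2
        if total < best_sum:
            best_sum, best_idx = total, idx
    return tuple(names[i] for i in best_idx), best_sum


foo = pd.DataFrame({'vars': ['col_a', 'col_b', 'col_c', 'col_d'],
                    'col_a': [1, 0.9, 0.04, 0.03],
                    'col_b': [0.9, 1, 0.05, 0.03],
                    'col_c': [0.04, 0.05, 1, -0.04],
                    'col_d': [0.03, 0.03, -0.04, 1]})

# Use the variable names as the row index so the frame is a plain square matrix
corr = foo.set_index('vars')
subset, total = least_correlated_subset(corr, 3)
print(subset, total)
```

On the example data this picks col_a, col_c and col_d, the same combination as the loop above.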