
Get pairs of variables from correlation matrix that minimize the sum of correlations

Let's say I compute a correlation matrix from a dataframe.

From all the variables, I want to select X of them such that the total sum of the pairwise correlations among the selected X variables is minimal.

How can I do this?

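Here is a brute-force solution that works but is not very efficient. It scores every subset by the sum of the absolute pairwise correlations and keeps the smallest. The example picks 3 out of 4 features; to pick, say, 6 out of 10, change n_features from 3 to 6 and list all ten columns in test_cols.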

import itertools

import numpy as np
import pandas as pd

# Correlation matrix stored as a dataframe; the 'vars' column holds the row labels.
foo = pd.DataFrame({'vars': ['col_a', 'col_b', 'col_c', 'col_d'],
                   'col_a': [1, 0.9, 0.04, 0.03],
                   'col_b': [0.9, 1, 0.05, 0.03],
                   'col_c': [0.04, 0.05, 1, -0.04],
                   'col_d': [0.03, 0.03, -0.04, 1]})
corr = foo.set_index('vars')  # index by variable name for direct label lookups

n_features = 3
test_cols = ['col_a', 'col_b', 'col_c', 'col_d']

# For every subset of n_features columns, sum the absolute pairwise correlations.
sum_l = {}
for subset in itertools.combinations(test_cols, n_features):
    sum_l[subset] = sum(np.abs(corr.loc[a, b])
                        for a, b in itertools.combinations(subset, 2))

print(sum_l)
print(min(sum_l, key=sum_l.get))  # subset with the smallest total
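
The same exhaustive search can also be run directly on a NumPy array, which avoids one dataframe lookup per pair. What follows is only a sketch: the function name min_abs_corr_subset is made up for illustration, and it assumes the correlation matrix is square and symmetric, with rows and columns in the same order as cols. It still enumerates every combination, so the number of subsets to check grows just as quickly as before.

```python
import itertools

import numpy as np


def min_abs_corr_subset(corr, cols, n_features):
    """Return (best_subset, best_score) found by exhaustive search.

    Hypothetical helper: assumes `corr` is a square, symmetric matrix whose
    rows and columns follow the order of `cols`.
    """
    abs_corr = np.abs(np.asarray(corr, dtype=float))
    best_subset, best_score = None, np.inf
    for idx in itertools.combinations(range(len(cols)), n_features):
        # Submatrix restricted to the candidate variables.
        sub = abs_corr[np.ix_(idx, idx)]
        # In a symmetric matrix each pair appears twice; drop the diagonal.
        score = (sub.sum() - np.trace(sub)) / 2
        if score < best_score:
            best_subset, best_score = tuple(cols[i] for i in idx), score
    return best_subset, best_score


cols = ['col_a', 'col_b', 'col_c', 'col_d']
corr = np.array([[1, 0.9, 0.04, 0.03],
                 [0.9, 1, 0.05, 0.03],
                 [0.04, 0.05, 1, -0.04],
                 [0.03, 0.03, -0.04, 1]])
best, score = min_abs_corr_subset(corr, cols, 3)
print(best, score)
```

On the 4x4 example above this selects col_a, col_c and col_d, the same subset the loop version reports.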
