How to rank features correctly from PCA's eigenvectors
My goal is to rank the features of a supervised machine-learning dataset by their contributions to the principal components, thanks to this answer.
I set up an experiment in which I construct a dataset containing 3 informative, 3 redundant, and 3 noise features, in that order. Then I find the index of the largest component on each principal axis.
However, this method gives me a really poor ranking. I don't know what mistake I have made. Many thanks for helping. Here is my code:
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
import pandas as pd
import numpy as np
# Make a dataset which contains 3 informative, 3 redundant and 3 noise features respectively
X, _ = make_classification(n_samples=20, n_features=9, n_informative=3,
                           n_redundant=3, random_state=0, shuffle=False)
cols = ['I_'+str(i) for i in range(3)]
cols += ['R_'+str(i) for i in range(3)]
cols += ['N_'+str(i) for i in range(3)]
dfX = pd.DataFrame(X, columns=cols)
# Rank features by the largest absolute component on each principal axis
model = PCA().fit(dfX)
_ = model.transform(dfX)
n_pcs = model.components_.shape[0]
most_important = [np.abs(model.components_[i]).argmax() for i in range(n_pcs)]
most_important_names = [dfX.columns[most_important[i]] for i in range(n_pcs)]
rank = {'PC{}'.format(i): most_important_names[i] for i in range(n_pcs)}
rank outputs:
{'PC0': 'R_1',
'PC1': 'I_1',
'PC2': 'N_1',
'PC3': 'N_0',
'PC4': 'N_2',
'PC5': 'I_2',
'PC6': 'R_1',
'PC7': 'R_0',
'PC8': 'R_2'}
I am expecting to see the informative features I_x ranked in the top 3.
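For reference, here is a sketch of a common alternative aggregation (not the method from the linked answer): instead of keeping only the single largest loading per component, weight each feature's absolute loadings by the explained-variance ratio of every component and sum across components, giving one importance score per feature.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA

# Rebuild the same dataset as in the question
X, _ = make_classification(n_samples=20, n_features=9, n_informative=3,
                           n_redundant=3, random_state=0, shuffle=False)
cols = (['I_' + str(i) for i in range(3)]
        + ['R_' + str(i) for i in range(3)]
        + ['N_' + str(i) for i in range(3)])
dfX = pd.DataFrame(X, columns=cols)

model = PCA().fit(dfX)
# |loadings| weighted by explained-variance ratio, summed over components:
# scores[j] = sum_i ratio[i] * |components_[i, j]|
scores = np.abs(model.components_).T @ model.explained_variance_ratio_
importance = pd.Series(scores, index=dfX.columns).sort_values(ascending=False)
print(importance)
```

This gives every feature a score that accounts for all components at once, weighted by how much variance each component explains, rather than discarding all but one loading per axis.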
PCA's ranking criterion is the variance of each column. If you would like a ranking, what you can do is output the variance of each column via VarianceThreshold. You can do that like this:
from sklearn.feature_selection import VarianceThreshold
selector = VarianceThreshold()
selector.fit_transform(dfX)
print(selector.variances_)
# outputs [1.57412087 1.08363799 1.11752334 0.58501874 2.2983772 0.2857617
# 1.09782539 0.98715471 0.93262548]
From this you can see that the first 3 columns (I_0, I_1, I_2) all have relatively high variance, which makes them good candidates for use with PCA.
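If you want the ranking itself rather than just the raw variances, a minimal sketch (rebuilding the same dfX as in the question) is to sort the feature names by selector.variances_:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import VarianceThreshold

# Rebuild the same dataset as in the question
X, _ = make_classification(n_samples=20, n_features=9, n_informative=3,
                           n_redundant=3, random_state=0, shuffle=False)
cols = (['I_' + str(i) for i in range(3)]
        + ['R_' + str(i) for i in range(3)]
        + ['N_' + str(i) for i in range(3)])
dfX = pd.DataFrame(X, columns=cols)

selector = VarianceThreshold()
selector.fit(dfX)

# Pair each column name with its variance and sort, highest variance first
ranking = sorted(zip(dfX.columns, selector.variances_),
                 key=lambda t: t[1], reverse=True)
for name, var in ranking:
    print(f'{name}: {var:.4f}')
```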