
Recovering feature names of explained_variance_ratio_ in PCA with sklearn

I'm trying to recover, from a PCA done with scikit-learn, which features are selected as relevant.

A classic example with the IRIS dataset.

import pandas as pd
import pylab as pl
from sklearn import datasets
from sklearn.decomposition import PCA

# load dataset
iris = datasets.load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)

# normalize data
df_norm = (df - df.mean()) / df.std()

# PCA
pca = PCA(n_components=2)
pca.fit_transform(df_norm.values)
print(pca.explained_variance_ratio_)

This returns

In [42]: pca.explained_variance_ratio_
Out[42]: array([ 0.72770452,  0.23030523])

How can I recover which two features produce these two values of explained variance in the dataset? Said differently, how can I get the indices of these features in iris.feature_names?

In [47]: print(iris.feature_names)
['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']

Thanks in advance for your help.

This information is included in the pca attribute: components_. As described in the documentation, pca.components_ outputs an array of shape [n_components, n_features], so to get how components are linearly related to the different features, you have to:

Note: each coefficient represents the weight (loading) of a particular feature on a particular component.

import pandas as pd
import pylab as pl
from sklearn import datasets
from sklearn.decomposition import PCA

# load dataset
iris = datasets.load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)

# normalize data
from sklearn import preprocessing
data_scaled = pd.DataFrame(preprocessing.scale(df), columns=df.columns)

# PCA
pca = PCA(n_components=2)
pca.fit_transform(data_scaled)

# Dump components relations with features:
print(pd.DataFrame(pca.components_, columns=data_scaled.columns, index=['PC-1', 'PC-2']))

      sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
PC-1           0.522372         -0.263355           0.581254          0.565611
PC-2          -0.372318         -0.925556          -0.021095         -0.065416

IMPORTANT: As a side comment, note that the PCA sign does not affect its interpretation, since the sign does not affect the variance contained in each component. Only the relative signs of the features forming a PCA dimension are important. In fact, if you run the PCA code again, you might get the PCA dimensions with the signs inverted. For an intuition about this, think about a vector and its negative in 3-D space - both essentially represent the same direction in space. Check this post for further reference.
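
As a minimal sketch of that point (assuming the pca and data_scaled objects from the snippet above), the variance captured along a component is unchanged when its sign is flipped:

import numpy as np

comp = pca.components_[0]                 # first principal direction
proj_pos = data_scaled.values @ comp      # scores along the component
proj_neg = data_scaled.values @ (-comp)   # scores along its negation
print(np.isclose(proj_pos.var(), proj_neg.var()))   # True: same variance, same interpretation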

Edit: as others have commented, you may get the same values from the .components_ attribute.


Each principal component is a linear combination of the original variables:

PC = Beta_1 * X_1 + Beta_2 * X_2 + ... + Beta_p * X_p

where the X_i are the original variables, and the Beta_i are the corresponding weights, or so-called coefficients.

To obtain the weights, you may simply pass an identity matrix to the transform method:

>>> import numpy as np
>>> i = np.identity(df.shape[1])  # identity matrix
>>> i
array([[ 1.,  0.,  0.,  0.],
       [ 0.,  1.,  0.,  0.],
       [ 0.,  0.,  1.,  0.],
       [ 0.,  0.,  0.,  1.]])

>>> coef = pca.transform(i)
>>> coef
array([[ 0.5224, -0.3723],
       [-0.2634, -0.9256],
       [ 0.5813, -0.0211],
       [ 0.5656, -0.0654]])

Each column of the coef matrix above shows the weights in the linear combination which yields the corresponding principal component:

>>> pd.DataFrame(coef, columns=['PC-1', 'PC-2'], index=df.columns)
                    PC-1   PC-2
sepal length (cm)  0.522 -0.372
sepal width (cm)  -0.263 -0.926
petal length (cm)  0.581 -0.021
petal width (cm)   0.566 -0.065

[4 rows x 2 columns]

For example, the above shows that the second principal component (PC-2) is mostly aligned with sepal width, which has the highest weight of 0.926 in absolute value.
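
The same observation can be read off programmatically; a quick check, assuming the coef and df objects from the session above (weights is just a throwaway name introduced here):

>>> weights = pd.DataFrame(coef, columns=['PC-1', 'PC-2'], index=df.columns)
>>> weights.abs().idxmax()   # feature with the largest absolute weight per component
PC-1    petal length (cm)
PC-2     sepal width (cm)
dtype: object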

You can also confirm that the coefficient vectors returned by PCA are unit vectors, i.e. each has norm 1.0:

>>> np.linalg.norm(coef,axis=0)
array([ 1.,  1.])

One may also confirm that the principal components can be calculated as the dot product of the above coefficients and the original variables:

>>> np.allclose(df_norm.values.dot(coef), pca.fit_transform(df_norm.values))
True

Note that we need to use numpy.allclose instead of the regular equality operator, because of floating point precision error.
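
Relating this back to the .components_ attribute mentioned earlier: the weights obtained through the identity-matrix trick are simply the transpose of pca.components_. A quick sanity check, assuming the same session as above:

>>> np.allclose(coef, pca.components_.T)
True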

The way this question is phrased reminds me of a misunderstanding of Principal Component Analysis when I was first trying to figure it out. I'd like to go through it here in the hope that others won't spend as much time on a road to nowhere as I did before the penny finally dropped.

The notion of “recovering” feature names suggests that PCA identifies those features that are most important in a dataset. That's not strictly true.

PCA, as I understand it, identifies the features with the greatest variance in a dataset, and can then use this quality of the dataset to create a smaller dataset with a minimal loss of descriptive power. The advantages of a smaller dataset are that it requires less processing power and should have less noise in the data. But the features of greatest variance are not the "best" or "most important" features of a dataset, insofar as such concepts can be said to exist at all.

To bring that theory into the practicalities of @Rafa's sample code above:

# load dataset
iris = datasets.load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)

# normalize data
from sklearn import preprocessing
data_scaled = pd.DataFrame(preprocessing.scale(df), columns=df.columns)

# PCA
pca = PCA(n_components=2)
pca.fit_transform(data_scaled)

consider the following:

post_pca_array = pca.fit_transform(data_scaled)

print(data_scaled.shape)
(150, 4)

print(post_pca_array.shape)
(150, 2)

In this case, post_pca_array has the same 150 rows of data as data_scaled, but data_scaled's four columns have been reduced to two.
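
To put a number on the "minimal loss of descriptive power" mentioned earlier, a quick check (assuming the pca object fitted above) sums the explained variance ratios:

print(pca.explained_variance_ratio_.sum())
# roughly 0.958 for this iris example, i.e. the two components retain about 96% of the total variance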

The critical point here is that the two columns – or components, to be terminologically consistent – of post_pca_array are not the two "best" columns of data_scaled. They are two new columns, determined by the algorithm behind sklearn.decomposition's PCA module. The second column, PC-2 in @Rafa's example, is informed by sepal_width more than by any other column, but the values in PC-2 and data_scaled['sepal_width'] are not the same.

As such, while it's interesting to find out how much each column in the original data contributed to the components of a post-PCA dataset, the notion of “recovering” column names is a little misleading, and it certainly misled me for a long time. The only situation where there would be a match between post-PCA and original columns would be if the number of principal components were set to the same number as columns in the original. However, there would be no point in using the same number of columns, because the data would not have changed. You would only have gone there to come back again, as it were.
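
As a small sketch of that last point (assuming data_scaled and PCA from above; pca_full and round_trip are just illustrative names), keeping as many components as there are original columns reconstructs the data exactly, which is why doing so gains nothing:

import numpy as np

pca_full = PCA(n_components=4)   # as many components as original columns
round_trip = pca_full.inverse_transform(pca_full.fit_transform(data_scaled))
print(np.allclose(round_trip, data_scaled))   # True: you went there and came straight back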

The important features are the ones that influence the components more, and thus have a large absolute value/coefficient/loading on the component.

Get the most important feature name on each PC:

from sklearn.decomposition import PCA
import pandas as pd
import numpy as np
np.random.seed(0)

# 10 samples with 5 features
train_features = np.random.rand(10,5)

model = PCA(n_components=2).fit(train_features)
X_pc = model.transform(train_features)

# number of components
n_pcs= model.components_.shape[0]

# get the index of the most important feature on EACH component i.e. largest absolute value
# using LIST COMPREHENSION HERE
most_important = [np.abs(model.components_[i]).argmax() for i in range(n_pcs)]

initial_feature_names = ['a','b','c','d','e']

# get the names
most_important_names = [initial_feature_names[most_important[i]] for i in range(n_pcs)]

# using a DICT COMPREHENSION here to map each PC to its most important feature name
dic = {'PC{}'.format(i+1): most_important_names[i] for i in range(n_pcs)}

# build the dataframe
df = pd.DataFrame(sorted(dic.items()))

This prints:

     0  1
 0  PC1  e
 1  PC2  d

Conclusion/Explanation:

So on PC1 the feature named e is the most important, and on PC2 it is d.

Given your fitted estimator pca, the components are to be found in pca.components_, which represent the directions of highest variance in the dataset.
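
A minimal look at that attribute, assuming the pca fitted in the question's snippet:

print(pca.components_.shape)   # (2, 4): one row per principal component, one column per original feature
print(pca.components_)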
