
How should I interpret the output of pca.components_

I was reading the post "Recovering features names of explained_variance_ratio_ in PCA with sklearn" and I wanted to understand the output of the following line of code:

pd.DataFrame(pca.components_, columns=subset.columns)

At first, I thought that the PCA components from sklearn would tell me how much of the variance is explained by each feature (I guess this is the interpretation of PCA, right?). However, I now think that this is wrong, and that the explained variance is given by pca.explained_variance_.

Also, the output of the dataframe constructed with the line above is very confusing to me, because it has several rows and it also contains negative numbers.

Furthermore, how does the dataframe constructed above relate to the following plot:

plt.bar(range(pca.n_components_), pca.explained_variance_)

I'm really confused about the PCA components and the variance.

If an example is needed, we can build a PCA on the iris dataset. This is what I've done so far:

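import matplotlib.pyplot as plt
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# `iris` is assumed to be a pandas DataFrame loaded earlier; columns 1-4 are taken as the features
subset = iris.iloc[:, 1:5]
scaler = StandardScaler()
pca = PCA()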

pipe = make_pipeline(scaler, pca)
pipe.fit(subset)

# Plot the explained variances
features = range(pca.n_components_)
_ = plt.bar(features, pca.explained_variance_)

# Dump components relations with features:
pd.DataFrame(pca.components_, columns=subset.columns)

In PCA, the components (in sklearn, the components_) are linear combinations of the original features, chosen to maximize the variance of the projected data. So they are vectors that combine the input features: each row of components_ is one such vector, with one weight per input feature. That is why the dataframe has several rows, and a negative number is not an error; it only means that the feature enters that combination with the opposite sign.
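To make the "linear combination" reading concrete, here is a minimal sketch. It assumes scikit-learn's bundled iris data (load_iris) instead of the asker's own DataFrame, and the PC1, PC2, ... row labels are only illustrative names. It labels the rows of components_ and checks that projecting the centered data onto those rows gives the same result as pca.transform.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Assumption: use scikit-learn's bundled iris data, not the asker's DataFrame.
X = load_iris(as_frame=True).data          # 150 rows x 4 feature columns
X_scaled = StandardScaler().fit_transform(X)

pca = PCA().fit(X_scaled)

# One row per principal component, one column per original feature.
loadings = pd.DataFrame(
    pca.components_,
    columns=X.columns,
    index=[f"PC{i + 1}" for i in range(pca.n_components_)],
)
print(loadings.round(3))

# Each row is a unit-length direction in feature space.
row_norms = np.linalg.norm(pca.components_, axis=1)
print(row_norms)

# Projecting the centered data onto the rows reproduces pca.transform.
scores_manual = (X_scaled - pca.mean_) @ pca.components_.T
scores_sklearn = pca.transform(X_scaled)
print(np.allclose(scores_manual, scores_sklearn))
```

Reading one row of the printed table left to right gives the weights that one component puts on each feature. The sign of a whole row is arbitrary, so flipping every sign in a row describes the same direction.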

In sklearn, as the PCA documentation notes, the components_ are sorted by their explained variance (explained_variance_), from the highest to the lowest value. So the i-th row of components_ corresponds to the i-th value of explained_variance_.
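The pairing between rows of components_ and entries of explained_variance_ can be checked directly. The sketch below again assumes scikit-learn's bundled iris data rather than the asker's DataFrame. It compares the sample variance of the data projected onto each component with explained_variance_, which are the values the bar chart in the question plots, one bar per row of components_.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Assumption: scikit-learn's bundled iris data stands in for the asker's subset.
X_scaled = StandardScaler().fit_transform(load_iris().data)
pca = PCA().fit(X_scaled)

# Variance of the data along each component (sample variance, ddof=1).
scores = pca.transform(X_scaled)
score_variances = scores.var(axis=0, ddof=1)

for i, (var, ratio) in enumerate(
    zip(pca.explained_variance_, pca.explained_variance_ratio_)
):
    print(f"component {i}: variance={var:.3f}, share={ratio:.1%}")

# The values are in non-increasing order, matching the row order of components_.
is_sorted = bool(np.all(np.diff(pca.explained_variance_) <= 0))
print(is_sorted)
```

Since all components are kept here, the shares in explained_variance_ratio_ add up to 1.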

A useful link on PCA: https://online.stat.psu.edu/stat505/lesson/11


 