
Python scikit learn pca.explained_variance_ratio_ cutoff

When choosing the number of principal components (k), we choose k to be the smallest value such that, for example, 99% of the variance is retained.

However, in Python scikit-learn, I am not 100% sure that pca.explained_variance_ratio_ = 0.99 is equal to "99% of variance is retained". Could anyone enlighten me? Thanks.

  • The Python scikit-learn PCA manual is here:

http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html#sklearn.decomposition.PCA

Yes, you are nearly right. The pca.explained_variance_ratio_ attribute returns a vector of the variance explained by each dimension. Thus pca.explained_variance_ratio_[i] gives the variance explained solely by the (i+1)-th dimension.

You probably want pca.explained_variance_ratio_.cumsum(). That will return a vector x such that x[i] gives the cumulative variance explained by the first i+1 dimensions.

import numpy as np
from sklearn.decomposition import PCA

np.random.seed(0)
my_matrix = np.random.randn(20, 5)

my_model = PCA(n_components=5)
my_model.fit_transform(my_matrix)

print(my_model.explained_variance_)                  # variance explained by each component
print(my_model.explained_variance_ratio_)            # the same, as fractions of the total
print(my_model.explained_variance_ratio_.cumsum())   # cumulative fractions

[ 1.50756565  1.29374452  0.97042041  0.61712667  0.31529082]
[ 0.32047581  0.27502207  0.20629036  0.13118776  0.067024  ]
[ 0.32047581  0.59549787  0.80178824  0.932976    1.        ]

So in my random toy data, if I picked k=4 I would retain 93.3% of the variance.
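
If you want that cutoff computed for you, a minimal sketch (assuming the my_model fitted above and a 0.99 target) is to search the cumulative sum for the first component count that reaches the threshold:

target = 0.99
cumsum = my_model.explained_variance_ratio_.cumsum()
# np.searchsorted returns the first index where cumsum >= target;
# +1 turns the 0-based index into a number of components.
k = np.searchsorted(cumsum, target) + 1
print(k)  # 5 for the toy data above, since k=4 only reaches ~0.933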

Although this question is older than 2 years, I want to provide an update. I wanted to do the same thing, and it looks like sklearn now provides this feature out of the box.

As stated in the docs:

if 0 < n_components < 1 and svd_solver == 'full', select the number of components such that the amount of variance that needs to be explained is greater than the percentage specified by n_components

So the code required is now:

my_model = PCA(n_components=0.99, svd_solver='full')
my_model.fit_transform(my_matrix)
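
After fitting with a fractional n_components, the model records how many components it actually kept in the fitted attribute n_components_, so you can verify the cutoff:

print(my_model.n_components_)                    # number of components selected
print(my_model.explained_variance_ratio_.sum())  # total retained variance, >= 0.99

For the 20x5 toy matrix above this selects all 5 components, since 4 components only explain ~93.3% of the variance.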

This worked for me with even less typing in the PCA section. The rest is added for convenience; only 'data' needs to be defined at an earlier stage (a hypothetical stand-in is sketched after the code).

from sklearn.preprocessing import StandardScaler as ss
from sklearn.decomposition import PCA

st = ss().fit_transform(data)   # standardize the features first
pca = PCA(0.80)                 # keep enough components to explain 80% of the variance
pc = pca.fit_transform(st)      # << to retain the components in an object
pc

# pca.explained_variance_ratio_
print("Components = ", pca.n_components_, ";\nTotal explained variance = ",
      round(pca.explained_variance_ratio_.sum(), 5))
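
As promised above, a hypothetical stand-in for data to make the snippet self-contained; any 2-D numeric array of shape (n_samples, n_features) will do:

import numpy as np

# hypothetical example data: 100 samples, 10 features
data = np.random.randn(100, 10)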
