Randomized PCA .explained_variance_ratio_ sums to greater than one in sklearn 0.15.0

When I run this code with sklearn.__version__ 0.15.0 I get a strange result:

import numpy as np
from scipy import sparse
from sklearn.decomposition import RandomizedPCA

a = np.array([[1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
              [1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
              [0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0],
              [1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
              [1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
              [1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
              [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0],
              [1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0],
              [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0]])

s = sparse.csr_matrix(a)

pca = RandomizedPCA(n_components=20)
pca.fit_transform(s)

With 0.15.0 I get:

>>> pca.explained_variance_ratio_.sum()
2.1214285714285697

With 0.14.1 I get:

>>> pca.explained_variance_ratio_.sum()
0.99999999999999978

The sum should not be greater than 1. The documentation for explained_variance_ratio_ says:

Percentage of variance explained by each of the selected components. If k is not set then all components are stored and the sum of explained variances is equal to 1.0.

What is going on here?

The behavior in 0.14.1 was a bug: its explained_variance_ratio_.sum() always returned 1.0, irrespective of the number of components to extract (the truncation). In 0.15.0 this was fixed for dense arrays, as the following demonstrates:

>>> RandomizedPCA(n_components=3).fit(a).explained_variance_ratio_.sum()
0.86786547849848206
>>> RandomizedPCA(n_components=4).fit(a).explained_variance_ratio_.sum()
0.95868429631268515
>>> RandomizedPCA(n_components=5).fit(a).explained_variance_ratio_.sum()
1.0000000000000002
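The ratios are just the squared singular values of the centered data, normalized by the full spectrum, so a truncated sum over k components is bounded by 1 by construction. Here is a minimal sketch, using made-up data rather than the array from the question, that reproduces both the corrected normalization and the 0.14.1-style bug (normalizing by the truncated spectrum, so the sum is always exactly 1):

```python
import numpy as np

# Illustrative data (not the array from the question).
rng = np.random.RandomState(0)
X = rng.rand(9, 19)

# Singular values of the *centered* data, which is what PCA decomposes.
s = np.linalg.svd(X - X.mean(axis=0), compute_uv=False)

# Corrected (0.15.0) behavior: normalize by the full spectrum,
# so keeping only k components gives a sum <= 1.
k = 3
correct = (s[:k] ** 2) / (s ** 2).sum()
print(correct.sum() <= 1.0)    # True

# Buggy (0.14.1) behavior: normalize by the truncated spectrum,
# which sums to 1 regardless of k.
buggy = (s[:k] ** 2) / (s[:k] ** 2).sum()
print(round(buggy.sum(), 10))  # 1.0
```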

Your data has rank 5 (100% of the variance is explained by 5 components).
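You can check this claim directly. Since PCA operates on centered data, the relevant quantity is the rank of the array after subtracting the column means:

```python
import numpy as np

# The 9x19 array from the question.
a = np.array([[1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
              [1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
              [0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0],
              [1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
              [1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
              [1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
              [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0],
              [1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0],
              [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0]])

# PCA centers the data, so the rank that matters is that of a - mean.
centered = a - a.mean(axis=0)
print(np.linalg.matrix_rank(centered))  # 5
```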

If you try to call RandomizedPCA on a sparse matrix you will get:

DeprecationWarning: Sparse matrix support is deprecated and will be dropped in 0.16. Use TruncatedSVD instead.

Using RandomizedPCA on sparse data is incorrect: we cannot center the data without breaking its sparsity, which can blow up memory on realistically sized sparse data, yet centering is required for PCA.
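One way to see the relationship between the two estimators: on dense data, running TruncatedSVD on an explicitly centered copy reproduces PCA's explained variance ratios. A minimal sketch with made-up data (the 'arpack' solver is chosen here only to get exact singular values):

```python
import numpy as np
from sklearn.decomposition import PCA, TruncatedSVD

# Made-up dense data for illustration.
rng = np.random.RandomState(0)
X = rng.rand(20, 10)

# PCA centers internally; TruncatedSVD does not. Centering by hand
# before TruncatedSVD therefore reproduces PCA's ratios on dense data.
Xc = X - X.mean(axis=0)
pca_r = PCA(n_components=3).fit(X).explained_variance_ratio_
svd_r = TruncatedSVD(n_components=3, algorithm='arpack').fit(Xc).explained_variance_ratio_
print(np.allclose(pca_r, svd_r))  # True
```

On sparse data this hand-centering is exactly what is not affordable, which is why TruncatedSVD skips it and is therefore not identical to PCA.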

TruncatedSVD will give you correct explained variance ratios on sparse data (but keep in mind that it does not do exactly the same thing as PCA does on dense data):

>>> TruncatedSVD(n_components=3).fit(s).explained_variance_ratio_.sum()
0.67711305361490826
>>> TruncatedSVD(n_components=4).fit(s).explained_variance_ratio_.sum()
0.8771350212934137
>>> TruncatedSVD(n_components=5).fit(s).explained_variance_ratio_.sum()
0.95954459082530097
