
Sklearn PCA explained variance and explained variance ratio difference

I'm trying to get the variances from the eigenvectors.

What is the difference between explained_variance_ratio_ and explained_variance_ in PCA?

The example used by @seralouk unfortunately has only 2 features and keeps both components, so the explanation of pca.explained_variance_ratio_ is incomplete.

The denominator should be the total variance of the original set of features before PCA was applied (equivalently, the sum of explained_variance_ over all components), where the total number of components can be greater than the number of components kept by PCA.

Here is an explanation of this quantity using the iris dataset.

import numpy as np
from sklearn import datasets
from sklearn.decomposition import PCA

iris=datasets.load_iris()

X = iris.data     
#y = iris.target   

pca_2c_model=PCA(n_components=2)
x_pca_2c=pca_2c_model.fit_transform(X) 

print('Explained variance:\n', pca_2c_model.explained_variance_)

print('Explained variance ratio:\n', pca_2c_model.explained_variance_ratio_)

returns

Explained variance: 
 [4.22824171 0.24267075]
Explained variance ratio:
 [0.92461872 0.05306648]

The quantity pca_2c_model.explained_variance_ contains the diagonal elements of the covariance matrix of the two principal components. For principal components, the covariance matrix is diagonal by definition.

var=np.cov(x_pca_2c.T)
explained_var=var.diagonal()
print('Explained variance calculated manually is\n',explained_var)

returns

Explained variance calculated manually is
 [4.22824171 0.24267075]

To calculate the ratio, the denominator has to be computed on the original set of features before PCA (i.e. over all components). So we can use the trace of the covariance matrix of the full feature set. Here we rely on trace invariance: the trace of a matrix does not change under an orthogonal change of basis, so the total variance of the original features equals the total variance over all principal components.

all_var=np.cov(X.T)
sum_all_var=np.sum(all_var.diagonal()) # same as np.trace(all_var)
explained_var_ratio=explained_var/sum_all_var
print('Explained variance ratio calculated manually is\n',explained_var_ratio)

returns

Explained variance ratio calculated manually is
 [0.92461872 0.05306648]
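
The trace-invariance argument can be checked numerically. The following is a minimal sketch using only NumPy; the synthetic data set X_demo and the variable names are assumptions made for illustration and are not part of the original answer. It compares the sum of the per-feature variances (the trace of the covariance matrix) with the sum of its eigenvalues (the variances along all principal axes).

```python
import numpy as np

# Synthetic data (an assumption for illustration): 200 samples, 4 correlated features.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4)) @ rng.normal(size=(4, 4))

cov = np.cov(X_demo.T)                 # 4x4 covariance matrix of the original features
eigenvalues = np.linalg.eigvalsh(cov)  # variances along the principal axes

total_from_trace = np.trace(cov)       # sum of the per-feature variances
total_from_eigs = eigenvalues.sum()    # sum of the per-component variances

# The two totals agree up to floating-point error.
print(np.isclose(total_from_trace, total_from_eigs))
```

This is why the denominator can be computed from the original features without running a full PCA first.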

Further,

print(sum(explained_var_ratio))

returns

0.9776852063187955

So the entries of explained_variance_ratio_ do not sum to 1.0. The small remainder (about 0.022) is the variance carried by the other components of the original feature space, which were dropped.
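
To see where the remaining variance goes, one option is to fit a second PCA that keeps every component and compare it with the 2-component model. This is a sketch under the assumption that scikit-learn is installed; the names pca_full, pca_2c and full_ratio are chosen here for illustration.

```python
import numpy as np
from sklearn import datasets
from sklearn.decomposition import PCA

X = datasets.load_iris().data

pca_full = PCA().fit(X)              # n_components not set: all 4 components are kept
pca_2c = PCA(n_components=2).fit(X)  # same model as in the answer above

full_ratio = pca_full.explained_variance_ratio_

print(full_ratio.sum())       # ratios over all components sum to 1 (up to rounding)
print(full_ratio[2:].sum())   # share of variance held by components 3 and 4
print(1.0 - pca_2c.explained_variance_ratio_.sum())  # the same share, seen from the 2-component model
```

The first two entries of full_ratio coincide with the ratios reported by the 2-component model, because the denominator is the same total variance in both cases.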

The fraction (percentage) of the variance explained by each component is:

explained_variance_ratio_

The variance, i.e. the eigenvalues of the covariance matrix, is:

explained_variance_

Formula (valid when all components are kept): explained_variance_ratio_ = explained_variance_ / np.sum(explained_variance_)

Example:

import numpy as np
from sklearn.decomposition import PCA

X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
pca = PCA(n_components=2)
pca.fit(X)

# the actual eigenvalues (variance)
print(pca.explained_variance_)        # [7.93954312 0.06045688]

# the percentage of the variance
print(pca.explained_variance_ratio_)  # [0.99244289 0.00755711]

Also based on the above formula:

7.93954312 / (7.93954312+ 0.06045688) = 0.99244289
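
The same numbers can be reproduced without scikit-learn, which makes the link to the eigenvalues explicit. This is a minimal NumPy sketch; the variable names are chosen here for illustration, and it assumes the sample covariance with the n - 1 divisor, which is what np.cov uses by default.

```python
import numpy as np

X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])

cov = np.cov(X.T)                            # sample covariance (divides by n - 1)
eigenvalues = np.linalg.eigvalsh(cov)[::-1]  # eigvalsh returns ascending order; reverse it
ratio = eigenvalues / eigenvalues.sum()      # both components kept, so the sum is the total variance

print(eigenvalues)
print(ratio)
```

The printed values should match the explained_variance_ and explained_variance_ratio_ arrays shown in the example above.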

From the documentation:

explained_variance_ : array, shape (n_components,) The amount of variance explained by each of the selected components.

Equal to n_components largest eigenvalues of the covariance matrix of X.

New in version 0.18.

explained_variance_ratio_ : array, shape (n_components,) Percentage of variance explained by each of the selected components.

If n_components is not set then all components are stored and the sum of the ratios is equal to 1.0.

It's just a normalization that shows how important each principal component is. When all components are kept, you can say: explained_variance_ratio_ = explained_variance_/np.sum(explained_variance_)
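
A quick way to see when this normalization matches scikit-learn is to keep fewer components than features. The sketch below assumes scikit-learn is installed and reuses the toy data from the earlier example; naive_ratio, manual_ratio and total_variance are names introduced here for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])

pca_1c = PCA(n_components=1).fit(X)  # keep only the first of the two components

# Dividing by the sum over the KEPT components only: trivially 1.0 here.
naive_ratio = pca_1c.explained_variance_ / np.sum(pca_1c.explained_variance_)

# Dividing by the total variance of the original features instead.
total_variance = np.var(X, axis=0, ddof=1).sum()
manual_ratio = pca_1c.explained_variance_ / total_variance

print(naive_ratio)
print(manual_ratio)
print(pca_1c.explained_variance_ratio_)
```

The second and third printed arrays agree, while the first one differs, so the denominator that reproduces explained_variance_ratio_ is the total variance of the data rather than the sum over the retained components.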
