
Sklearn PCA explained variance and explained variance ratio difference

I'm trying to get the variances from the eigenvectors.

What is the difference between explained_variance_ratio_ and explained_variance_ in PCA?

The example used by @seralouk unfortunately has only 2 components, so the explanation for pca.explained_variance_ratio_ is incomplete.

The denominator should be the total variance of the original set of features before PCA was applied, i.e. the sum of the variances over all original components, where the number of original components can be greater than the number of components kept by PCA.

Here is an explanation of this quantity using the iris dataset.

import numpy as np
from sklearn import datasets
from sklearn.decomposition import PCA

iris=datasets.load_iris()

X = iris.data     
#y = iris.target   

pca_2c_model=PCA(n_components=2)
x_pca_2c=pca_2c_model.fit_transform(X) 

print('Explained variance:\n', pca_2c_model.explained_variance_)

print('Explained variance ratio:\n', pca_2c_model.explained_variance_ratio_)

returns

Explained variance: 
 [4.22824171 0.24267075]
Explained variance ratio:
 [0.92461872 0.05306648]

The quantity pca_2c_model.explained_variance_ contains the diagonal elements of the covariance matrix of the two principal components. For principal components, by definition, the covariance matrix is diagonal.

var=np.cov(x_pca_2c.T)
explained_var=var.diagonal()
print('Explained variance calculated manually is\n',explained_var)

returns

Explained variance calculated manually is
 [4.22824171 0.24267075]

To calculate the ratio, the denominator has to be computed over the original set of features before PCA, i.e. over all components. So we can simply use the trace of the covariance matrix of the full feature set, relying on the invariance of the trace under an orthogonal change of basis.

all_var=np.cov(X.T)
sum_all_var=np.sum(all_var.diagonal()) # same as np.trace(all_var)
explained_var_ratio=explained_var/sum_all_var
print('Explained variance ratio calculated manually is\n',explained_var_ratio)

returns

Explained variance ratio calculated manually is
 [0.92461872 0.05306648]

Further,

print(sum(explained_var_ratio))

returns

0.9776852063187955

So, the sum of explained_variance_ratio_ does not add up to 1.0, implying that the small deviation from 1.0 is the variance contained in the remaining components of the original feature space.
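To confirm this, we can refit PCA keeping all components (when n_components is not set, PCA keeps all 4 components of the iris data); the first two ratios match the values above, and the full set now sums to 1.0. A quick sketch:

```python
import numpy as np
from sklearn import datasets
from sklearn.decomposition import PCA

X = datasets.load_iris().data

# With no n_components argument, all 4 components are kept
pca_full = PCA().fit(X)

print('All explained variance ratios:\n', pca_full.explained_variance_ratio_)
print('Sum:', pca_full.explained_variance_ratio_.sum())  # adds up to 1.0
```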

The percentage of the explained variance is:

explained_variance_ratio_

The variance, i.e. the eigenvalues of the covariance matrix, is:

explained_variance_

Formula: explained_variance_ratio_ = explained_variance_ / np.sum(explained_variance_) (this holds exactly only when all components are kept, as in the example below).

Example:

import numpy as np
from sklearn.decomposition import PCA
X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
pca = PCA(n_components=2)
pca.fit(X)  
pca.explained_variance_
array([7.93954312, 0.06045688]) # the actual eigenvalues (variance)

pca.explained_variance_ratio_ # the percentage of the variance
array([0.99244289, 0.00755711])

Also based on the above formula:

7.93954312 / (7.93954312+ 0.06045688) = 0.99244289
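This arithmetic can be checked numerically; note it works here only because both components of the 2-D data are kept. A small sketch:

```python
import numpy as np
from sklearn.decomposition import PCA

X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
pca = PCA(n_components=2).fit(X)

# Since all components are retained, normalizing the eigenvalues
# reproduces explained_variance_ratio_ exactly
ratio = pca.explained_variance_ / np.sum(pca.explained_variance_)
print(ratio)  # matches pca.explained_variance_ratio_
```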

From the documentation:

explained_variance_ : array, shape (n_components,) The amount of variance explained by each of the selected components.

Equal to n_components largest eigenvalues of the covariance matrix of X.

New in version 0.18.

explained_variance_ratio_ : array, shape (n_components,) Percentage of variance explained by each of the selected components.

If n_components is not set then all components are stored and the sum of the ratios is equal to 1.0.

It's just a normalization showing how important each principal component is. When all components are kept, you can say: explained_variance_ratio_ = explained_variance_/np.sum(explained_variance_)
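Keep in mind this identity holds only when all components are retained; with n_components smaller than the number of features, normalizing over just the kept eigenvalues overstates the ratios. A sketch using the iris data from the answer above:

```python
import numpy as np
from sklearn import datasets
from sklearn.decomposition import PCA

X = datasets.load_iris().data
pca = PCA(n_components=2).fit(X)

# Normalizing over only the 2 kept eigenvalues...
naive_ratio = pca.explained_variance_ / np.sum(pca.explained_variance_)

# ...differs from the true ratio, whose denominator is the total
# variance of all 4 original features
print(naive_ratio)
print(pca.explained_variance_ratio_)
```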
