I'm trying to get the variances from the eigenvectors.

What is the difference between explained_variance_ratio_ and explained_variance_ in PCA?
The example used by @seralouk unfortunately already has only 2 components, so the explanation of pca.explained_variance_ratio_ is incomplete. The denominator should be the total variance of the original set of features (the sum of the variances over all components), and the number of original features can be greater than the number of components used in PCA.
Here is an explanation of this quantity using the iris dataset.
import numpy as np
from sklearn import datasets
from sklearn.decomposition import PCA

iris = datasets.load_iris()
X = iris.data
# y = iris.target

pca_2c_model = PCA(n_components=2)
x_pca_2c = pca_2c_model.fit_transform(X)

print('Explained variance:\n', pca_2c_model.explained_variance_)
print('Explained variance ratio:\n', pca_2c_model.explained_variance_ratio_)
returns
Explained variance:
[4.22824171 0.24267075]
Explained variance ratio:
[0.92461872 0.05306648]
The quantity pca_2c_model.explained_variance_ contains the diagonal elements of the covariance matrix of the two principal components. For principal components, by definition, the covariance matrix is diagonal, since the components are uncorrelated.
var = np.cov(x_pca_2c.T)
explained_var = var.diagonal()
print('Explained variance calculated manually is\n', explained_var)
returns
Explained variance calculated manually is
[4.22824171 0.24267075]
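As a quick sanity check (a sketch of mine, not part of the original answer; the variable names are my own), the off-diagonal entries of that covariance matrix are numerically zero, confirming the components are uncorrelated:

```python
import numpy as np
from sklearn import datasets
from sklearn.decomposition import PCA

# Project iris onto its first two principal components
X = datasets.load_iris().data
x_pca_2c = PCA(n_components=2).fit_transform(X)

# Covariance of the component scores: only the diagonal survives
cov = np.cov(x_pca_2c.T)
off_diag = cov - np.diag(cov.diagonal())
print(np.allclose(off_diag, 0.0))  # prints True
```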
To calculate the ratio, the denominator has to be the variance of the original set of features before PCA (i.e. over all components). So we can just use the trace of the covariance matrix of the full set of features, relying on the fact that the trace of a matrix is invariant under a change of basis.
all_var = np.cov(X.T)
sum_all_var = np.sum(all_var.diagonal())  # same as np.trace(all_var)
explained_var_ratio = explained_var / sum_all_var
print('Explained variance ratio calculated manually is\n', explained_var_ratio)
returns
Explained variance ratio calculated manually is
[0.92461872 0.05306648]
Further,
print(sum(explained_var_ratio))
returns
0.9776852063187955
So, the sum of explained_variance_ratio_ does not add up to 1.0; the small deviation from 1.0 is the variance carried by the remaining components of the original feature space.
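To make that concrete (a hypothetical check, not part of the original answer), refitting with all four iris components brings the sum of the ratios back to 1.0:

```python
import numpy as np
from sklearn import datasets
from sklearn.decomposition import PCA

X = datasets.load_iris().data
pca_full = PCA(n_components=4)  # iris has 4 features, so nothing is discarded
pca_full.fit(X)

# With every component kept, the ratios account for all the variance
print(round(pca_full.explained_variance_ratio_.sum(), 10))  # prints 1.0
```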
The percentage of the explained variance is explained_variance_ratio_.
The variance, i.e. the eigenvalues of the covariance matrix, is explained_variance_.
Formula (valid when all components are kept, as in the example below): explained_variance_ratio_ = explained_variance_ / np.sum(explained_variance_)
Example:
import numpy as np
from sklearn.decomposition import PCA
X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
pca = PCA(n_components=2)
pca.fit(X)
pca.explained_variance_
array([7.93954312, 0.06045688]) # the actual eigenvalues (variance)
pca.explained_variance_ratio_ # the percentage of the variance
array([0.99244289, 0.00755711])
Also, based on the formula above:
7.93954312 / (7.93954312 + 0.06045688) = 0.99244289
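That relationship can be checked programmatically (a small sketch of my own; it holds here because n_components equals the number of features, so no variance is discarded):

```python
import numpy as np
from sklearn.decomposition import PCA

X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
pca = PCA(n_components=2).fit(X)

# Normalize the eigenvalues by their sum and compare with the stored ratios
manual_ratio = pca.explained_variance_ / np.sum(pca.explained_variance_)
print(np.allclose(manual_ratio, pca.explained_variance_ratio_))  # prints True
```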
From the documentation:
explained_variance_ : array, shape (n_components,) The amount of variance explained by each of the selected components.
Equal to n_components largest eigenvalues of the covariance matrix of X.
New in version 0.18.
explained_variance_ratio_ : array, shape (n_components,) Percentage of variance explained by each of the selected components.
If n_components is not set then all components are stored and the sum of the ratios is equal to 1.0.
It's just a normalization to see how important each principal component is. You can say: explained_variance_ratio_ = explained_variance_ / np.sum(explained_variance_) (when all components are retained).