
sklearn.pca() and n_components, linear algebra dilemma

Say I want to find the optimal number of components when doing PCA in Python 3 with sklearn.

I'd do that by iterating over some values of n_components and computing the total absolute prediction error for each value when validating the model.

My question would be: what's the difference between passing an n_components parameter to PCA and working from there, as opposed to not passing it (so it defaults to the maximum number of components) and only using the first i components of the result?

My linear algebra is a bit shaky, but if I recall correctly the singular vectors should be the same in both situations, ordered by descending explained variance, and each should provide the same amount of explained variance either way.

Sorry for not providing any code or writing up both scenarios to test them myself, but I'm on a long train ride and my laptop battery died mid-process. Now I'm stuck with the curiosity.

Your recollection of PCA is correct: for each component included, the singular values (and vectors) will be the same whether or not you cap n_components.
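You can check this directly once you're off the train. A minimal sketch, using toy random data (any numeric matrix would do):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
X = rng.rand(200, 10)

# Fit once with all components, once capped at 3.
full = PCA(svd_solver="full").fit(X)
partial = PCA(n_components=3, svd_solver="full").fit(X)

# The first 3 singular values and component vectors coincide.
print(np.allclose(full.singular_values_[:3], partial.singular_values_))
print(np.allclose(full.components_[:3], partial.components_))
```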

Consider the following thought experiment. With a small number of features, fitting a full PCA and iterating to find the n_components value that yields the optimal transformation for your estimator/classifier is trivial. Now suppose your data has 1,000 features. 10,000? 100,000? 1,000,000? See where I'm going? A full PCA of such data would be both wasteful and computationally expensive, and that's before iterating through to find your optimal transformation.

One common practice is to set PCA to explain 90% of the variance (n_components=0.9), which helps avoid this situation while still providing valuable output.
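A quick sketch of that option, again on toy random data: passing a float in (0, 1) as n_components tells sklearn to keep the fewest components needed to reach that fraction of explained variance.

```python
import numpy as np
from sklearn.decomposition import PCA

X = np.random.RandomState(0).rand(200, 50)

# Keep as many components as needed to explain 90% of the variance.
pca = PCA(n_components=0.9, svd_solver="full")
pca.fit(X)

print(pca.n_components_)                    # number of components kept
print(pca.explained_variance_ratio_.sum())  # cumulative variance >= 0.9
```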

Another option would be to use GridSearchCV and pass a list of values for n_components that you would like to test. Note that this approach will also require you to use Pipeline to construct an object that fits both your PCA and your estimator/classifier on the training data for each point in the grid, as in the sketch below.
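A sketch of the pipeline-plus-grid-search approach, using a synthetic classification problem and logistic regression as a stand-in estimator (your own estimator and candidate values would go in their place):

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

X, y = make_classification(n_samples=500, n_features=30, random_state=0)

pipe = Pipeline([
    ("pca", PCA()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Each candidate n_components is refit on the training folds only,
# so the PCA never sees the validation data.
param_grid = {"pca__n_components": [2, 5, 10, 20]}

search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X, y)
print(search.best_params_)
```

Because the PCA sits inside the Pipeline, cross-validation scores each n_components fairly: the transformation is re-learned from scratch for every fold and grid point.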

As an aside, I will point out that PCA is not always the best choice for dimensionality reduction, as there are situations where low-variance principal components are of high predictive value. There are some existing CrossValidated questions that cover this quite well: "Examples of PCA where PCs with low variance are 'useful'" and "Low variance components in PCA, are they really just noise? Is there any way to test for it?"
