
sklearn.pca() and n_components, linear algebra dilemma

Say I want to find the optimal number of components when doing PCA in Python 3 with sklearn.

I'd do that by iterating over some values of n_components and computing the total absolute prediction error for each value when validating the model.

My question would be: what's the difference between passing an n_components parameter to PCA and working from there, as opposed to not passing it (so it defaults to the maximum number of components) and only using the first i components of the result?

My linear algebra is a bit shaky, but if I recall correctly the singular vectors should be the same in both situations, ordered by descending explained variance, and each should provide the same amount of explained variance either way.

Sorry for not providing any code or writing up both scenarios to test them myself, but I'm on a long train ride and my laptop battery died mid-process. Now I'm stuck with the curiosity.

Your recollection of PCA is correct: for each component included, the singular values (and vectors) will be the same whether or not you cap n_components.
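You can check this directly once you're off the train. A minimal sketch, using toy random data (any numeric matrix would do):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
X = rng.rand(200, 10)

# Fit once with all components, once capped at 3.
full = PCA(svd_solver="full").fit(X)
partial = PCA(n_components=3, svd_solver="full").fit(X)

# The first 3 singular values and component vectors coincide.
print(np.allclose(full.singular_values_[:3], partial.singular_values_))
print(np.allclose(full.components_[:3], partial.components_))
```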

Consider the following thought experiment. With a small number of features, fitting a full PCA and iterating to find the n_components value that yields the optimal transformation for your estimator/classifier is trivial. Now suppose your data has 1,000 features. 10,000? 100,000? 1,000,000? See where I'm going? A full PCA of such data would be both wasteful and computationally expensive, and that's before iterating through to find your optimal transformation.

One common practice is to set PCA to explain 90% of the variance (n_components=0.9), which helps avoid this situation while still providing valuable output.
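A quick sketch of that option, again on toy random data: passing a float in (0, 1) as n_components tells sklearn to keep the fewest components needed to reach that fraction of explained variance.

```python
import numpy as np
from sklearn.decomposition import PCA

X = np.random.RandomState(0).rand(200, 50)

# Keep as many components as needed to explain 90% of the variance.
pca = PCA(n_components=0.9, svd_solver="full")
pca.fit(X)

print(pca.n_components_)                    # number of components kept
print(pca.explained_variance_ratio_.sum())  # cumulative variance >= 0.9
```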

Another option would be to use GridSearchCV and pass a list of values for n_components that you would like to test. Note that this approach will also require you to use Pipeline to construct an object that fits both your PCA and your estimator/classifier on the training data for each point in the grid, as in the sketch below.
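A sketch of the pipeline-plus-grid-search approach, using a synthetic classification problem and logistic regression as a stand-in estimator (your own estimator and candidate values would go in their place):

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

X, y = make_classification(n_samples=500, n_features=30, random_state=0)

pipe = Pipeline([
    ("pca", PCA()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Each candidate n_components is refit on the training folds only,
# so the PCA never sees the validation data.
param_grid = {"pca__n_components": [2, 5, 10, 20]}

search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X, y)
print(search.best_params_)
```

Because the PCA sits inside the Pipeline, cross-validation scores each n_components fairly: the transformation is re-learned from scratch for every fold and grid point.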

As an aside, I will point out that PCA is not always the best choice for dimensionality reduction, as there are situations where low-variance principal components are of high predictive value. There are some existing CrossValidated questions that cover this quite well: "Examples of PCA where PCs with low variance are 'useful'" and "Low variance components in PCA, are they really just noise? Is there any way to test for it?"
