
How to determine the value of n_components in PCA

Hello, please help me. I have a normalized file that consists of 21 numeric columns.

I apply PCA to this file as follows:

from sklearn import decomposition

# pca_matrix holds the normalized data, shape (n_samples, 21)
pca = decomposition.PCA(n_components=21)
pca_output = pca.fit_transform(pca_matrix)
pca_inverse = pca.inverse_transform(pca_output)

As far as I understand, the value I assign to n_components here is equal to the number of columns. What I do not understand is how to choose the value of n_components.

It is a hyperparameter and finding its optimal value depends on what you want to do with your data. Let me describe 3 possible uses:

  • Visualization: 2 or 3 components are probably the most sensible options :)
  • Compression: Here the goal is simply to decrease the number of features without losing too much information. You can fit all components (n_components=None), then inspect the attribute explained_variance_ratio_ and decide how many components you are willing to drop. Or you can set n_components='mle' and let the data decide for you.
  • Preprocessing: Here the dimensionality reduction is the first step of some pipeline (preceding regression/classification). As opposed to compression, you want to use the transformed features as input to a supervised learning algorithm. I would recommend finding the optimal n_components through a GridSearchCV over both the PCA's n_components and the predictive model's hyperparameters.
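
A minimal sketch of the compression and preprocessing options, under some assumptions: the data is a synthetic stand-in created with make_classification (the real 21-column file is not shown), the 95% variance target is an arbitrary example threshold, and LogisticRegression with the grid values below is only a placeholder for whatever model you actually use.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the normalized 21-column file (assumption).
X, y = make_classification(n_samples=300, n_features=21, n_informative=6,
                           n_redundant=0, random_state=0)
X = StandardScaler().fit_transform(X)

# Compression: fit all components, then keep the fewest that reach the target.
pca_full = PCA(n_components=None).fit(X)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)
target = 0.95  # example threshold, not a recommendation
n_keep = int(np.searchsorted(cumulative, target) + 1)
print("components needed for", target, "of the variance:", n_keep)

# Alternative: let PCA estimate the dimensionality itself ('mle').
pca_mle = PCA(n_components="mle", svd_solver="full").fit(X)
print("components chosen by 'mle':", pca_mle.n_components_)

# Preprocessing: tune n_components together with the model's hyperparameters.
pipe = Pipeline([
    ("pca", PCA()),
    ("clf", LogisticRegression(max_iter=1000)),  # placeholder model
])
param_grid = {
    "pca__n_components": [2, 5, 10, 15, 21],
    "clf__C": [0.1, 1.0, 10.0],
}
search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X, y)
best_n = search.best_params_["pca__n_components"]
print("best parameters:", search.best_params_)
```

Note that n_components='mle' requires at least as many samples as features. In the grid search, the PCA step is refit inside every cross-validation fold, so the chosen n_components reflects predictive performance rather than explained variance alone.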

