
sklearn.pca() and n_components, linear algebra dilemma

Say I want to find the optimal number of components when doing PCA in Python3 with sklearn.

I'd do that by iterating over some n_components values and computing total absolute prediction error for each value when validating the model.

My question would be: what's the difference between passing an n_components parameter to PCA and working from there, as opposed to not passing it (so it defaults to the implicit maximum) and only using the first i components of the result?

My linear algebra is a bit shaky, but if I recall correctly the singular vectors should be the same in both situations, ordered by decreasing explained variance, and provide the same amount of explained variance.

Sorry for not providing any code nor writing up both scenarios to test them myself, but I'm on a long train ride and my laptop battery ran out mid-process. Now I'm stuck with the curiosity.

Your recollection of PCA is correct. The singular values will be the same for each component included.
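This is easy to verify (a minimal sketch on synthetic data; the shapes and random seed are arbitrary illustrations, assuming scikit-learn and NumPy are available):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
X = rng.rand(100, 10)

full = PCA().fit(X)                   # implicit n_components = min(n_samples, n_features)
partial = PCA(n_components=3).fit(X)

# The first three singular values and explained-variance ratios agree...
print(np.allclose(full.singular_values_[:3], partial.singular_values_))   # True
print(np.allclose(full.explained_variance_ratio_[:3],
                  partial.explained_variance_ratio_))                     # True

# ...and so do the leading components themselves (compared in absolute value,
# since a principal axis is in general only defined up to sign).
print(np.allclose(np.abs(full.components_[:3]), np.abs(partial.components_)))  # True
```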

Consider the following thought experiment. You have a small number of features. Fitting a full PCA and iterating to find the value for n_components that creates the optimal transformation for your estimator/classifier is trivial. You now have 1,000 features in your data. 10,000? 100,000? 1,000,000? See where I am going? A full PCA of such data would be both frivolous and computationally expensive. And that is before iterating through to find your optimal transformation.

One common practice is to set PCA to explain 90% of the variance (n_components=0.9), which helps avoid this situation while still providing valuable output.
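Concretely (a small sketch on made-up data): when n_components is a float between 0 and 1, scikit-learn keeps the smallest number of components whose cumulative explained variance reaches that fraction:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
X = rng.rand(200, 50)

pca = PCA(n_components=0.9).fit(X)
print(pca.n_components_)                    # number of components actually kept
print(pca.explained_variance_ratio_.sum())  # cumulative explained variance, >= 0.9
```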

Another option would be to use GridSearchCV and pass a list of values for n_components that you would like to test. Note that this approach will also require you to use Pipeline to construct an object that will fit both your PCA and your estimator/classifier on your training data for a given point in the grid.
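A sketch of what that could look like (the digits dataset, logistic regression, and the grid values here are illustrative stand-ins, not prescribed by the answer):

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

X, y = load_digits(return_X_y=True)

pipe = Pipeline([
    ("pca", PCA()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Each grid point refits the PCA on the training folds only, so the
# transformation never leaks information from the validation folds.
param_grid = {"pca__n_components": [5, 15, 30, 45]}
search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X, y)
print(search.best_params_)
```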

As an aside, I will point out that PCA is not always the best choice when it comes to dimensionality reduction, as there are situations where low-variance principal components are of high predictive value. There are some existing CrossValidated questions that cover this quite well: "Examples of PCA where PCs with low variance are 'useful'" and "Low variance components in PCA, are they really just noise? Is there any way to test for it?"
