
In sklearn.decomposition.PCA, why are components_ negative?

I'm trying to follow along with Abdi & Williams - Principal Component Analysis (2010) and build principal components through SVD, using numpy.linalg.svd.

When I display the components_ attribute from a fitted PCA with sklearn, they're of the exact same magnitude as the ones that I've manually computed, but some (not all) are of opposite sign. What's causing this?

Update: my (partial) answer below contains some additional info.

Take the following example data:

from pandas_datareader.data import DataReader as dr
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale

# sample data - shape (20, 3), each column standardized to N~(0,1)
rates = scale(dr(['DGS5', 'DGS10', 'DGS30'], 'fred', 
           start='2017-01-01', end='2017-02-01').pct_change().dropna())

# with sklearn PCA:
pca = PCA().fit(rates)
print(pca.components_)
[[-0.58365629 -0.58614003 -0.56194768]
 [-0.43328092 -0.36048659  0.82602486]
 [-0.68674084  0.72559581 -0.04356302]]

# compare to the manual method via SVD:
u, s, Vh = np.linalg.svd(np.asmatrix(rates), full_matrices=False)
print(Vh)
[[ 0.58365629  0.58614003  0.56194768]
 [ 0.43328092  0.36048659 -0.82602486]
 [-0.68674084  0.72559581 -0.04356302]]

# odd: some, but not all signs reversed
print(np.isclose(Vh, -1 * pca.components_))
[[ True  True  True]
 [ True  True  True]
 [False False False]]

As you figured out in your answer, the results of a singular value decomposition (SVD) are not unique in terms of the singular vectors. Indeed, if the SVD of X is

X = \sum_{i=1}^{r} s_i u_i v_i^\top

with the s_i ordered in decreasing fashion, then you can see that if you change the sign (i.e., "flip") of, say, both u_1 and v_1, the minus signs cancel, so the formula still holds.

This shows that the SVD is only unique up to a simultaneous change of sign in a pair of left and right singular vectors.
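
A quick numerical check (a minimal sketch on synthetic data, independent of the rates above) makes this concrete: flipping the sign of a matched pair of singular vectors leaves the reconstruction of X unchanged.

import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 3))

u, s, vh = np.linalg.svd(X, full_matrices=False)

# Flip the sign of the first left AND right singular vector together.
u_flip = u.copy()
vh_flip = vh.copy()
u_flip[:, 0] *= -1    # u_1 is the first column of U
vh_flip[0, :] *= -1   # v_1^T is the first row of Vh

# Both factorizations reconstruct the same X.
print(np.allclose(u @ np.diag(s) @ vh, u_flip @ np.diag(s) @ vh_flip))  # True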

Since PCA is just an SVD of X (or an eigenvalue decomposition of X^\top X), there is no guarantee that it returns the same result on the same X every time it is performed. Understandably, the scikit-learn implementation wants to avoid this: it guarantees that the left and right singular vectors returned (stored in U and V) are always the same, by imposing (arbitrarily) that the largest coefficient of u_i in absolute value be positive.
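
As a small illustration of that equivalence (a sketch on synthetic, centered data; nothing here is sklearn-specific), the right singular vectors of X and the eigenvectors of X^\top X agree up to sign:

import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 3))
X = X - X.mean(axis=0)              # PCA assumes centered data

_, _, vh = np.linalg.svd(X, full_matrices=False)

# Eigenvectors of X^T X (eigh returns ascending order, so reverse it).
_, eigvecs = np.linalg.eigh(X.T @ X)
eigvecs = eigvecs[:, ::-1]

# Each eigenvector matches the corresponding row of vh up to a sign.
for i in range(3):
    print(np.allclose(vh[i], eigvecs[:, i]) or np.allclose(vh[i], -eigvecs[:, i]))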

As you can see reading the source: first they compute U and V with linalg.svd(). Then, for each left singular vector u_i (i.e., column of U), if its largest element in absolute value is positive, they don't do anything. Otherwise, they change u_i to -u_i and the corresponding right singular vector, v_i, to -v_i. As explained earlier, this does not change the SVD formula, since the minus signs cancel out. However, the U and V returned after this processing are now guaranteed to always be the same, since the indeterminacy in the sign has been removed.
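
A rough sketch of that flipping logic (just the idea described above, not sklearn's actual svd_flip implementation):

import numpy as np

def flip_signs(u, vh):
    # For each left singular vector (column of u), locate its largest entry
    # in absolute value; if that entry is negative, flip the sign of the
    # column and of the matching row of vh.
    max_abs_rows = np.argmax(np.abs(u), axis=0)
    signs = np.sign(u[max_abs_rows, np.arange(u.shape[1])])
    return u * signs, vh * signs[:, np.newaxis]

u, s, vh = np.linalg.svd(np.random.default_rng(0).standard_normal((20, 3)),
                         full_matrices=False)
u2, vh2 = flip_signs(u, vh)
print(np.allclose(u @ np.diag(s) @ vh, u2 @ np.diag(s) @ vh2))  # True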

With the PCA here in 3 dimensions, you basically find, iteratively: 1) the 1D projection axis that preserves the maximum variance; 2) the maximum-variance-preserving axis perpendicular to the one in 1). The third axis is automatically the one perpendicular to the first two.

The components_ are listed according to the explained variance: the first one explains the most variance, and so on. Note that, by the definition of the PCA operation, while you are trying to find the projection vector in the first step that maximizes the preserved variance, the sign of the vector does not matter. Let M be your data matrix (in your case with shape (20, 3)), and let v1 be the vector that preserves the maximum variance when the data is projected onto it. If you select -v1 instead of v1, you obtain the same variance (you can verify this with the sketch below). Then, when selecting the second vector, let v2 be the one perpendicular to v1 that preserves the maximum variance; again, selecting -v2 instead of v2 preserves the same amount of variance, and v3 can then be selected as either v3 or -v3. The only thing that matters here is that v1, v2, v3 constitute an orthonormal basis for the data M. The signs mostly depend on how the algorithm solves the eigenvector problem underlying the PCA operation; eigenvalue decomposition or SVD solutions may differ in sign.
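
Here is a minimal sketch of that check (synthetic data; v1 is simply taken from a fitted PCA for convenience):

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
M = rng.standard_normal((20, 3))

v1 = PCA().fit(M).components_[0]     # first principal axis

# Projecting onto v1 and onto -v1 preserves exactly the same variance.
print(np.isclose((M @ v1).var(), (M @ -v1).var()))  # True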

After some digging, I've cleared up some, but not all, of my confusion on this. This issue has been covered on stats.stackexchange here. The mathematical answer is that "PCA is a simple mathematical transformation. If you change the signs of the component(s), you do not change the variance that is contained in the first component." However, in this case (with sklearn.PCA), the source of the ambiguity is much more specific: in the source (line 391) for PCA you have:

U, S, V = linalg.svd(X, full_matrices=False)
# flip eigenvectors' sign to enforce deterministic output
U, V = svd_flip(U, V)

components_ = V

svd_flip, in turn, is defined here. But why the signs are being flipped to "ensure a deterministic output," I'm not sure. (U, S, V have already been found at this point...) So while sklearn's implementation is not incorrect, I don't think it's all that intuitive. Anyone in finance who is familiar with the concept of a beta (coefficient) will know that the first principal component is most likely something similar to a broad market index. The problem is, the sklearn implementation will get you strongly negative loadings on that first principal component.
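
If the economically intuitive orientation matters for your use case, one simple post-processing convention (not part of sklearn's API, just one possible choice of sign convention) is to flip any component whose loadings sum to a negative value:

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 3))

pca = PCA().fit(X)
components = pca.components_.copy()

# Flip every component whose loadings sum to a negative value, so that a
# "market index"-like first component ends up with positive loadings.
signs = np.sign(components.sum(axis=1))
signs[signs == 0] = 1
components *= signs[:, np.newaxis]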

My solution is a dumbed-down version that does not implement svd_flip. It's pretty barebones in that it doesn't have sklearn parameters such as svd_solver, but does have a number of methods specifically geared towards this purpose.
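
For reference, a minimal sketch of what such a stripped-down, no-flip PCA might look like (a hypothetical class, not the actual code referenced above):

import numpy as np

class PlainPCA:
    # Barebones PCA via SVD, deliberately without any sign-flipping step.

    def fit(self, X):
        X = np.asarray(X, dtype=float)
        self.mean_ = X.mean(axis=0)
        u, s, vh = np.linalg.svd(X - self.mean_, full_matrices=False)
        self.components_ = vh                        # rows are principal axes
        self.explained_variance_ = s ** 2 / (X.shape[0] - 1)
        return self

    def transform(self, X):
        return (np.asarray(X, dtype=float) - self.mean_) @ self.components_.T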

This is a short note for those who care about the purpose and not the math at all.

Although the sign is opposite for some of the components, that shouldn't be considered a problem. In fact, what we do care about (at least to my understanding) is the axes themselves. The components, ultimately, are vectors that identify these axes after transforming the input data using PCA. Therefore, no matter which direction each component points in, the new axes that our data lie on will be the same.
