
Python PCA sklearn


I'm trying to apply PCA dimensionality reduction to a dataset that is 684 x 1800 (observations x features). I want to reduce the number of features. When I perform the PCA, it tells me that to explain 100% of the variance there should be 684 features, so my data would become 684 x 684. Isn't that strange? I mean, exactly the same number...

Is there any explanation, or am I applying PCA wrongly?

I know that 684 components are needed to explain the whole variance because I plotted the cumulative sum of explained_variance_ratio_ and it reaches 1 at 684 components, and also because of the code below.

My code is basically:

from sklearn.decomposition import PCA

# Ask for (practically) 100% of the explained variance
pca = PCA(0.99999999999)
pca.fit(data_rescaled)
reduced = pca.transform(data_rescaled)
print(reduced.shape)      # (684, 684)
print(pca.n_components_)  # 684

Of course, I don't need to keep the whole variance; 95% would also be acceptable. Is it just a wonderful serendipity?

Thank you so much.

You are using PCA correctly, and this is expected behavior. The explanation is connected with the underlying maths behind PCA, and it is certainly not a coincidence that 100% of the variance is explained with 684 components, which is exactly the number of observations.

There is a theorem in linear algebra which says that if you have a matrix A of dimensions (n, m), then rank(A) <= min(n, m). In your case, the rank of your data matrix is at most 684, the number of observations. Why is this relevant? Because it tells you that, essentially, you could rewrite your data in such a way that at most 684 of your features would be linearly independent, meaning all the remaining features would be linear combinations of the others. In this new space, you could therefore keep all the information about your sample with no more than 684 features. This is also what PCA does.
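As a quick check of this property (a minimal sketch, assuming numpy and scikit-learn are available), fitting PCA on a random matrix with the same 684 x 1800 shape yields at most 684 components whose explained-variance ratios sum to 1:

import numpy as np
from sklearn.decomposition import PCA

# Random data with the same shape as in the question: 684 observations, 1800 features
rng = np.random.default_rng(0)
X = rng.normal(size=(684, 1800))

# Fit PCA without limiting the number of components
pca = PCA()
pca.fit(X)

# The number of components is bounded by min(n_samples, n_features) = 684
print(len(pca.explained_variance_ratio_))            # 684
print(np.cumsum(pca.explained_variance_ratio_)[-1])  # ~1.0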

To sum it up, what you observed is just a mathematical property of the PCA decomposition.
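If, as mentioned in the question, keeping about 95% of the variance is enough, a sketch using scikit-learn's float form of n_components would look like this (data_rescaled stands in for the original 684 x 1800 array):

from sklearn.decomposition import PCA

# Keep the smallest number of components that explains at least 95% of the variance
pca = PCA(n_components=0.95)
reduced = pca.fit_transform(data_rescaled)

print(reduced.shape)      # (684, k) with k <= 684
print(pca.n_components_)  # k, usually far fewer than 684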

