Not sure about PCA in sklearn

I need to do some PCA using sklearn and I want to make sure I do it the right way. Here is my code:

from sklearn.decomposition import PCA
pca = PCA(n_components=5)
pca_result = pca.fit_transform(data)

# Note: singular_values_ holds singular values, not eigenvalues
eigenvalues = pca.singular_values_
print(eigenvalues)

x = pca_result[:,0]
y = pca_result[:,1]

The data looks like this:

[[ -6.4186, -14.3534,  18.1296,  -2.8110,  14.0298],
 [ -7.1220, -17.1501,  21.2807,  -3.5025,  16.4489],
 [ -8.4652, -18.9316,  25.0303,  -4.1773,  18.5066],
 ...,
 [ -4.7054,   6.1389,   3.5146,  -0.1036,  -0.7332],
 [ -5.8533,   9.9087,   4.1178,  -0.5211,  -2.2415],
 [ -6.2969,  13.8951,   3.4365,  -0.9207,  -4.2024]]

These are the eigenvalues: [1005.2761, 853.5491, 65.058365, 49.994457, 10.277865]. I am not totally sure about the last two lines. I want to plot the data projected onto the 2D space that seems to account for most of the variation in the data (basically make a 2D plot of the 5D data, since it seems to live on a 2D manifold). Am I doing it right? Thank you!
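First, a note on the values you printed: pca.singular_values_ returns the singular values of the centered data matrix, not the eigenvalues of its covariance matrix. The two are related by lambda_i = s_i**2 / (n - 1), where n is the number of samples. A minimal sketch of that relationship, assuming data is the array from the question:

import numpy as np
from sklearn.decomposition import PCA

pca = PCA(n_components=5)
pca.fit(data)

n_samples = len(data)
# Eigenvalues of the covariance matrix, recovered from the singular values:
eigvals_from_singular = pca.singular_values_ ** 2 / (n_samples - 1)
print(eigvals_from_singular)

# sklearn exposes the same quantities directly; this should match the line above
print(pca.explained_variance_)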

Principal component analysis (PCA) is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables (entities each of which takes on various numerical values) into a set of values of linearly uncorrelated variables called principal components.

Such dimensionality reduction can be a very useful step for visualising and processing high-dimensional datasets, while still retaining as much of the variance in the dataset as possible. For example, selecting L = 2 and keeping only the first two principal components finds the two-dimensional plane through the high-dimensional dataset in which the data is most spread out, so if the data contains clusters these too may be most spread out, and therefore most visible to be plotted out in a two-dimensional diagram; whereas if two directions through the data (or two of the original variables) are chosen at random, the clusters may be much less spread apart from each other, and may in fact be much more likely to substantially overlay each other, making them indistinguishable.

https://en.wikipedia.org/wiki/Principal_component_analysis
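To confirm that two components really do capture most of the variance here (a minimal sketch, again assuming data is the array from the question), you can inspect explained_variance_ratio_:

from sklearn.decomposition import PCA

pca = PCA()  # keep all components
pca.fit(data)

# Fraction of the total variance captured by each principal component
print(pca.explained_variance_ratio_)

# If the first two entries sum to nearly 1, the data is essentially 2D
print(pca.explained_variance_ratio_[:2].sum())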

So you need to run:

from sklearn.decomposition import PCA
pca = PCA(n_components=2)             # keep only the first two components
pca_result = pca.fit_transform(data)

x = pca_result[:, 0]                  # projection onto the first component
y = pca_result[:, 1]                  # projection onto the second component

Then you have a two-dimensional projection of your data, which you can plot directly.
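For the plot itself, a minimal sketch assuming matplotlib is available:

import matplotlib.pyplot as plt

# Scatter plot of the data in the plane spanned by the first two components
plt.scatter(x, y, s=10)
plt.xlabel('PC 1')
plt.ylabel('PC 2')
plt.show()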
