PCA with SKLearn and Python - computing PCA values with given components / basis vectors
I'm trying to understand what sklearn is doing when running a PCA. Unfortunately I don't have much knowledge of PCA, so my understanding might just be wrong.
Let's take a simple example with the iris dataset:
from sklearn import datasets
from sklearn.decomposition import PCA

iris = datasets.load_iris()
X = iris.data
pca = PCA()
pca.fit(X)
Xfit = pca.transform(X)
Xfit now looks like this:
[[-2.68412563e+00, 3.19397247e-01, -2.79148276e-02, -2.26243707e-03], ...
I thought that to get these projected values I basically just need to take the dot product of the original values and the transposed basis vectors / components. So I assumed that this should give the same result:
np.dot(X, np.transpose(pca.components_))
But unfortunately this is the result:
[[ 2.81823951e+00, 5.64634982e+00, -6.59767544e-01, 3.10892758e-02],..
So my question is: why is there a difference? I assume the one from pca.transform(X) is correct and I'm doing something wrong, but what would I need to do if I only have the components and want to calculate the principal component values myself?
Alright, I've found the issue: I have to mean-center the raw values before applying np.dot. Using pd.DataFrame, which makes mean-centering pretty easy, it looks like this:
import pandas as pd

np.dot(pd.DataFrame(X) - pd.DataFrame(X).mean(), np.transpose(pd.DataFrame(pca.components_)))
and the results are the same as when using the fit function:
[[-2.68412563e+00, 3.19397247e-01, -2.79148276e-02, -2.26243707e-03], ...
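As a sanity check, the same projection can be reproduced without pandas at all: the fitted estimator stores the per-feature means it subtracted in `pca.mean_`, so centering and projecting needs only NumPy. A minimal sketch (assuming the default `PCA()` settings, i.e. no whitening):

```python
import numpy as np
from sklearn import datasets
from sklearn.decomposition import PCA

iris = datasets.load_iris()
X = iris.data

pca = PCA()
pca.fit(X)
Xfit = pca.transform(X)

# pca.mean_ holds the column means subtracted during fit, so
# transform(X) is just (X - mean) projected onto the components:
X_proj = (X - pca.mean_) @ pca.components_.T

print(np.allclose(Xfit, X_proj))  # True
```

Using `pca.mean_` instead of recomputing `X.mean(axis=0)` also gives the correct result when transforming new data that was not part of the fit.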