PCA analysis considering the N least relevant components
I am trying to learn the basics of PCA in Python using scikit-learn (in particular sklearn.decomposition and sklearn.preprocessing). The goal is to import data from images into a matrix X (each row is a sample, each column is a feature), standardize X, use PCA to extract principal components (the 2 most important, the 6 most important, ..., the 6 least important), project X onto these principal components, reverse the transformation, and plot the result to see how it differs from the original image(s).
Now let's say that I do not want to consider the 2, 3, 4, ... most important principal components but the N least relevant ones, say N = 6.
How should the analysis be done? I can't simply standardize, call PCA().fit_transform, and then revert with inverse_transform() to plot the results.
At the moment I am doing something like this:
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_std = StandardScaler().fit_transform(X) # standardize original data
pca = PCA()
model = pca.fit(X_std) # create model with all components
Xprime = model.components_[range(dim-6, dim, 1),:] # get last 6 PC
And then I stop, because I know I should call transform() but I do not understand how to do it. I tried several times without success.
Can someone tell me whether the previous steps are correct and point me in the right direction?
Thank you very much.
EDIT: I have now adapted the solution suggested by the first answer to my question:
import copy

model = PCA().fit(X_std)
model2pc = copy.deepcopy(model) # a plain assignment would only alias `model`
model2pc.components_[2:img_count, :] = 0 # keep only the first 2 components
Xp_2pc = model2pc.transform(X_std)
Xr_2pc = model2pc.inverse_transform(Xp_2pc)
And then I do the same for 6 PCs, 60 PCs, and the last 6 PCs. What I have noticed is that this is very time consuming. I would like to get a model that directly extracts the principal components I need (without zeroing out the others) and then call transform() and inverse_transform() on that model.
If you want to ignore all but the last 6 principal components, you can just zero out the ones you don't want to keep.
N = 6
X_std = StandardScaler().fit_transform(X)
pca = PCA()
model = pca.fit(X_std) # create model with all components
model.components_[:-N] = 0
Then, to remove all but the last N components from the data, just do a forward and inverse transform of the data:
Xprime = model.inverse_transform(model.transform(X_std))
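If you would rather not modify the fitted model at all, the same projection can be written directly with NumPy. The following is only a sketch: the helper name reconstruct_from_components and the random stand-in data are made up for illustration, and it assumes the PCA was fitted with the default whiten=False. The model is fitted once and then reused for every subset of components.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler


def reconstruct_from_components(pca, X_std, rows):
    """Project X_std onto the selected principal axes and map back.

    `rows` is any NumPy index into pca.components_ (a slice or index list).
    Hypothetical helper; assumes the default whiten=False.
    """
    W = pca.components_[rows]            # (k, n_features) selected axes
    scores = (X_std - pca.mean_) @ W.T   # coordinates on those axes only
    return scores @ W + pca.mean_        # back in the standardized feature space


rng = np.random.default_rng(0)
X = rng.random((20, 8))                  # random stand-in for the image matrix
scaler = StandardScaler().fit(X)
X_std = scaler.transform(X)
pca = PCA().fit(X_std)                   # fit once, reuse for every subset

X_first2 = reconstruct_from_components(pca, X_std, slice(0, 2))
X_last6 = reconstruct_from_components(pca, X_std, slice(-6, None))

# Undo the standardization to get back to the original pixel scale for plotting.
X_last6_pixels = scaler.inverse_transform(X_last6)
```

Because nothing is zeroed out, the same fitted pca can be used for the first 2, the first 6, or the last 6 components without refitting or copying.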
Here is an example:
>>> import numpy as np
>>> from sklearn.decomposition import PCA
>>> X = np.random.rand(18).reshape(6, 3)
>>> model = PCA().fit(X)
A round-trip transform should give back the original data:
>>> X
array([[0.16594796, 0.02366958, 0.8403745 ],
       [0.25219425, 0.22879029, 0.07950927],
       [0.69636084, 0.4410933 , 0.97431828],
       [0.50121079, 0.44835563, 0.95236146],
       [0.6793044 , 0.53847562, 0.27882302],
       [0.32886931, 0.0643043 , 0.10597973]])
>>> model.inverse_transform(model.transform(X))
array([[0.16594796, 0.02366958, 0.8403745 ],
       [0.25219425, 0.22879029, 0.07950927],
       [0.69636084, 0.4410933 , 0.97431828],
       [0.50121079, 0.44835563, 0.95236146],
       [0.6793044 , 0.53847562, 0.27882302],
       [0.32886931, 0.0643043 , 0.10597973]])
Now zero out the first principal component:
>>> model.components_
array([[ 0.22969899,  0.21209762,  0.94986998],
       [-0.67830467, -0.66500728,  0.31251894],
       [ 0.69795497, -0.71608653, -0.0088847 ]])
>>> model.components_[:-2] = 0
>>> model.components_
array([[ 0.        ,  0.        ,  0.        ],
       [-0.67830467, -0.66500728,  0.31251894],
       [ 0.69795497, -0.71608653, -0.0088847 ]])
The round-trip transform now gives a different result since we've removed the first principal component (which contains the greatest amount of variance):
>>> model.inverse_transform(model.transform(X))
array([[ 0.12742811, -0.01189858,  0.68108405],
       [ 0.36513945,  0.33308073,  0.54656949],
       [ 0.58029482,  0.33392119,  0.49435263],
       [ 0.39987803,  0.35478779,  0.53332196],
       [ 0.71114004,  0.56787176,  0.41047233],
       [ 0.44000711,  0.16692583,  0.56556581]])
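To see how much variance the remaining components actually carry, you can look at explained_variance_ratio_ on the fitted model. The sketch below uses its own random data (so the numbers will not match the session above) and rebuilds the data from the last two components with plain NumPy instead of zeroing rows; the variable names are only illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.random((6, 3))                   # same shape as the example, different values
pca = PCA().fit(X)

ratios = pca.explained_variance_ratio_   # sorted from largest to smallest
share_last2 = ratios[-2:].sum()          # variance share of the last two components

# Rebuild the data from the last two components only (the model is not modified).
W = pca.components_[-2:]
X_last2 = (X - pca.mean_) @ W.T @ W + pca.mean_

# Total variance of the reconstruction relative to the original data.
kept = X_last2.var(axis=0, ddof=1).sum() / X.var(axis=0, ddof=1).sum()
print(ratios, share_last2, kept)
```

The fraction kept should agree with share_last2, which is one way to check that dropping the first component removed exactly its share of the variance.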