
Why does Sklearn PCA need more samples than new features (n_components)?

When using the Sklearn PCA algorithm like this:

import numpy as np
from sklearn.decomposition import PCA

x_orig = np.random.choice([0, 1], (4, 25), replace=True)
pca = PCA(n_components=15)
pca.fit_transform(x_orig).shape

I get the output:

(4, 4)

I expected (wanted) it to be:

(4,15)

I get why it's happening. In the documentation of sklearn (here) it says (assuming their '==' is an assignment operator):

n_components == min(n_samples, n_features)

But why are they doing this? Also, how can I convert an input with shape [1,25] to [1,10] directly (without stacking dummy arrays)?

Each principal component is the projection of the data onto an eigenvector of the data covariance matrix. If you have fewer samples n than features, the covariance matrix has only n non-zero eigenvalues. Thus, there are only n eigenvectors/components that make sense.
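This rank argument is easy to check numerically. The sketch below (not part of the original answer) counts the non-zero eigenvalues of the covariance matrix for the 4×25 example from the question:

import numpy as np

x_orig = np.random.choice([0, 1], (4, 25), replace=True)
cov = np.cov(x_orig, rowvar=False)      # 25 x 25 sample covariance matrix
eigenvalues = np.linalg.eigvalsh(cov)   # eigenvalues of the symmetric matrix
# With 4 samples the covariance matrix has rank at most 4, so at most 4
# eigenvalues are non-zero (at most 3, in fact, because np.cov mean-centers).
print(np.sum(eigenvalues > 1e-10))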

In principle it would be possible to compute more components than samples, but the superfluous components would be useless noise.

Scikit-learn raises an error instead of silently doing anything. This prevents users from shooting themselves in the foot. Having fewer samples than features can indicate a problem with the data, or a misconception about the methods involved.
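As an illustration (this snippet is mine, not the answer's, and assumes svd_solver='full', which validates n_components explicitly), asking for more components than min(n_samples, n_features) produces a ValueError:

import numpy as np
from sklearn.decomposition import PCA

x_orig = np.random.choice([0, 1], (4, 25), replace=True)
try:
    # 15 components requested, but only min(4, 25) = 4 are possible
    PCA(n_components=15, svd_solver='full').fit_transform(x_orig)
except ValueError as e:
    print(e)  # n_components=15 must be between 0 and min(n_samples, n_features)=4 ...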

