
SKLearn PCA explained_variance_ratio_ cumsum gives array of 1

I have a problem with PCA. I read that PCA needs clean numeric values. I started my analysis with a dataset called trainDf with shape (1460, 79).

I did my data cleaning and processing by removing empty values, imputing, and dropping columns, and I got a dataframe transformedData with shape (1458, 69).

Data cleaning steps are (a rough sketch follows the list):

  1. LotFrontage imputed with the mean value
  2. MasVnrArea imputed with 0s (fewer than 10 missing values)
  3. Ordinal encoding for the categorical columns
  4. Electrical imputed with the most frequent value
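
A minimal sketch of those cleaning steps, assuming a pandas DataFrame trainDf with the column names from the question; the author's exact code is not shown, so the details below (fill order, encoder choice) are assumptions:

from sklearn.preprocessing import OrdinalEncoder

transformedData = trainDf.copy()

# Impute LotFrontage with the column mean
transformedData['LotFrontage'] = transformedData['LotFrontage'].fillna(
    transformedData['LotFrontage'].mean())

# Impute MasVnrArea with 0
transformedData['MasVnrArea'] = transformedData['MasVnrArea'].fillna(0)

# Impute Electrical with its most frequent value (done before encoding so the
# encoder never sees a missing category)
transformedData['Electrical'] = transformedData['Electrical'].fillna(
    transformedData['Electrical'].mode()[0])

# Ordinal-encode the categorical columns
cat_cols = transformedData.select_dtypes(include='object').columns
transformedData[cat_cols] = OrdinalEncoder().fit_transform(transformedData[cat_cols])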

I found outliers with IQR and got withoutOutliers with shape (1223, 69).
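
A minimal sketch of the IQR filter, assuming rows are dropped whenever any feature lies outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]; the exact rule used in the question is an assumption:

Q1 = transformedData.quantile(0.25)
Q3 = transformedData.quantile(0.75)
IQR = Q3 - Q1

# Keep only rows where every feature is inside the 1.5*IQR fences
inside = ~((transformedData < (Q1 - 1.5 * IQR)) |
           (transformedData > (Q3 + 1.5 * IQR))).any(axis=1)
withoutOutliers = transformedData[inside]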

After this, I looked at histograms and decided to apply PowerTransformer on some features and StandardScaler on others, and I got normalizedData.
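
A minimal sketch of that normalization step, assuming two hypothetical column lists skewed_cols and scaled_cols; the actual feature split used in the question is not shown:

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import PowerTransformer, StandardScaler

skewed_cols = ['LotFrontage', 'MasVnrArea']   # placeholder: heavily skewed features
scaled_cols = [c for c in withoutOutliers.columns if c not in skewed_cols]

# PowerTransformer (Yeo-Johnson by default) for skewed features,
# StandardScaler for the rest
ct = ColumnTransformer([
    ('power', PowerTransformer(), skewed_cols),
    ('scale', StandardScaler(), scaled_cols),
])
normalizedData = ct.fit_transform(withoutOutliers)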

Now I tried doing PCA and I got this:

from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

# Fit PCA on the cleaned (but unscaled) data
pca = PCA().fit(transformedData)

# Cumulative fraction of variance explained by the first k components
print(pca.explained_variance_ratio_.cumsum())

plt.plot(pca.explained_variance_ratio_.cumsum())
plt.xlabel('number of components')
plt.ylabel('cumulative explained variance')

The output of this PCA is the following:

[0.67454179 0.8541084  0.98180307 0.99979932 0.99986346 0.9999237
 0.99997091 0.99997985 0.99998547 0.99999044 0.99999463 0.99999719
 0.99999791 0.99999854 0.99999909 0.99999961 0.99999977 0.99999988
 0.99999994 0.99999998 0.99999999 1.         1.         1.
 1.         1.         1.         1.         1.         1.
 1.         1.         1.         1.         1.         1.
 1.         1.         1.         1.         1.         1.
 1.         1.         1.         1.         1.         1.
 1.         1.         1.         1.         1.         1.
 1.         1.         1.         1.         1.         1.
 1.         1.         1.         1.         1.         1.
 1.         1.         1.        ]

[plot PCA1: cumulative explained variance vs. number of components]

Then I tried:

pca = PCA().fit(withoutOutliers)

print(pca.explained_variance_ratio_.cumsum())

plt.plot(pca.explained_variance_ratio_.cumsum())
plt.xlabel('number of components')
plt.ylabel('cumulative explained variance')

Out:

[0.68447278 0.86982875 0.99806386 0.99983727 0.99989606 0.99994353
 0.99997769 0.99998454 0.99998928 0.99999299 0.9999958  0.99999775
 0.99999842 0.99999894 0.99999932 0.99999963 0.9999998  0.9999999
 0.99999994 0.99999998 0.99999999 1.         1.         1.
 1.         1.         1.         1.         1.         1.
 1.         1.         1.         1.         1.         1.
 1.         1.         1.         1.         1.         1.
 1.         1.         1.         1.         1.         1.
 1.         1.         1.         1.         1.         1.
 1.         1.         1.         1.         1.         1.
 1.         1.         1.         1.         1.         1.
 1.         1.         1.        ]

[plot PCA2: cumulative explained variance vs. number of components]

Finally:

pca = PCA().fit(normalizedData)

print(pca.explained_variance_ratio_.cumsum())

plt.plot(pca.explained_variance_ratio_.cumsum())
plt.xlabel('number of components')
plt.ylabel('cumulative explained variance')

Out:

[1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]

[plot PCA3: cumulative explained variance vs. number of components]

How is it possible that the last execution gives such an output?

Here are the data distributions:

[histogram: transformedData]

[histogram: withoutOutliers]

[histogram: normalizedData]

I'll add any further data if necessary; thanks in advance to anyone who can help!

In short, all data should be scaled before applying PCA (for example using a StandardScaler).
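
A minimal sketch of that fix, assuming the transformedData dataframe from the question; the pipeline below is illustrative, not the author's exact code:

from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Standardize every feature before PCA; without this, features with large
# variances dominate the covariance matrix and a handful of components
# appear to explain ~100% of the variance.
pipe = make_pipeline(StandardScaler(), PCA())
pipe.fit(transformedData)

pca = pipe.named_steps['pca']
print(pca.explained_variance_ratio_.cumsum())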

I got the answer on Data science stackexchange.
