Cumulative Explained Variance for PCA in Python
I have a simple R script for running FactoMineR's PCA on a tiny data frame in order to find the cumulative percentage of variance explained by each component:
library(FactoMineR)
a <- c(1, 2, 3, 4, 5)
b <- c(4, 2, 9, 23, 3)
c <- c(9, 8, 7, 6, 6)
d <- c(45, 36, 74, 35, 29)
df <- data.frame(a, b, c, d)
df_pca <- PCA(df, ncp = 4, graph=F)
print(df_pca$eig$`cumulative percentage of variance`)
Which returns:
> print(df_pca$eig$`cumulative percentage of variance`)
[1] 58.55305 84.44577 99.86661 100.00000
I'm trying to do the same in Python using scikit-learn's decomposition package as follows:
import pandas as pd
from sklearn import decomposition

a = [1, 2, 3, 4, 5]
b = [4, 2, 9, 23, 3]
c = [9, 8, 7, 6, 6]
d = [45, 36, 74, 35, 29]
df = pd.DataFrame({'a': a,
                   'b': b,
                   'c': c,
                   'd': d})
pca = decomposition.PCA(n_components=4)
pca.fit(df)
transformed_pca = pca.transform(df)
# accumulate the explained variance ratio of each component
cum_explained_var = []
for i in range(len(pca.explained_variance_ratio_)):
    if i == 0:
        cum_explained_var.append(pca.explained_variance_ratio_[i])
    else:
        cum_explained_var.append(pca.explained_variance_ratio_[i] +
                                 cum_explained_var[i - 1])
print(cum_explained_var)
But this results in:
[0.79987089715487936, 0.99224337624509307, 0.99997254568237226, 1.0]
As you can see, both sets of values correctly add up to 100%, but the contribution of each component differs between the R and Python versions. Does anyone know where these differences come from, or how to correctly replicate the R results in Python?
EDIT: Thanks to Vlo, I now know that the differences stem from the FactoMineR PCA function scaling the data by default. By using the sklearn preprocessing package (pca_data = preprocessing.scale(df)) to scale my data before running PCA, my results match the R output.

Thanks to Vlo, I learned that the difference between the FactoMineR PCA function and the sklearn PCA function is that FactoMineR scales the data by default. By simply adding a scaling step to my Python code, I was able to reproduce the results.
import pandas as pd
from sklearn import decomposition, preprocessing

a = [1, 2, 3, 4, 5]
b = [4, 2, 9, 23, 3]
c = [9, 8, 7, 6, 6]
d = [45, 36, 74, 35, 29]
df = pd.DataFrame({'a': a,
                   'b': b,
                   'c': c,
                   'd': d})
# standardize the data (zero mean, unit variance), matching FactoMineR's default
pca_data = preprocessing.scale(df)
pca = decomposition.PCA(n_components=4)
pca.fit(pca_data)
transformed_pca = pca.transform(pca_data)
cum_explained_var = []
for i in range(len(pca.explained_variance_ratio_)):
    if i == 0:
        cum_explained_var.append(pca.explained_variance_ratio_[i])
    else:
        cum_explained_var.append(pca.explained_variance_ratio_[i] +
                                 cum_explained_var[i - 1])
print(cum_explained_var)
Output:
[0.58553054049052267, 0.8444577483783724, 0.9986661265687754, 0.99999999999999978]
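As an aside, the manual accumulation loop can be collapsed into a single `numpy.cumsum` call, which is the idiomatic way to compute a running total over `explained_variance_ratio_`. A minimal sketch of the same computation:

```python
import numpy as np
import pandas as pd
from sklearn import decomposition, preprocessing

df = pd.DataFrame({'a': [1, 2, 3, 4, 5],
                   'b': [4, 2, 9, 23, 3],
                   'c': [9, 8, 7, 6, 6],
                   'd': [45, 36, 74, 35, 29]})

# Standardize first so results match FactoMineR's scaled PCA
scaled = preprocessing.scale(df)

pca = decomposition.PCA(n_components=4)
pca.fit(scaled)

# Cumulative explained variance in one call
cum_explained_var = np.cumsum(pca.explained_variance_ratio_)
print(cum_explained_var)  # ≈ [0.5855, 0.8445, 0.9987, 1.0]
```

Multiplying the result by 100 gives the same cumulative-percentage scale that FactoMineR reports.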