
How to get contributions and squared cosines in sklearn PCA?

Working primarily from this paper, I want to implement the various PCA interpretation metrics it mentions, for example the squared cosine and what the article calls the contribution.

However, the nomenclature here seems very confusing; in particular, it's not clear to me what exactly sklearn's pca.components_ is. I've seen some answers here and in various blogs stating that these are loadings, while others state that they are component scores (which I assume is the same thing as factor scores).

The paper defines the contribution (of an observation to a component) as:

$$\operatorname{ctr}_{i,l} = \frac{f_{i,l}^{2}}{\lambda_{l}}, \qquad \lambda_{l} = \sum_{i} f_{i,l}^{2}$$

(where $f_{i,l}$ is the factor score of observation $i$ on component $l$ and $\lambda_{l}$ is the eigenvalue of component $l$)

and states that all contributions for each component must add up to 1, which is not the case if one assumes that pca.explained_variance_ holds the eigenvalues and pca.components_ holds the factor scores:

import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

df = pd.DataFrame(data=[
    [0.273688, 0.42720, 0.65267],
    [0.068685, 0.008483, 0.042226],
    [0.137368, 0.025278, 0.063490],
    [0.067731, 0.020691, 0.027731],
    [0.067731, 0.020691, 0.027731]
], columns=["MeS", "EtS", "PrS"])

pca = PCA(n_components=2)
X = pca.fit_transform(df)
ctr = (pd.DataFrame(pca.components_.T**2)).div(pca.explained_variance_)
np.sum(ctr, axis=0)
# Yields 0.498437 and 0.725048 rather than 1 and 1

How can I calculate these metrics? The paper defines the squared cosine similarly as:

$$\cos^{2}_{i,l} = \frac{f_{i,l}^{2}}{d_{i,g}^{2}}, \qquad d_{i,g}^{2} = \sum_{l} f_{i,l}^{2}$$

(where $d_{i,g}^{2}$ is the squared distance of observation $i$ from the center of gravity of the data)

This paper does not play well with sklearn as far as definitions are concerned.

The pca.components_ are the two principal components of your data after your data has been centered. And pca.fit_transform(df) gives you the coordinates of your centered data set w.r.t. those two principal components, i.e., the factor scores.

> pca.fit_transform(df)
array([[ 0.60781787, -0.00280834],
       [-0.1601333 , -0.01246807],
       [-0.11667497,  0.04584743],
       [-0.1655048 , -0.01528551],
       [-0.1655048 , -0.01528551]])
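
As a quick sanity check (a minimal sketch, assuming the default PCA settings, in particular whiten=False), the factor scores are exactly the centered data projected onto those two components:

> Xc = df - df.mean(axis=0)
> np.allclose(Xc @ pca.components_.T, pca.fit_transform(df))
True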

Next, the $\lambda_{l}$ of equation (10) in the paper is just the sum of the squares of the factor scores for the l-th component, i.e., the l-th column of pca.fit_transform(df). But pca.explained_variance_ gives you the two variances, and since sklearn uses len(df.index) - 1 as the degrees of freedom, we have lambda_l == (len(df.index) - 1) * pca.explained_variance_[l].

> X = pca.fit_transform(df)
> lmbda = np.sum(X**2, axis = 0)
> lmbda
array([0.46348196, 0.00273262])

> (5-1) * pca.explained_variance_
array([0.46348196, 0.00273262])

So, to summarize, I recommend computing the contributions as:

> ctr = X**2 / np.sum(X**2, axis = 0)
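
By construction, the contributions for each component now add up to 1, as the paper requires:

> np.sum(ctr, axis = 0)
array([1., 1.])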

For the squared cosine it is the same, except that we sum over the rows of pca.fit_transform(df):

> cos_sq = X**2 / np.sum(X**2, axis = 1)[:, np.newaxis]
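
Likewise, each observation's squared cosines sum to 1 across the retained components (note that with n_components=2 the denominator runs over the two kept components only, not over all possible components):

> np.sum(cos_sq, axis = 1)
array([1., 1., 1., 1., 1.])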
