PCA on sklearn - how to interpret pca.components_

I ran PCA on a data frame with 10 features using this simple code:

pca = PCA()
fit = pca.fit(dfPca)

The result of pca.explained_variance_ratio_ shows:

array([  5.01173322e-01,   2.98421951e-01,   1.00968655e-01,
         4.28813755e-02,   2.46887288e-02,   1.40976609e-02,
         1.24905823e-02,   3.43255532e-03,   1.84516942e-03,
         4.50314168e-16])

I believe that means the first PC explains about 50% of the variance, the second component explains about 30%, and so on...

What I don't understand is the output of pca.components_. If I do the following:

df = pd.DataFrame(pca.components_, columns=list(dfPca.columns))

I get the data frame below, where each row is a principal component. What I'd like to understand is how to interpret that table. I know that if I square all the values in each component (row) and sum them I get 1, but what does the -0.56 on PC1 mean? Does it tell something about "Feature E", since it has the highest magnitude on a component that explains about 50% of the variance?

[image: the pca.components_ dataframe, one row per principal component, one column per feature (Feature A to Feature J)]

Thanks

Terminology: First of all, the results of a PCA are usually discussed in terms of component scores, sometimes called factor scores (the transformed variable values corresponding to a particular data point), and loadings (the weight by which each standardized original variable should be multiplied to get the component score).
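In sklearn terms, as used in the rest of this answer, pca.transform(X) returns the component scores and the rows of pca.components_ play the role of the loadings (note that some texts scale loadings by the square root of each component's variance; sklearn's components_ rows are unit-length eigenvectors). A minimal sketch with made-up data, not part of the original answer:

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
X = rng.rand(100, 3)          # 100 samples, 3 features (made-up data)

pca = PCA(n_components=2).fit(X)

scores = pca.transform(X)     # component scores: one row per sample, one column per PC
loadings = pca.components_    # one row per PC, one column per original feature

print(scores.shape)           # (100, 2)
print(loadings.shape)         # (2, 3)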

PART 1: I explain how to check the importance of the features and how to plot a biplot.

PART 2: I explain how to check the importance of the features and how to save them into a pandas dataframe using the feature names.

Summary in an article: Python compact guide: https://towardsdatascience.com/pca-clearly-explained-how-when-why-to-use-it-and-feature-importance-a-guide-in-python-7c274582c37e?source=friends_link&sk=65bf5440e444c24aff192fedf9f8b64f


PART 1:

In your case, the value -0.56 for Feature E is the loading of this feature on PC1. This value tells us 'how much' the feature influences the PC (in our case, PC1).

So the higher the absolute value, the greater the influence on the principal component.
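As a quick way to see this on your own table, you could rank the features by the absolute value of their weight on a given component. A small sketch, assuming pca has been fitted on your dfPca dataframe as in the question:

import pandas as pd

# One row per principal component, one column per original feature
loadings = pd.DataFrame(pca.components_,
                        columns=dfPca.columns,
                        index=['PC{}'.format(i + 1) for i in range(pca.components_.shape[0])])

# Features with the largest absolute value influence PC1 the most
print(loadings.loc['PC1'].abs().sort_values(ascending=False))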

After performing PCA, people usually plot the well-known 'biplot' to see the transformed samples in N dimensions (2 in our case) together with the original variables (features).

I wrote a function to plot this.


Example using iris data:

import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

iris = datasets.load_iris()
X = iris.data
y = iris.target

#In general it is a good idea to scale the data
scaler = StandardScaler()
scaler.fit(X)
X=scaler.transform(X)

pca = PCA()
pca.fit(X)
x_new = pca.transform(X)

def myplot(score,coeff,labels=None):
    xs = score[:,0]
    ys = score[:,1]
    n = coeff.shape[0]

    plt.scatter(xs, ys, c=y)  # scatter of the PC1/PC2 scores, colored by class
    for i in range(n):
        plt.arrow(0, 0, coeff[i,0], coeff[i,1],color = 'r',alpha = 0.5)
        if labels is None:
            plt.text(coeff[i,0]* 1.15, coeff[i,1] * 1.15, "Var"+str(i+1), color = 'g', ha = 'center', va = 'center')
        else:
            plt.text(coeff[i,0]* 1.15, coeff[i,1] * 1.15, labels[i], color = 'g', ha = 'center', va = 'center')

    plt.xlabel("PC{}".format(1))
    plt.ylabel("PC{}".format(2))
    plt.grid()

# Call the function: pass the PC1/PC2 scores and the loadings transposed to (features x 2)
myplot(x_new[:, 0:2], np.transpose(pca.components_[0:2, :]))
plt.show()
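Before reading too much into a 2-D biplot, it can also help to check how much of the total variance the first two components actually capture. A one-line sketch using the pca object fitted above:

# Fraction of the total variance captured by PC1 and PC2
print(pca.explained_variance_ratio_[:2].sum())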

Results

[image: biplot of the iris data, PC1 vs PC2 scores with red arrows for the feature loadings]

PART 2:

The important features are the ones that influence the components the most and, therefore, have a large absolute value on the component.

To get the most important features on the PCs, with names, and save them into a pandas dataframe, use this:

from sklearn.decomposition import PCA
import pandas as pd
import numpy as np
np.random.seed(0)

# 10 samples with 5 features
train_features = np.random.rand(10,5)

model = PCA(n_components=2).fit(train_features)
X_pc = model.transform(train_features)

# number of components
n_pcs= model.components_.shape[0]

# get the index of the most important feature on EACH component
# LIST COMPREHENSION HERE
most_important = [np.abs(model.components_[i]).argmax() for i in range(n_pcs)]

initial_feature_names = ['a','b','c','d','e']
# get the names
most_important_names = [initial_feature_names[most_important[i]] for i in range(n_pcs)]

# LIST COMPREHENSION HERE AGAIN
dic = {'PC{}'.format(i): most_important_names[i] for i in range(n_pcs)}

# build the dataframe
df = pd.DataFrame(dic.items())

This gives:

     0  1
 0  PC0  e
 1  PC1  d

So on PC1 the feature named e is the most important, and on PC2 it is d.
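If you also want the loading values themselves, or the top few features per component rather than just the single most important one, the same idea extends directly. A sketch reusing model, n_pcs and initial_feature_names from above:

# Top-2 features per component, together with their loadings
for i in range(n_pcs):
    comp = model.components_[i]
    top = np.abs(comp).argsort()[::-1][:2]   # indices of the 2 largest absolute loadings
    print('PC{}: {}'.format(i, [(initial_feature_names[j], round(comp[j], 3)) for j in top]))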

Summary in an article: Python compact guide: https://towardsdatascience.com/pca-clearly-explained-how-when-why-to-use-it-and-feature-importance-a-guide-in-python-7c274582c37e?source=friends_link&sk=65bf5440e444c24aff192fedf9f8b64f

Basic Idea

The principal component breakdown by features that you have there basically tells you the "direction" each principal component points to in terms of the directions of the features.

In each principal component, features that have a greater absolute weight "pull" the principal component more toward that feature's direction.

For example, we can say that in PC1, since Feature A, Feature B, Feature I, and Feature J have relatively low weights (in absolute value), PC1 does not point as much in the direction of these features in the feature space. PC1 points most in the direction of Feature E relative to the other directions.

Visualization in Lower Dimensions

For a visualization of this, look at the following figures, taken from here and here:

The following shows an example of running PCA on correlated data.

[image: scatter of correlated data with the two PCA eigenvectors overlaid]

We can visually see that both eigenvectors derived from PCA are being "pulled" in both the Feature 1 and Feature 2 directions. Thus, if we were to make a principal component breakdown table like the one you made, we would expect to see some weightage from both Feature 1 and Feature 2 explaining PC1 and PC2.

Next, we have an example with uncorrelated data.

[image: scatter of uncorrelated data with the green and pink principal components drawn along the x' and y' axes]

Let us call the green principal component PC1 and the pink one PC2. It is clear that PC1 is not pulled in the direction of feature x', and likewise PC2 is not pulled in the direction of feature y'. Thus, in our table, we must have a weightage of 0 for feature x' in PC1 and a weightage of 0 for feature y' in PC2.
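A small numeric illustration of the two figures (a sketch with synthetic data; the exact numbers will differ from the plots above, which come from other sources):

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)

# Correlated case: feature 2 is feature 1 plus some noise,
# so both components get weight from both features
x1 = rng.randn(500)
X_corr = np.column_stack([x1, x1 + 0.3 * rng.randn(500)])
print(PCA().fit(X_corr).components_.round(2))    # roughly [[0.7, 0.7], [-0.7, 0.7]], up to sign

# Uncorrelated case with different variances: each component aligns with one axis
X_uncorr = np.column_stack([3 * rng.randn(500), rng.randn(500)])
print(PCA().fit(X_uncorr).components_.round(2))  # roughly [[1, 0], [0, 1]], up to sign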

I hope this gives an idea of what you're seeing in your table.
