
Plot PCA loadings and loading labels in biplot in sklearn (like R's autoplot)

I saw this tutorial in R with autoplot. They plotted the loadings and loading labels:

autoplot(prcomp(df), data = iris, colour = 'Species',
         loadings = TRUE, loadings.colour = 'blue',
         loadings.label = TRUE, loadings.label.size = 3)

[image: the resulting autoplot biplot] https://cran.r-project.org/web/packages/ggfortify/vignettes/plot_pca.html

I prefer Python 3 with matplotlib, scikit-learn, and pandas for my data analysis. However, I don't know how to add these on.

How can you plot these vectors with matplotlib?

I've been reading Recovering features names of explained_variance_ratio_ in PCA with sklearn but haven't figured it out yet.

Here's how I plot it in Python:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn import decomposition
import seaborn as sns; sns.set_style("whitegrid", {'axes.grid' : False})

%matplotlib inline
np.random.seed(0)

# Iris dataset
DF_data = pd.DataFrame(load_iris().data, 
                       index = ["iris_%d" % i for i in range(load_iris().data.shape[0])],
                       columns = load_iris().feature_names)

Se_targets = pd.Series(load_iris().target, 
                       index = ["iris_%d" % i for i in range(load_iris().data.shape[0])], 
                       name = "Species")

# Scaling mean = 0, var = 1
DF_standard = pd.DataFrame(StandardScaler().fit_transform(DF_data), 
                           index = DF_data.index,
                           columns = DF_data.columns)

# Sklearn for Principal Component Analysis
# Dims
m = DF_standard.shape[1]
K = 2

# PCA (How I tend to set it up)
Mod_PCA = decomposition.PCA(n_components=m)
DF_PCA = pd.DataFrame(Mod_PCA.fit_transform(DF_standard), 
                      columns=["PC%d" % k for k in range(1,m + 1)]).iloc[:,:K]
# Color classes
color_list = [{0:"r",1:"g",2:"b"}[x] for x in Se_targets]

fig, ax = plt.subplots()
ax.scatter(x=DF_PCA["PC1"], y=DF_PCA["PC2"], color=color_list)


You could do something like the following by creating a biplot function.

Nice article here: https://towardsdatascience.com/pca-clearly-explained-how-when-why-to-use-it-and-feature-importance-a-guide-in-python-7c274582c37e?source=friends_link&sk=65bf5440e444c24aff192fedf9f8b64f

In this example I am using the iris data:

import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.decomposition import PCA
import pandas as pd
from sklearn.preprocessing import StandardScaler

iris = datasets.load_iris()
X = iris.data
y = iris.target

# In general, it's a good idea to scale the data prior to PCA.
scaler = StandardScaler()
scaler.fit(X)
X = scaler.transform(X)
pca = PCA()
x_new = pca.fit_transform(X)

def myplot(score, coeff, labels=None):
    # score: PCA scores (samples x PCs); coeff: loadings (features x PCs)
    xs = score[:, 0]
    ys = score[:, 1]
    n = coeff.shape[0]
    # Rescale the scores by their range so points and arrows share one scale
    scalex = 1.0 / (xs.max() - xs.min())
    scaley = 1.0 / (ys.max() - ys.min())
    plt.scatter(xs * scalex, ys * scaley, c=y)  # colors come from the global y
    for i in range(n):
        # Draw each loading as an arrow from the origin, labeled just past its tip
        plt.arrow(0, 0, coeff[i, 0], coeff[i, 1], color='r', alpha=0.5)
        if labels is None:
            plt.text(coeff[i, 0] * 1.15, coeff[i, 1] * 1.15, "Var" + str(i + 1),
                     color='g', ha='center', va='center')
        else:
            plt.text(coeff[i, 0] * 1.15, coeff[i, 1] * 1.15, labels[i],
                     color='g', ha='center', va='center')
    plt.xlim(-1, 1)
    plt.ylim(-1, 1)
    plt.xlabel("PC{}".format(1))
    plt.ylabel("PC{}".format(2))
    plt.grid()

# Call the function. Use only the first 2 PCs.
myplot(x_new[:, 0:2], np.transpose(pca.components_[0:2, :]))
plt.show()
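
To label the arrows with the actual feature names instead of Var1..Var4, you can pass the function's labels argument, for instance with the iris feature names loaded above:

myplot(x_new[:, 0:2], np.transpose(pca.components_[0:2, :]), labels=iris.feature_names)
plt.show()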

RESULT

[biplot result]


Try the 'pca' library. It will plot the explained variance and create a biplot.

pip install pca

from pca import pca

# Initialize to reduce the data up to the number of components that explains 95% of the variance.
model = pca(n_components=0.95)

# Or reduce the data towards 2 PCs
model = pca(n_components=2)

# Fit transform
results = model.fit_transform(X)

# Plot explained variance
fig, ax = model.plot()

# Scatter first 2 PCs
fig, ax = model.scatter()

# Make biplot with the number of features
fig, ax = model.biplot(n_feat=4)

I found the answer here, by @teddyroland: https://github.com/teddyroland/python-biplot/blob/master/biplot.py
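
For reference, here is a minimal sketch in the spirit of that script (not a verbatim copy); it assumes a standardized data matrix X, integer class labels y, and a list feature_names:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

def simple_biplot(X, y, feature_names):
    # Project the samples onto the first two principal components
    pca = PCA(n_components=2)
    scores = pca.fit_transform(X)
    plt.scatter(scores[:, 0], scores[:, 1], c=y, alpha=0.7)
    # Scale the unit-length loadings up to the spread of the scores, then draw arrows
    scale = np.abs(scores).max(axis=0)
    for name, (dx, dy) in zip(feature_names, pca.components_.T * scale):
        plt.arrow(0, 0, dx, dy, color='r', alpha=0.5)
        plt.text(dx * 1.1, dy * 1.1, name, ha='center', va='center')
    plt.xlabel('PC1')
    plt.ylabel('PC2')
    plt.show()

# usage (with the iris arrays defined elsewhere on this page):
# simple_biplot(StandardScaler().fit_transform(iris.data), iris.target, iris.feature_names)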

I'd like to add a generic solution to this topic. After doing some careful research on existing solutions (in both Python and R) and datasets (especially biological "omics" datasets), I figured out the following Python solution, which has the following advantages:

  1. It scales the scores (samples) and loadings (features) properly so that they are visually pleasing in one plot. It should be pointed out that the relative scale of samples versus features has no mathematical meaning (although their relative directions do); making them similarly sized, however, facilitates exploration.

  2. It can handle high-dimensional data where there are many features and one can only afford to visualize the top several features (arrows) that drive the most variance of the data. This involves explicit selection and scaling of the top features.

An example of the final output (using "Moving Pictures", a classical dataset in my research field):

movpic_biplot

Preparation:

import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

Basic example: displaying all features (arrows)

We will use the iris dataset (150 samples by 4 features).

# load data
iris = datasets.load_iris()
X = iris.data
y = iris.target
targets = iris.target_names
features = iris.feature_names

# standardization
X_scaled = StandardScaler().fit_transform(X)

# PCA
pca = PCA(n_components=2).fit(X_scaled)
X_reduced = pca.transform(X_scaled)

# coordinates of samples (i.e., scores; let's take the first two axes)
scores = X_reduced[:, :2]

# coordinates of features (i.e., loadings; note the transpose)
loadings = pca.components_[:2].T

# proportions of variance explained by axes
pvars = pca.explained_variance_ratio_[:2] * 100

Here comes the critical part: scale the features (arrows) properly to match the samples (points). The following code scales by the maximum absolute value of the samples on each axis.

arrows = loadings * np.abs(scores).max(axis=0)

Another way, as discussed in seralouk's answer, is to scale by the range (max minus min). But it will make the arrows larger than the points.

# arrows = loadings * np.ptp(scores, axis=0)
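
For intuition, a quick check of the two scale factors on the iris scores (np.ptp is max minus min; since PCA scores are centered around zero, the range is roughly twice the maximum absolute value, which is why range-scaled arrows come out larger):

print(np.abs(scores).max(axis=0))  # the scaling used above
print(np.ptp(scores, axis=0))      # range scaling; roughly twice as large here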

Then plot the points and arrows:

plt.figure(figsize=(5, 5))

# samples as points
for i, name in enumerate(targets):
    plt.scatter(*zip(*scores[y == i]), label=name)
plt.legend(title='Species')

# empirical formula to determine arrow width
width = -0.0075 * np.min([np.subtract(*plt.xlim()), np.subtract(*plt.ylim())])

# features as arrows
for i, arrow in enumerate(arrows):
    plt.arrow(0, 0, *arrow, color='k', alpha=0.5, width=width, ec='none',
              length_includes_head=True)
    plt.text(*(arrow * 1.05), features[i],
             ha='center', va='center')

# axis labels
for i, axis in enumerate('xy'):
    getattr(plt, f'{axis}ticks')([])
    getattr(plt, f'{axis}label')(f'PC{i + 1} ({pvars[i]:.2f}%)')

iris_biplot

Compare the result with that of the R solution. You can see that they are quite consistent. (Note: the PCAs of R and scikit-learn may have opposite axes, since the sign of each component is arbitrary. You can flip one of them to make the directions consistent; a sketch follows the R plots below.)

iris.pca <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)
biplot(iris.pca, scale = 0)

iris_r

library(ggfortify)
autoplot(iris.pca, data = iris, colour = 'Species',
         loadings = TRUE, loadings.colour = 'dimgrey',
         loadings.label = TRUE, loadings.label.colour = 'black')

iris_auto
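
If your Python plot comes out mirrored relative to R, here is a minimal sketch of flipping PC1 on the Python side (assuming the scores and arrows arrays from the basic example above):

# Eigenvectors are only defined up to sign, so flipping a PC changes nothing mathematically
scores[:, 0] *= -1
arrows[:, 0] *= -1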

Advanced example: displaying only the top k features

We will use the digits dataset (1797 samples by 64 features).

# load data
digits = datasets.load_digits()
X = digits.data
y = digits.target
targets = digits.target_names
features = digits.feature_names

# analysis
X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=2).fit(X_scaled)
X_reduced = pca.transform(X_scaled)

# results
scores = X_reduced[:, :2]
loadings = pca.components_[:2].T
pvars = pca.explained_variance_ratio_[:2] * 100

Now, we will find the top k features that best explain our data.

k = 8

Method 1: Find the top k arrows that appear longest (i.e., farthest from the origin) in the visible plot:

  • Note that all features are equally long in the m-by-m space (m is the total number of features), but their lengths differ in the 2-by-m space, and the following code finds the longest ones in the latter.
  • This method is consistent with the microbiome program QIIME 2 / EMPeror (source code).

tops = (loadings ** 2).sum(axis=1).argsort()[-k:]
arrows = loadings[tops]
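
To check which features were selected, you can print their names (using the features list loaded above; for the digits data these are pixel positions such as 'pixel_3_4'):

print(np.asarray(features)[tops])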

Method 2: Find the top k features that drive the most variance in the visible PCs:

# tops = (loadings * pvars).sum(axis=1).argsort()[-k:]
# arrows = loadings[tops]

Now there is a new problem: when the number of features is large, the top k features are only a very small portion of all features, so their contribution to the data variance is tiny and they will look tiny in the plot.

To solve this, I came up with the following code. The rationale is: across all features, the sum of squared loadings is always 1 per PC. With only a small portion of the features, we scale them up so that the sum of their squared loadings is also 1. This method is tested and working, and generates nice plots.

arrows /= np.sqrt((arrows ** 2).sum(axis=0))

Then we will scale the arrows to match the samples (as discussed above):

arrows *= np.abs(scores).max(axis=0)

Now we can render the biplot:

plt.figure(figsize=(5, 5))
for i, name in enumerate(targets):
    plt.scatter(*zip(*scores[y == i]), label=name, s=8, alpha=0.5)
plt.legend(title='Class')

width = -0.005 * np.min([np.subtract(*plt.xlim()), np.subtract(*plt.ylim())])
for i, arrow in zip(tops, arrows):
    plt.arrow(0, 0, *arrow, color='k', alpha=0.75, width=width, ec='none',
              length_includes_head=True)
    plt.text(*(arrow * 1.15), features[i], ha='center', va='center')

for i, axis in enumerate('xy'):
    getattr(plt, f'{axis}ticks')([])
    getattr(plt, f'{axis}label')(f'PC{i + 1} ({pvars[i]:.2f}%)')

digits_biplot

I hope my answer is useful to the community.

To plot the PCA loadings and loading labels in a biplot using matplotlib and scikit-learn, you can follow these steps:

After fitting the PCA model using decomposition.PCA, retrieve the loadings matrix from the model's components_ attribute. This matrix holds the loading of each original feature on each principal component.

Determine the number of features from the loadings matrix and create a list of labels from the names of the original features; these will annotate the arrows.

Normalize the loadings matrix so that each loading vector has length 1. This makes the loadings easier to visualize on the biplot.

Plot the loadings as arrows on the biplot using pyplot.quiver, drawing each loading vector from the origin.

Set the axis tick marks on the biplot using pyplot.xticks and pyplot.yticks.

Here is an example of how you can modify your code to plot the PCA loadings and loading labels in a biplot. Add the loading labels to the biplot using pyplot.text; you can position each label at the coordinates of its loading vector, and set the font size and color using the fontsize and color parameters.

Plot the data points on the biplot using pyplot.scatter.

Add a legend to the plot using pyplot.legend to distinguish the different species.

Here is the complete code with the above modifications applied:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn import decomposition
import seaborn as sns; sns.set_style("whitegrid", {'axes.grid' : False})

%matplotlib inline
np.random.seed(0)

# Iris dataset
DF_data = pd.DataFrame(load_iris().data, 
                       index = ["iris_%d" % i for i in range(load_iris().data.shape[0])],
                       columns = load_iris().feature_names)

Se_targets = pd.Series(load_iris().target, 
                       index = ["iris_%d" % i for i in range(load_iris().data.shape[0])], 
                       name = "Species")

# Scaling mean = 0, var = 1
DF_standard = pd.DataFrame(StandardScaler().fit_transform(DF_data), 
                           index = DF_data.index,
                           columns = DF_data.columns)

# Sklearn for Principal Component Analysis
# Dims
m = DF_standard.shape[1]
K = 2

# PCA (How I tend to set it up)
Mod_PCA = decomposition.PCA(n_components=m)
DF_PCA = pd.DataFrame(Mod_PCA.fit_transform(DF_standard), 
                      columns=["PC%d" % k for k in range(1,m + 1)]).iloc[:,:K]

# Retrieve the loadings (one row per feature, one column per PC) and the feature labels
loadings = Mod_PCA.components_[:K].T
labels = DF_data.columns

# Normalize the loadings so each loading vector has length 1
loadings = loadings / np.linalg.norm(loadings, axis=1)[:, np.newaxis]

# Plot the loadings as arrows from the origin
plt.quiver(np.zeros(len(loadings)), np.zeros(len(loadings)),
           loadings[:, 0], loadings[:, 1],
           angles='xy', scale_units='xy', scale=1, color='blue')

# Set the axis tick marks
plt.xticks(np.linspace(-1, 1, 5))
plt.yticks(np.linspace(-1, 1, 5))

# Add the loading labels just past each arrow tip
for i, txt in enumerate(labels):
    plt.text(loadings[i, 0] * 1.1, loadings[i, 1] * 1.1, txt, fontsize=12, color='blue')

# Plot the data points, one species at a time so a legend can be built
color_map = {0: "r", 1: "g", 2: "b"}
for i, species in enumerate(load_iris().target_names):
    mask = (Se_targets == i).values
    plt.scatter(DF_PCA.loc[mask, "PC1"], DF_PCA.loc[mask, "PC2"],
                color=color_map[i], label=species)

# Add a legend to distinguish the species
plt.legend(title="Species")
plt.show()
