
Plotting PCA output in a scatter plot whilst colouring according to label (Python, matplotlib)

I have just completed a PCA analysis of 14 variables, which I have chosen to condense into 2 components.

from sklearn.decomposition import PCA

pca = PCA(n_components=2)
a = pca.fit_transform(z)  # fits and projects in one step; a separate pca.fit(z) is redundant

The output this gives is of the form:

[[ -3.84514275e+00  -1.19829226e-01]
 [ -4.78476227e+00  -1.35986090e-01]
 [ -2.26702900e+00  -1.19665853e+00]
 [ -5.01021616e+00   2.76005130e+00]
 [ -5.57580326e+00  -2.00656680e+00]
 [ -5.08184415e+00  -3.68721491e+00]
 [ -3.41505366e+00  -7.61184868e-01]
 [ -4.92439159e+00  -1.82147509e+00]
...
 [ -3.34931300e+00   7.57884906e-01]]

I want to do the following:

  1. Plot each observation on a scatter graph, with PC1 (x) being the first value in each array and PC2 (y) being the second value.

  2. Colour each observation according to the corresponding label type (i.e. A=red, B=blue, C=green, etc.) from the initial pre-PCA data.

  3. Label SELECTED (not ALL) observations with the name of the observation from the initial pre-PCA data (i.e. John, Peter, Sally, etc.).

Any help with any or all of these problems is greatly appreciated.

Worth noting that I attempted to do the scatter with:

plt.scatter(a[1], a[2])
plt.show()

but obviously this doesn't work, as my output a is not separated by commas and this would only plot 2 points. I can't get my head around it, so I would appreciate SO's input.

EDIT:

The dataset is of the form:

John, A, var1, var2, var3, ..., var14
Peter, A, var1, var2, var3, ..., var14
Sally, B, var1, var2, var3, ..., var14
Cath, C, var1, var2, var3, ..., var14
Jim, A, var1, var2, var3, ..., var14

I'm after something similar to this:

[image]

I think your question is now very clear. Thanks for editing!

Here's how the plot you describe can be created.

First, let's generate some example data:

# Imports used throughout this answer
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Params
n_samples  = 100
m_features =  14
selected_names = ['name_13', 'name_23', 'name_42', 'name_66']

# Generate
np.random.seed(42)
names    = ['name_%i' % i for i in range(n_samples)]
labels   = [np.random.choice(['A','B','C','D']) for i in range(n_samples)]
features = np.random.random((n_samples,m_features))
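If the data comes from a file laid out as in the question (name, label, then the variables), the same three containers can be filled by reading it row by row. The following is only a sketch: the inline text stands in for the real file, the numeric values are made up, and only three variables are shown instead of fourteen.

```python
import csv
import io

# Hypothetical stand-in for the real file; the numbers are invented
# and only 3 of the 14 variables are shown.
raw = io.StringIO(
    "John, A, 0.1, 0.5, 0.3\n"
    "Peter, A, 0.2, 0.4, 0.9\n"
    "Sally, B, 0.7, 0.1, 0.6\n"
)

names, labels, features = [], [], []
for row in csv.reader(raw, skipinitialspace=True):
    names.append(row[0])                           # observation name
    labels.append(row[1])                          # label type (A, B, C, ...)
    features.append([float(v) for v in row[2:]])   # the numeric variables

print(names, labels, features[0])
```

The resulting names and labels lists line up index by index with the rows of features, which is what the colouring and labelling steps rely on.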

Next we do the PCA:

pca = PCA(n_components=2)
features_pca = pca.fit_transform(features)
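The result is a 2-D NumPy array with one row per observation and one column per component. That is why the attempt in the question, plt.scatter(a[1], a[2]), only drew two points: a[1] and a[2] select rows, not columns. A small sketch using the first three rows of the output printed in the question:

```python
import numpy as np

# First three rows of the PCA output shown in the question
a = np.array([[-3.84514275, -0.119829226],
              [-4.78476227, -0.135986090],
              [-2.26702900, -1.19665853]])

print(a.shape)   # (number of observations, number of components)

row = a[1]       # second observation: its PC1 and PC2 scores
pc1 = a[:, 0]    # PC1 score of every observation (x values)
pc2 = a[:, 1]    # PC2 score of every observation (y values)
```

Passing pc1 and pc2 to plt.scatter gives one point per observation.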

Then we prepare a list of length n_samples that translates the labels A, B, C, ... into colors. These can either be hand-selected colors...

# Label to color dict (manual)
label_color_dict = {'A':'red','B':'green','C':'blue','D':'magenta'}

# Color vector creation
cvec = [label_color_dict[label] for label in labels]

...or just a range of integers, which matplotlib maps to colors through its colormap.

# Label to color dict (automatic)
label_color_dict = {label:idx for idx,label in enumerate(np.unique(labels))}

# Color vector creation
cvec = [label_color_dict[label] for label in labels]
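A possible middle ground, sketched here as an assumption rather than part of the original answer, is to assign colors automatically but take them as explicit RGBA tuples from a named colormap. The choice of 'tab10' and the short labels list are illustrative only.

```python
import matplotlib.pyplot as plt

labels = ['A', 'C', 'B', 'A', 'D']   # small stand-in for the real labels

# 'tab10' is a qualitative colormap with 10 distinct colors;
# calling it with an integer returns the RGBA tuple at that index.
cmap = plt.get_cmap('tab10')
unique_labels = sorted(set(labels))
label_color_dict = {label: cmap(idx) for idx, label in enumerate(unique_labels)}

# Color vector creation, one RGBA tuple per observation
cvec = [label_color_dict[label] for label in labels]
```

Because every entry of cvec is already a concrete color, no colormap scaling is involved when a single point is drawn on its own.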

Finally, it's time to plot.

# Create the scatter plot
plt.figure(figsize=(8,8))
plt.scatter(features_pca[:,0], features_pca[:,1],
            c=cvec, edgecolor='none', alpha=0.5)  # 'none' draws no marker outline

# Add the labels
for name in selected_names:

    # Get the index of the name
    i = names.index(name)

    # Add the text label
    labelpad = 0.01   # Adjust this based on your dataset
    plt.text(features_pca[i,0]+labelpad, features_pca[i,1]+labelpad, name, fontsize=9)

    # Mark the labeled observations with a star marker
    # (vmin/vmax only matter when cvec holds numbers, not color names)
    plt.scatter(features_pca[i,0], features_pca[i,1],
                c=cvec[i], vmin=min(cvec), vmax=max(cvec),
                edgecolor='none', marker='*', s=100)

# Add the axis labels
plt.xlabel('PC 1 (%.2f%%)' % (pca.explained_variance_ratio_[0]*100))
plt.ylabel('PC 2 (%.2f%%)' % (pca.explained_variance_ratio_[1]*100))

# Done
plt.show()

As you can see, the different colors can be fed into plt.scatter via the c kwarg. In addition, I recommend drawing the markers without an outline, as this often looks clearer; this is spelled edgecolor='none' (older matplotlib versions also accepted the empty string '', which newer releases may reject). You can play with alpha to increase or decrease transparency, which makes the labeled points stand out more or less.

The labels are simply placed on the plot using plt.text with the appropriate x and y positions, which I shift slightly here (using labelpad) so that the labels sit nicely next to the marker.
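Since labelpad is in data units, it has to be retuned for each dataset. One alternative, shown here as a sketch and not as the method used above, is annotate with an offset given in points, which stays the same regardless of the axis scale. The coordinates and the name are placeholders.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for this sketch; drop it for an interactive window
import matplotlib.pyplot as plt

# Placeholder point standing in for one row of features_pca
x, y, name = 0.25, -0.4, 'name_13'

fig, ax = plt.subplots()
ax.scatter([x], [y], marker='*', s=100)

# xy is the point being labeled (data coordinates);
# xytext is the label offset, here 5 points right and 5 points up.
ann = ax.annotate(name, xy=(x, y), xytext=(5, 5),
                  textcoords='offset points', fontsize=9)
```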

For the star marker, note that vmin and vmax are important if you are using a numeric color vector: each star is drawn by a separate scatter call that sees only one value, so without them the color scale would not match the main plot and the stars would end up in the wrong color.
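The effect can be seen without drawing anything by looking at how matplotlib's Normalize class scales numbers into the 0 to 1 range that a colormap consumes. This is a sketch of the mechanism with made-up integer labels, not a trace of what plt.scatter does internally.

```python
from matplotlib.colors import Normalize

cvec = [0.0, 1.0, 2.0, 3.0]   # numeric color vector for labels A..D (illustrative)
star_value = [2.0]            # the single value seen by one star's scatter call

# Scale shared with the main plot: limits fixed from the full vector
shared = Normalize(vmin=min(cvec), vmax=max(cvec))

# No limits given: they are taken from the data passed in, here one value
alone = Normalize()

shared_pos = float(shared(star_value)[0])   # position of 2 on the 0..3 scale
alone_pos = float(alone(star_value)[0])     # position when vmin == vmax == 2

print(shared_pos, alone_pos)
```

With the shared limits the value lands two thirds of the way along the colormap, matching the main plot; on its own it collapses to the bottom of the colormap.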

Here's the result (using the manually defined colors):

[image]
