
How to plot distance biplot and correlation biplot results of SVD/PCA in R?

I searched for a long time for a straightforward explanation of distance vs. correlation biplots, as well as an explanation of how to transform the standard outputs of PCA to obtain the two biplots. All the Stack Overflow explanations I saw (1, 2, 3, 4) went way over my head with math terms. How can I create both a distance biplot and a correlation biplot using the outputs of R's prcomp?

The best explanation I found is a set of lecture slides from Pierre Legendre, Département de sciences biologiques, Université de Montréal ( http://biol09.biol.umontreal.ca/PLcourses/Ordination_section_1.1_PCA_Eng.pdf ). However, while these slides show how to build a distance biplot and a correlation biplot manually, they don't show how to build them from the results of prcomp.

So I worked through an example that shows how to use the outputs of prcomp so that they are equivalent to the example walked through in the pdf above. I am leaving this here for future people like myself who are wondering how to plot a distance vs. correlation biplot and when to use each (according to Pierre Legendre).

set.seed(1)

#Run standard PCA (scale. = TRUE standardizes each variable to unit variance)
pca_res <- prcomp(mtcars[, 1:7], center = TRUE, scale. = TRUE, retx = TRUE)

#To draw a distance biplot, simply plot pca_res$x as points and
#pca_res$rotation as vectors
library(ggplot2)

arrow_len <- 3 #arbitrary scaling of arrows so they're same mag as PC scores
#PC1 goes on the x axis and PC2 on the y axis for both points and arrows
ggplot(data = as.data.frame(pca_res$x), aes(x = PC1, y = PC2)) +
  geom_point() +
  geom_segment(data = as.data.frame(pca_res$rotation),
               aes(x = 0, y = 0, xend = arrow_len*PC1, yend = arrow_len*PC2),
               arrow = arrow(length = unit(0.02, "npc"))) +
  geom_text(data = as.data.frame(pca_res$rotation),
            mapping = aes(x = arrow_len*PC1, y = arrow_len*PC2,
                          label = row.names(pca_res$rotation)))

#This is equivalent to the following steps:
Y_centered <- scale(mtcars[, 1:7], center = TRUE, scale = TRUE)
Y_eig <- eigen(cov(Y_centered))
#Note that Y_eig$vectors matches pca_res$rotation ("rotations" or "loadings")
# up to the sign of each column, and Y_eig$values (eigenvalues) equals
# pca_res$sdev^2

#For a distance biplot
U_frame <- Y_eig$vectors
#F is your PC scores, achieved by multiplying your original data by the rotations
F_frame <- Y_centered %*% U_frame

#flipping constants if needed bc PC axis direction is arbitrary
x_flip <- -1
y_flip <- -1
#V1 (first axis) goes on x and V2 (second axis) on y for points and arrows
ggplot(data = as.data.frame(F_frame), aes(x = x_flip*V1, y = y_flip*V2)) +
  geom_point() +
  geom_segment(data = as.data.frame(U_frame),
               aes(x = 0, y = 0, xend = x_flip*arrow_len*V1, yend = y_flip*arrow_len*V2),
               arrow = arrow(length = unit(0.02, "npc"))) +
  geom_text(data = as.data.frame(U_frame),
            mapping = aes(x = x_flip*arrow_len*V1, y = y_flip*arrow_len*V2,
                          label = colnames(Y_centered)))

#To draw a correlation biplot, matrix multiply your rotations/loadings
# by a diagonal matrix holding your PCA standard deviations
# (equivalent to the sqrt of your eigenvalues)
U_frame_scaling2 <- U_frame %*% diag(Y_eig$values^(0.5))

#And divide your PC scores by your PCA standard deviations
# (equivalent to multiplying by 1/sqrt(eigenvalues))
F_frame_scaling2 <- F_frame %*% diag(Y_eig$values^(-0.5))

#Plot, directly from the prcomp outputs
arrow_len <- 1.5 #arbitrary scaling of arrows so they're same mag as PC scores

#%*% with diag() drops the PC column names, so the columns become V1, V2, ...
ggplot(data = as.data.frame(pca_res$x %*% diag(1/pca_res$sdev)),
       aes(x = V1, y = V2)) +
  geom_point() +
  geom_segment(data = as.data.frame(pca_res$rotation %*% diag(pca_res$sdev)),
               aes(x = 0, y = 0, xend = arrow_len*V1, yend = arrow_len*V2),
               arrow = arrow(length = unit(0.02, "npc"))) +
  geom_text(data = as.data.frame(pca_res$rotation %*% diag(pca_res$sdev)),
            mapping = aes(x = arrow_len*V1, y = arrow_len*V2,
                          label = row.names(pca_res$rotation)))

#Same plot, from the manual eigen decomposition
ggplot(data = as.data.frame(F_frame_scaling2), aes(x = x_flip*V1, y = y_flip*V2)) +
  geom_point() +
  geom_segment(data = as.data.frame(U_frame_scaling2),
               aes(x = 0, y = 0, xend = x_flip*arrow_len*V1, yend = y_flip*arrow_len*V2),
               arrow = arrow(length = unit(0.02, "npc"))) +
  geom_text(data = as.data.frame(U_frame_scaling2),
            mapping = aes(x = x_flip*arrow_len*V1, y = y_flip*arrow_len*V2,
                          label = colnames(Y_centered)))
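
If you want to convince yourself that the prcomp route and the manual eigen route really are the same thing, a quick numeric check along these lines may help. This is only a sketch: the variable names (Y, eig_match, vec_match, score_match) are mine, and the comparison uses absolute values because the sign of each axis is arbitrary.

```r
# Standardize the same seven mtcars columns used above
Y <- scale(mtcars[, 1:7], center = TRUE, scale = TRUE)

pca_res <- prcomp(mtcars[, 1:7], center = TRUE, scale. = TRUE)
Y_eig   <- eigen(cov(Y))

# Eigenvalues should equal the squared PCA standard deviations
eig_match <- all.equal(Y_eig$values, pca_res$sdev^2)

# Eigenvectors should equal the rotations, up to the sign of each column
vec_match <- all.equal(abs(Y_eig$vectors), abs(unname(pca_res$rotation)))

# Scores computed by hand should equal pca_res$x, again up to sign
score_match <- all.equal(abs(unname(Y %*% Y_eig$vectors)),
                         abs(unname(pca_res$x)))

print(c(eigenvalues = isTRUE(eig_match),
        eigenvectors = isTRUE(vec_match),
        scores = isTRUE(score_match)))
```

If any axis comes out mirrored between the two plots, that is what the x_flip and y_flip constants are for.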

As for the differences between the two (in case the pdf above becomes unavailable at some point):

Scaling type 1: distance biplot, used when the interest is in the positions of the objects with respect to one another.

  • Plot matrices F to represent the objects and U for the variables.

Scaling type 2: correlation biplot, used when the angular relationships among the variables are of primary interest.

  • Plot matrices G to represent the objects and Usc2 for the variables, where G = FΛ^(-1/2) and Usc2 = UΛ^(1/2).
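
In terms of prcomp outputs, F is pca$x, U is pca$rotation, and Λ^(1/2) is diag(pca$sdev). One way to wrap both scalings in a single helper is sketched below. The function name biplot_scaling and its return format are my own invention, not part of any package, and the sketch assumes the PCA was run without truncating the number of components.

```r
# Hypothetical helper: returns object ("sites") and variable ("species")
# coordinates for scaling 1 (distance) or scaling 2 (correlation)
biplot_scaling <- function(pca, scaling = 1) {
  stopifnot(scaling %in% c(1, 2))
  k <- seq_len(ncol(pca$x))   # components actually kept
  sdev <- pca$sdev[k]
  if (scaling == 1) {
    # F and U, used as is
    list(sites = pca$x, species = pca$rotation[, k, drop = FALSE])
  } else {
    # G = F Lambda^(-1/2), Usc2 = U Lambda^(1/2); sweep() keeps the PC names
    list(sites   = sweep(pca$x, 2, sdev, "/"),
         species = sweep(pca$rotation[, k, drop = FALSE], 2, sdev, "*"))
  }
}

pca <- prcomp(mtcars[, 1:7], center = TRUE, scale. = TRUE)
sc2 <- biplot_scaling(pca, scaling = 2)
head(sc2$sites[, 1:2])
sc2$species[, 1:2]
```

Because sweep() preserves the column names, the result can be passed to ggplot with aes(x = PC1, y = PC2) instead of V1 and V2.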

In scaling 1 (distance biplot),

  • the sites have variances, along each axis (or principal component), equal to the axis eigenvalue (columns of F);
  • the eigenvectors (columns of U) are normed to length 1;
  • the length (norm) of each species vector in the p-dimensional ordination space (rows of U) is 1.
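
These three properties can be checked numerically on the mtcars example. A minimal sketch, with variable names of my own choosing:

```r
pca <- prcomp(mtcars[, 1:7], center = TRUE, scale. = TRUE)
F_mat <- pca$x         # object (site) scores
U     <- pca$rotation  # eigenvectors / loadings

# Variance of the scores along each axis vs. the eigenvalues
site_var <- apply(F_mat, 2, var)
print(rbind(site_var, eigenvalue = pca$sdev^2))

# Norm of each eigenvector (columns of U)
col_norms <- sqrt(colSums(U^2))
print(col_norms)

# Norm of each variable vector in the full 7-dimensional space (rows of U)
row_norms <- sqrt(rowSums(U^2))
print(row_norms)
```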

In scaling 2 (correlation biplot),

  • the sites have unit variance along each axis (columns of G);
  • the eigenvectors (columns of Usc2) are normed to lengths = sqrt(eigenvalues);
  • the norm of each species vector in the p-dimensional ordination space (rows of Usc2) is its standard deviation.
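
The same kind of check works for scaling 2. Again just a sketch with my own variable names; since the variables were standardized here, each variable's standard deviation is 1.

```r
pca  <- prcomp(mtcars[, 1:7], center = TRUE, scale. = TRUE)
G    <- sweep(pca$x, 2, pca$sdev, "/")         # F Lambda^(-1/2)
Usc2 <- sweep(pca$rotation, 2, pca$sdev, "*")  # U Lambda^(1/2)

# Variance of the rescaled scores along each axis (should all be 1)
g_var <- apply(G, 2, var)
print(g_var)

# Norm of each column of Usc2 vs. sqrt(eigenvalue), i.e. pca$sdev
usc2_col_norms <- sqrt(colSums(Usc2^2))
print(rbind(usc2_col_norms, sdev = pca$sdev))

# Norm of each row of Usc2 vs. the standard deviation of that variable
# in the analyzed (standardized) data
usc2_row_norms <- sqrt(rowSums(Usc2^2))
print(usc2_row_norms)
```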

In scaling 1 (distance biplot),

  1. Distances among objects approximate their Euclidean distances in full multidimensional space.
  2. Projecting an object at right angle on a descriptor approximates the position of the object along that descriptor.
  3. Since descriptors have equal lengths of 1 in the full-dimensional space, the length of the projection of a descriptor in reduced space indicates how much it contributes to the formation of that space.
  4. A scaling 1 biplot thus shows which variables contribute the most to the ordination in a few dimensions (see also section: Equilibrium contribution of variables).
  5. The descriptor-axes are orthogonal (90°) to one another in multidimensional space. These right angles, projected in reduced space, do not reflect the variables' correlations.
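
Points 1 and 3 can be illustrated on the mtcars example. This is a rough sketch with variable names I made up: distances computed on all seven PC axes reproduce the Euclidean distances of the standardized data exactly, while the two axes shown in the biplot only approximate them, and the 2-D arrow lengths are the projected lengths discussed in point 3.

```r
Y   <- scale(mtcars[, 1:7], center = TRUE, scale = TRUE)
pca <- prcomp(mtcars[, 1:7], center = TRUE, scale. = TRUE)

d_full <- as.vector(dist(Y))              # distances in the original space
d_all  <- as.vector(dist(pca$x))          # distances using all 7 PC axes
d_2d   <- as.vector(dist(pca$x[, 1:2]))   # distances as drawn in the biplot

# All axes: identical to the original Euclidean distances
print(all.equal(d_full, d_all))

# Two axes: an approximation (a projection can only shorten distances)
print(cor(d_full, d_2d))

# Length of each variable arrow in the PC1-PC2 plane (at most 1);
# longer arrows contribute more to the first two axes
arrow_len_2d <- sqrt(rowSums(pca$rotation[, 1:2]^2))
print(sort(arrow_len_2d, decreasing = TRUE))
```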

In scaling 2 (correlation biplot),

  1. Distances among objects approximate their Mahalanobis distances in full multidimensional space.
  2. Projecting an object at right angle on a descriptor approximates the position of the object along that descriptor.
  3. Since descriptors have lengths s_j in full-dimensional space, the length of the projection of a descriptor j in reduced space is an approximation of its standard deviation s_j. Note: s_j is 1 when the variables have been standardized.
  4. The angles between descriptors in the biplot reflect their correlations.
  5. When the distance relationships among objects are important for interpretation, this type of biplot is inadequate; a distance biplot should be used.
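
Points 1 and 4 can also be checked numerically. The sketch below (variable names are mine) compares the distance between the first two cars in the full G space with their Mahalanobis distance, and compares the inner products of the rows of Usc2 with the correlation matrix. Both hold exactly in the full 7-dimensional space; the 2-D biplot only approximates them.

```r
Y    <- scale(mtcars[, 1:7], center = TRUE, scale = TRUE)
pca  <- prcomp(mtcars[, 1:7], center = TRUE, scale. = TRUE)
G    <- sweep(pca$x, 2, pca$sdev, "/")
Usc2 <- sweep(pca$rotation, 2, pca$sdev, "*")

# Distance between objects 1 and 2 using all columns of G
d_G_12 <- as.matrix(dist(G))[1, 2]

# Mahalanobis distance between the same two objects
# (stats::mahalanobis returns the squared distance)
d_mahal_12 <- sqrt(mahalanobis(Y[1, ], center = Y[2, ], cov = cov(Y)))
print(c(G_space = d_G_12, mahalanobis = unname(d_mahal_12)))

# Inner products between variable vectors vs. the correlation matrix
inner_full <- Usc2 %*% t(Usc2)
print(all.equal(unname(inner_full), unname(cor(mtcars[, 1:7]))))

# Using only the two plotted axes gives an approximation of the correlations
inner_2d <- Usc2[, 1:2] %*% t(Usc2[, 1:2])
print(round(inner_2d["mpg", "wt"], 2))
print(round(cor(mtcars$mpg, mtcars$wt), 2))
```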
