
Calculate the transformation of a PCA in R?

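I am looking for the weights that represent the mapping from a data set to its PCs. The aim is to set up a "calibrated", fixed space, e.g. from three types of wine, so that when a new observation (e.g. a new wine) is introduced it can be placed within the previously calibrated space without changing the fixed PC values. A new observation would then be placed by applying the same transformation that was applied to the first three types.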

library(ggbiplot)
data(wine)
wine.pca <- prcomp(wine, center = TRUE, scale. = TRUE)
print(ggbiplot(wine.pca, obs.scale = 1, var.scale = 1, groups = wine.class, ellipse = TRUE, circle = TRUE))

Edit: The wine data set is split to give training data, from which I obtain what I call the calibrated space.

samp <- sample(nrow(wine), nrow(wine)*0.75)
wine.train <- wine[samp,]

The data set to be validated is then taken as the rows not used for training, e.g.

wine.valid <- wine[-samp,]

#PCA on training data
wine.train.pca <- prcomp(wine.train, center = TRUE, scale. = TRUE)
#use the transformation matrix from the training data to predict the validation data
pred <- predict(wine.train.pca, newdata = wine.valid)

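How to display the calibrated space produced by the training data together with the transformed validation/test data is subsequently addressed in this thread.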

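This is easy to do with the predict function for prcomp. Below I demonstrate it by splitting your wine data into two parts: a training set and a validation set. The PCA coordinates of the validation set, predicted from the prcomp PCA fitted on the training set, are then compared with the same coordinates derived from the full data set: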

library(ggbiplot)
data(wine)

# pca on whole dataset
wine.pca <- prcomp(wine, center = TRUE, scale. = TRUE)

# pca on training part of dataset, then project new data onto pca coordinates 
set.seed(1)
samp <- sample(nrow(wine), nrow(wine)*0.75)
wine.train <- wine[samp,]
wine.valid <- wine[-samp,]
wine.train.pca <- prcomp(wine.train, center = TRUE, scale. = TRUE)
pred <- predict(wine.train.pca, newdata = wine.valid)

# plot original vs predicted pca coordinates
matplot(wine.pca$x[-samp, 1:4], pred[, 1:4])


You can also look at the correlation between the predicted and original coordinates, and see that it is very high for the leading PCs:

# correlation of predicted coordinates
abs(diag(cor(wine.pca$x[-samp,], pred[,])))
#       PC1       PC2       PC3       PC4       PC5       PC6       PC7       PC8       PC9      PC10 
# 0.9991291 0.9955028 0.9882540 0.9418268 0.9681989 0.9770390 0.9603593 0.8991734 0.8090762 0.9326917 
#      PC11      PC12      PC13 
# 0.9270951 0.9596963 0.9397388 
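
To make the "weights" from the question explicit: a fitted prcomp object stores the training means (center), the training standard deviations (scale) and the loadings (rotation), and predict applies exactly these to new rows. The sketch below is only an illustration of that; it uses the numeric columns of the built-in iris data as a stand-in (my assumption, so it runs without ggbiplot), and the names X, train, valid, manual and new_obs are made up for the example.

```r
# Stand-in data: the four numeric columns of the built-in iris data set
# (an assumption, chosen so the sketch runs without ggbiplot's wine data).
X <- as.matrix(iris[, 1:4])

set.seed(1)
samp  <- sample(nrow(X), floor(nrow(X) * 0.75))
train <- X[samp, ]
valid <- X[-samp, ]

# Fit the PCA on the training rows only: this fixes the "calibrated" space.
pca <- prcomp(train, center = TRUE, scale. = TRUE)

# The mapping is fully described by three stored components:
#   pca$center   - training column means
#   pca$scale    - training column standard deviations
#   pca$rotation - loadings (one column of weights per PC)
manual <- scale(valid, center = pca$center, scale = pca$scale) %*% pca$rotation

# predict() performs the same computation on new rows.
pred <- predict(pca, newdata = valid)
all.equal(manual, pred)

# A single new observation is projected the same way,
# without refitting or changing the training PCs.
new_obs    <- valid[1, , drop = FALSE]
new_scores <- predict(pca, newdata = new_obs)
new_scores
```

Because the centering, scaling and loadings all come from the training fit, adding new observations never moves the existing PCs; with the wine data the same three components of wine.train.pca would play this role.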

Edit:

Here is an example of classification using randomForest:

library(ggbiplot)
data(wine)
wine$class <- wine.class

# install.packages("randomForest")
library(randomForest)

set.seed(1)
train <- sample(nrow(wine), nrow(wine)*0.5)
valid <- seq(nrow(wine))[-train]
winetrain <- wine[train,]
winevalid <- wine[valid,]

modfit <- randomForest(class ~ ., data = winetrain, ntree = 500) # the argument is ntree (lower case)
pred <- predict(modfit, newdata=winevalid, type='class')

The importance of each variable can be returned with:

importance(modfit) # importance of variables in prediction
#                MeanDecreaseGini
# Alcohol               8.5032770
# MalicAcid             1.3122286
# Ash                   0.6827924
# AlcAsh                1.9517369
# Mg                    1.3632713
# Phenols               2.7943536
# Flav                  6.5798205
# NonFlavPhenols        1.1712744
# Proa                  1.2412928
# Color                 8.7097870
# Hue                   5.2674082
# OD                    6.6101764
# Proline              10.7032775

And the prediction accuracy is returned as follows:

TAB <- table(pred, winevalid$class) # table of predictions vs. original classifications
TAB
# pred         barolo grignolino barbera
#   barolo         29          1       0
#   grignolino      1         30       0
#   barbera         0          1      27

sum(diag(TAB)) / sum(TAB) # overall accuracy
# [1] 0.9662921
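
If the overall figure is too coarse, per-class rates can be read off the same table. The sketch below simply re-enters the confusion table printed above as a matrix so that it runs on its own; the names classes, precision and recall are mine, and with a real fit you would use TAB directly.

```r
# Confusion table copied from the output above
# (rows = predicted class, columns = original class).
classes <- c("barolo", "grignolino", "barbera")
TAB <- matrix(c(29,  1,  0,
                 1, 30,  0,
                 0,  1, 27),
              nrow = 3, byrow = TRUE,
              dimnames = list(pred = classes, truth = classes))

# Recall: share of each original class that was predicted correctly.
recall <- diag(TAB) / colSums(TAB)

# Precision: share of each predicted class that was actually that class.
precision <- diag(TAB) / rowSums(TAB)

round(cbind(precision, recall), 3)

# Overall accuracy, as computed above.
sum(diag(TAB)) / sum(TAB)
```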

