PLS in R: Predicting new observations returns fitted values instead
Over the past few days I have developed multiple PLS models in R for spectral data (wavebands as explanatory variables) and various vegetation parameters (as individual response variables). In total, the dataset comprises 56 observations. The first 28 (the training set) were used for model calibration; now all I want to do is predict the response values for the remaining 28 observations in the test set. For some reason, however, R keeps returning the fitted values of the calibration set for a given number of components rather than predictions for the independent test set. Here is what the model looks like, in short.
# first simulate some data
set.seed(123)
bands=101
data <- data.frame(matrix(runif(56*bands),ncol=bands))
colnames(data) <- paste0(1:bands)
data$height <- rpois(56,10)
data$fbm <- rpois(56,10)
data$nitrogen <- rpois(56,10)
data$carbon <- rpois(56,10)
data$chl <- rpois(56,10)
data$ID <- 1:56
data <- as.data.frame(data)
caldata <- data[1:28,] # define model training set
valdata <- data[29:56,] # define model testing set
# define explanatory variables (x)
spectra <- caldata[,1:101]
# build PLS model using training data only
library(pls)
refl.pls <- plsr(height ~ spectra, data = caldata, ncomp = 10,
                 validation = "LOO", jackknife = TRUE)
It was then identified that a model comprising 3 components yielded the best performance without overfitting. Hence, the following command was used to predict the values of the 28 observations in the test set using the above calibrated PLS model with 3 components:
predict(refl.pls, ncomp = 3, newdata = valdata)
Sensible as the output may seem, I soon discovered that all this piece of code generates is the fitted values of the PLS model for the calibration/training data, rather than predictions. I discovered this because the code below, in which newdata = is omitted, yields identical results.
predict(refl.pls, ncomp = 3)
Surely something must be going wrong, although I cannot seem to find out what specifically. Is there someone out there who can, and is willing to, help me move in the right direction?
I think the problem is with the nature of the input data. Looking at ?plsr and str(yarn), which goes with the example, plsr requires a very specific data frame that I find tricky to work with. The input data frame should have a matrix as one of its elements (in your case, the spectral data). I think the following works correctly (note that I changed the size of the training set so that it wasn't half the original data, for troubleshooting):
library("pls")
set.seed(123)
bands=101
spectra = matrix(runif(56*bands),ncol=bands)
DF <- data.frame(spectra = I(spectra),
height = rpois(56,10),
fbm = rpois(56,10),
nitrogen = rpois(56,10),
carbon = rpois(56,10),
chl = rpois(56,10),
ID = 1:56)
class(DF$spectra) <- "matrix" # just to be certain, it was "AsIs"
str(DF)
DF$train <- rep(FALSE, 56)
DF$train[1:20] <- TRUE
refl.pls <- plsr(height ~ spectra, data = DF, ncomp = 10,
                 validation = "LOO", jackknife = TRUE, subset = train)
res <- predict(refl.pls, ncomp = 3, newdata = DF[!DF$train,])
Note that I got the spectral data into the data frame as a matrix by protecting it with I, which equates to AsIs. There might be a more standard way to do this, but it works. As I said, to me a matrix inside of a data frame is not completely intuitive or easy to grok.
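For what it's worth, a matrix column is easier to reason about when seen next to the alternative. The sketch below uses base R only; the names DF0, DF1 and DF2 are made up for illustration. It contrasts a plain data.frame() call, which splits the matrix into one column per waveband, with two ways of keeping it as a single matrix column.

```r
set.seed(123)
bands <- 101
spectra <- matrix(runif(56 * bands), ncol = bands)
height <- rpois(56, 10)

# Plain data.frame(): the matrix is split into one column per waveband
DF0 <- data.frame(height = height, spectra = spectra)

# Option 1: protect the matrix with I() so it stays a single column
DF1 <- data.frame(height = height, spectra = I(spectra))

# Option 2: build the data frame first, then assign the matrix as one column
DF2 <- data.frame(height = height)
DF2$spectra <- spectra

ncol(DF0)               # 102: height plus 101 separate waveband columns
ncol(DF1)               # 2: height plus one matrix column
dim(DF1$spectra)        # 56 101
is.matrix(DF2$spectra)  # TRUE
```

Either DF1 or DF2 has the shape that a formula such as height ~ spectra expects; row subsetting (for example DF2[1:28, ]) keeps the matrix column and subsets its rows along with everything else.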
As to why your version didn't work quite right, I think the best explanation is that everything needs to be in the one data frame you pass to plsr for the data sources to be completely unambiguous.
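Applied to the layout in the question, this could look like the sketch below. It assumes the same kind of simulated data, keeps only the height response for brevity, and adds the wavebands as one matrix column called spectra before the data are split, so that caldata and valdata each carry their own rows. As far as I understand it, when a formula variable is missing from newdata, R falls back to the object of that name in the environment where the formula was created, which would explain why the training spectra were silently reused. The last lines are only a sanity check that the predictions are no longer the fitted values.

```r
library(pls)

set.seed(123)
bands <- 101
data <- data.frame(matrix(runif(56 * bands), ncol = bands))
data$height <- rpois(56, 10)

# Store all wavebands as a single matrix column inside the data frame
data$spectra <- as.matrix(data[, 1:bands])

caldata <- data[1:28, ]   # training set
valdata <- data[29:56, ]  # test set, with its own rows of 'spectra'

fit <- plsr(height ~ spectra, data = caldata, ncomp = 10, validation = "LOO")

fitted3 <- predict(fit, ncomp = 3)                     # fitted values (training rows)
pred3   <- predict(fit, ncomp = 3, newdata = valdata)  # predictions (test rows)

# Sanity check: predictions should differ from the fitted values
isTRUE(all.equal(c(fitted3), c(pred3)))

# Test-set RMSE computed by hand from the 3-component predictions
rmse_test <- sqrt(mean((valdata$height - c(pred3))^2))
rmse_test
```

If the all.equal() check ever comes back TRUE, that is a hint that newdata is being ignored and the variable in the formula is being picked up from somewhere other than the data frame passed in.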