PLS in R: Predicting new observations returns fitted values instead

Over the past few days I have developed several PLS models in R for spectral data (wavebands as explanatory variables) and various vegetation parameters (each as an individual response variable). In total, the dataset comprises 56 observations. The first 28 (the training set) were used for model calibration; now all I want to do is predict the response values for the remaining 28 observations in the test set. For some reason, however, R keeps returning the fitted values of the calibration set for a given number of components rather than predictions for the independent test set. In short, this is what the model looks like:

# first simulate some data
set.seed(123)
bands=101
data <- data.frame(matrix(runif(56*bands),ncol=bands))
colnames(data) <- paste0(1:bands)
data$height <- rpois(56,10)
data$fbm <- rpois(56,10)
data$nitrogen <- rpois(56,10)
data$carbon <- rpois(56,10)
data$chl <- rpois(56,10)
data$ID <- 1:56

data <- as.data.frame(data)
caldata <- data[1:28,] # define model training set
valdata <- data[29:56,] # define model testing set

# define explanatory variables (x)
spectra <- caldata[,1:101]

# build PLS model using training data only
library(pls)
refl.pls <- plsr(height ~ spectra, data = caldata, ncomp = 10,
                 validation = "LOO", jackknife = TRUE)

I then found that a model with 3 components gave the best performance without over-fitting. Hence, I used the following command to predict the values of the 28 observations in the test set using the calibrated PLS model above with 3 components:

predict(refl.pls, ncomp = 3, newdata = valdata)

Sensible as the output may seem, I soon discovered that all this code generates is the fitted values of the PLS model for the calibration/training data rather than predictions. I discovered this because the code below, in which newdata = is omitted, yields identical results:

predict(refl.pls, ncomp = 3)

Surely something is going wrong, but I cannot work out what exactly. Is there someone out there who can, and is willing to, help me move in the right direction?

I think the problem is with the nature of the input data. Looking at ?plsr and the str(yarn) output that goes with its example, plsr requires a very specific data frame layout that I find tricky to work with. The input data frame should have a matrix as one of its elements (in your case, the spectral data). I think the following works correctly (note that I changed the size of the training set so that it wasn't half the original data, for troubleshooting):

library("pls")
set.seed(123)
bands=101
spectra = matrix(runif(56*bands),ncol=bands)
DF <- data.frame(spectra = I(spectra),
        height = rpois(56,10),
        fbm = rpois(56,10),
        nitrogen = rpois(56,10),
        carbon = rpois(56,10),
        chl = rpois(56,10),
        ID = 1:56)
class(DF$spectra) <- "matrix" # just to be certain, it was "AsIs"
str(DF)

DF$train <- rep(FALSE, 56)
DF$train[1:20] <- TRUE

refl.pls <- plsr(height ~ spectra, data = DF, ncomp = 10,
                 validation = "LOO", jackknife = TRUE, subset = train)

res <- predict(refl.pls, ncomp = 3, newdata = DF[!DF$train,])

Note that I got the spectral data into the data frame as a matrix by protecting it with I(), which gives it class AsIs. There might be a more standard way to do this, but it works. As I said, to me a matrix inside a data frame is not completely intuitive or easy to grok.
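A possibly more standard route to the same layout is to assign the matrix to a column directly. The following is a minimal base-R sketch, not a tested replacement for the code above: the names wide, caldata and valdata are chosen here to mirror the question, and the band count is cut to 5 purely for illustration.

```r
set.seed(123)
bands <- 5                                    # illustrative; the question uses 101
wide <- data.frame(matrix(runif(56 * bands), ncol = bands))
wide$height <- rpois(56, 10)

# Store all band columns as ONE matrix-valued column named 'spectra'
wide$spectra <- as.matrix(wide[, 1:bands])

caldata <- wide[1:28, ]     # training rows
valdata <- wide[29:56, ]    # test rows

# The matrix column is subset row-wise together with the rest of the data frame
is.matrix(valdata$spectra)
dim(valdata$spectra)        # 28 rows, 5 columns

# The test subset holds rows 29 to 56 of the original band columns
all.equal(unname(valdata$spectra),
          unname(as.matrix(wide[29:56, 1:bands])))
```

With a spectra column present in both subsets, a formula such as height ~ spectra can find the predictor in whichever data frame is passed as data or newdata, rather than in the workspace.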

As to why your version didn't work quite right, I think the best explanation is that everything needs to be in the one data frame you pass to plsr so that the data sources are completely unambiguous. In your code, spectra is a separate object in the workspace rather than a column of caldata or valdata.
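The lookup behavior behind this can be reproduced without pls. The sketch below uses lm() as a stand-in, which is an assumption made only to keep the example in base R; it illustrates how model.frame() resolves formula variables (first in the supplied data frame, then in the formula's environment), and it carries over to plsr only to the extent that pls resolves variables the same way. All object names are hypothetical.

```r
set.seed(1)
x <- runif(10)                                        # predictor lives in the workspace
train <- data.frame(y = 2 * x + rnorm(10, sd = 0.1))  # no 'x' column
test  <- data.frame(y = rnorm(10))                    # no 'x' column either

fit <- lm(y ~ x, data = train)        # 'x' is found in the workspace

# 'x' is missing from 'test', so it is taken from the workspace again
p_new <- predict(fit, newdata = test)
p_fit <- predict(fit)
isTRUE(all.equal(unname(p_new), unname(p_fit)))  # the "predictions" equal the fitted values

# Fix: keep the predictor inside the data frames that are passed in
train2 <- data.frame(x = x, y = train$y)
test2  <- data.frame(x = runif(10))
fit2   <- lm(y ~ x, data = train2)
p_ok   <- predict(fit2, newdata = test2)         # computed from test2$x
```

In this lm() sketch, R warns about a row-count mismatch only when newdata and the variables it finds elsewhere differ in length. An even 28/28 split like the one in the question would not trigger such a warning, which may be why changing the training set size helps with troubleshooting.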

