PLS in R: Predicting new observations returns fitted values instead

Question

Over the past few days I have developed multiple PLS models in R for spectral data (wavebands as explanatory variables) and various vegetation parameters (as individual response variables). In total, the dataset comprises 56 observations. The first 28 (the training set) were used for model calibration; now all I want to do is predict the response values for the remaining 28 observations in the test set. For some reason, however, R keeps returning the fitted values of the calibration set for a given number of components rather than predictions for the independent test set. Here is what the model looks like, in short.

# first simulate some data
set.seed(123)
bands=101
data <- data.frame(matrix(runif(56*bands),ncol=bands))
colnames(data) <- paste0(1:bands)
data$height <- rpois(56,10)
data$fbm <- rpois(56,10)
data$nitrogen <- rpois(56,10)
data$carbon <- rpois(56,10)
data$chl <- rpois(56,10)
data$ID <- 1:56

data <- as.data.frame(data)
caldata <- data[1:28,] # define model training set
valdata <- data[29:56,] # define model testing set

# define explanatory variables (x)
spectra <- as.matrix(caldata[,1:101]) # as a matrix, so the formula interface accepts it

# build PLS model using training data only
library(pls)
refl.pls <- plsr(height ~ spectra, data = caldata, ncomp = 10,
                 validation = "LOO", jackknife = TRUE)

It was then identified that a model comprising 3 components yielded the best performance without overfitting. Hence, the following command was used to predict the values of the 28 observations in the testing set using the above calibrated PLS model with 3 components:

predict(refl.pls, ncomp = 3, newdata = valdata)

Sensible as the output may seem, I soon discovered that all this piece of code generates is the fitted values of the PLS model for the calibration/training data, rather than predictions. I discovered this because the code below, in which newdata = is omitted, yields identical results.

predict(refl.pls, ncomp = 3)

Surely something must be going wrong, although I cannot seem to find out what specifically. Is there someone out there who can, and is willing to, help me move in the right direction?

Answer 1

I think the problem is with the nature of the input data. Looking at ?plsr and the str(yarn) that goes with the example, plsr requires a very specific data frame that I find tricky to work with. The input data frame should have a matrix as one of its elements (in your case, the spectral data). I think the following works correctly (note that, for troubleshooting, I changed the size of the training set so that it wasn't half the original data):

library("pls")
set.seed(123)
bands=101
spectra = matrix(runif(56*bands),ncol=bands)
DF <- data.frame(spectra = I(spectra),
        height = rpois(56,10),
        fbm = rpois(56,10),
        nitrogen = rpois(56,10),
        carbon = rpois(56,10),
        chl = rpois(56,10),
        ID = 1:56)
class(DF$spectra) <- "matrix" # just to be certain, it was "AsIs"
str(DF)

DF$train <- rep(FALSE, 56)
DF$train[1:20] <- TRUE

refl.pls <- plsr(height ~ spectra, data = DF, ncomp = 10,
                 validation = "LOO", jackknife = TRUE, subset = train)

res <- predict(refl.pls, ncomp = 3, newdata = DF[!DF$train,])
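A quick way to confirm that the result holds test-set predictions rather than fitted values is to compare its row count with that of the fitted values and to compute an error on the held-out rows. The following is only a sketch: it rebuilds a reduced version of the data frame so that it runs on its own, and the names test, fitted3 and rmse_test are illustrative, not part of the code above.

```r
library(pls)

set.seed(123)
bands <- 101
spectra <- matrix(runif(56 * bands), ncol = bands)

# Reduced data frame: one matrix column (protected by I()) plus one response
DF <- data.frame(spectra = I(spectra), height = rpois(56, 10))
DF$train <- rep(c(TRUE, FALSE), times = c(20, 36))

refl.pls <- plsr(height ~ spectra, data = DF, ncomp = 10,
                 validation = "LOO", subset = train)

test <- DF[!DF$train, ]
res <- predict(refl.pls, ncomp = 3, newdata = test)
fitted3 <- predict(refl.pls, ncomp = 3)

# Row counts should match the 36 test rows and the 20 training rows
dim(res)
dim(fitted3)

# Root mean squared error of the 3-component model on the held-out rows
rmse_test <- sqrt(mean((test$height - res[, 1, 1])^2))
rmse_test
```

Because the training and test sets have different sizes here, a result with 20 rows where 36 were expected would immediately show that newdata had been ignored.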

Note that I got the spectral data into the data frame as a matrix by protecting it with I(), which gives it the class AsIs. There might be a more standard way to do this, but it works. As I said, to me a matrix inside a data frame is not completely intuitive or easy to grok.
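One alternative that avoids I() altogether, offered as a sketch rather than as the canonical approach, is to create the data frame from the scalar columns first and then assign the matrix to a new column. The name DF2 is made up for this illustration.

```r
set.seed(123)
bands <- 101
spectra <- matrix(runif(56 * bands), ncol = bands)

# Start from the scalar columns, then attach the matrix as a single column
DF2 <- data.frame(height = rpois(56, 10))
DF2$spectra <- spectra

is.matrix(DF2$spectra)         # the column is still one matrix
inherits(DF2$spectra, "AsIs")  # and it carries no AsIs class
dim(DF2)                       # `spectra` counts as one column
```

Assigning with $ keeps the matrix intact, whereas passing a bare matrix to data.frame() would split it into one column per waveband.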

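As to why your version didn't work quite right, I think the best explanation is that everything needs to be in the one data frame you pass to plsr for the data sources to be completely unambiguous. In your code, spectra is a separate object in the workspace rather than a column of caldata or valdata. When predict looks for spectra in newdata and doesn't find it there, I believe it falls back to the workspace copy, which holds the training spectra, so you get the fitted values back.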
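The difference between the two layouts can be reproduced side by side. This is a minimal sketch on simulated data with a single response; the names X, y, cal2, val2, fit_global, fit_local, same and differs are all invented for the illustration.

```r
library(pls)

set.seed(123)
bands <- 101
X <- matrix(runif(56 * bands), ncol = bands)
y <- rpois(56, 10)

# Problem layout: `spectra` lives in the workspace, not in the data frames
caldata <- data.frame(height = y[1:28])
valdata <- data.frame(height = y[29:56])
spectra <- X[1:28, ]
fit_global <- plsr(height ~ spectra, data = caldata, ncomp = 10)

# valdata has no `spectra` column, so the workspace copy is picked up again
same <- all.equal(
  as.vector(predict(fit_global, ncomp = 3)),
  as.vector(predict(fit_global, ncomp = 3, newdata = valdata))
)
same

# Fixed layout: each data frame carries its own `spectra` matrix column,
# which takes precedence over the workspace object of the same name
cal2 <- data.frame(height = y[1:28])
cal2$spectra <- X[1:28, ]
val2 <- data.frame(height = y[29:56])
val2$spectra <- X[29:56, ]
fit_local <- plsr(height ~ spectra, data = cal2, ncomp = 10)

differs <- !isTRUE(all.equal(
  as.vector(predict(fit_local, ncomp = 3)),
  as.vector(predict(fit_local, ncomp = 3, newdata = val2))
))
differs
```

In the first layout the two calls agree because both end up using the training spectra; in the second, the predictions are computed from the test spectra in val2 and no longer match the fitted values.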
