
Prediction of 'mlm' linear model object from `lm()`

I have three datasets:

response - matrix of 5 (samples) x 10 (dependent variables)

predictors - matrix of 5 (samples) x 2 (independent variables)

test_set - matrix of 10 (samples) x 2 (the same independent variables as in predictors)

response <- matrix(sample.int(15, size = 5*10, replace = TRUE), nrow = 5, ncol = 10)
colnames(response) <- c("1_DV","2_DV","3_DV","4_DV","5_DV","6_DV","7_DV","8_DV","9_DV","10_DV") 
predictors <- matrix(sample.int(15, size = 5*2, replace = TRUE), nrow = 5, ncol = 2)
colnames(predictors) <- c("1_IV","2_IV")
test_set <- matrix(sample.int(15, size = 10*2, replace = TRUE), nrow = 10, ncol = 2)
colnames(test_set) <- c("1_IV","2_IV")

I'm fitting a multivariate linear model using a training set defined as the combination of the response and predictor sets, and I would like to use this model to make predictions for the test set:

training_dataframe <- data.frame(predictors, response)
fit <- lm(response ~ predictors, data = training_dataframe)
predictions <- predict(fit, data.frame(test_set))

However, the results for predictions are really odd:

predictions

First off, the matrix dimensions are 5 x 10, i.e. the number of samples in the response by the number of DVs.

I'm not very skilled with this type of analysis in R, but shouldn't I be getting a 10 x 10 matrix, so that I have predictions for each row in my test_set?

Any help with this issue would be greatly appreciated, Martin

You are stepping into a poorly supported part of R. The model class you have is "mlm", i.e., "multiple linear models", which is not the standard "lm" class. You get it when you have several (independent) response variables for a common set of covariates / predictors. Although the lm() function can fit such a model, the predict method for the "mlm" class is poor. If you look at methods(predict), you will see a predict.mlm*. Normally, for a linear model of class "lm", predict.lm is called when you call predict; but for an "mlm" object, predict.mlm* is called.
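To make the dispatch concrete, here is a minimal sketch on made-up data (the names y, x, fit_demo and mlm_method are illustrative only and not part of the question). It shows that a matrix response yields an object whose first class is "mlm", and that the non-exported method can still be retrieved with getS3method().

```r
## Minimal sketch: a matrix response makes lm() return an "mlm" object
set.seed(1)
y <- matrix(rnorm(20), 10, 2)   # 2 responses, 10 observations (made-up data)
x <- rnorm(10)                  # 1 covariate
fit_demo <- lm(y ~ x)

class(fit_demo)                 # "mlm" comes first, so predict() dispatches on it
inherits(fit_demo, "mlm")

## predict.mlm is not exported from 'stats' (hence the asterisk in
## methods(predict)); getS3method() can still retrieve it
mlm_method <- getS3method("predict", "mlm")
is.function(mlm_method)
```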

predict.mlm* is too primitive. It does not allow se.fit, i.e., it cannot produce prediction standard errors, confidence / prediction intervals, etc., although this is possible in theory. It can only compute the prediction mean. If so, why would we want to use predict.mlm* at all?! The prediction mean can be obtained by a trivial matrix-matrix multiplication (for the standard "lm" class this is a matrix-vector multiplication), so we can do it on our own.

Consider this small, reproducible example.

set.seed(0)
## 2 responses with 10 observations each
response <- matrix(rnorm(20), 10, 2)
## 3 covariates with 10 observations each
predictors <- matrix(rnorm(30), 10, 3)
fit <- lm(response ~ predictors)

class(fit)
# [1] "mlm" "lm"

beta <- coef(fit)
#                  [,1]       [,2]
#(Intercept)  0.5773235 -0.4752326
#predictors1 -0.9942677  0.6759778
#predictors2 -1.3306272  0.8322564
#predictors3 -0.5533336  0.6218942

When you have a prediction data set:

# 2 new observations for 3 covariates
test_set <- matrix(rnorm(6), 2, 3)

we first need to pad an intercept column

Xp <- cbind(1, test_set)

Then do this matrix multiplication

pred <- Xp %*% beta
#          [,1]      [,2]
#[1,] -2.905469  1.702384
#[2,]  1.871755 -1.236240

Perhaps you have noticed that I did not even use a data frame here. Yes, it is unnecessary, as you have everything in matrix form. For R wizards, using lm.fit or even qr.solve may be more straightforward.
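For readers who want to try the lm.fit / qr.solve route, here is a rough sketch that assumes the same simulated matrices as above (regenerated here so the block runs on its own). The names X, beta_lmfit, beta_qr, pred_lmfit and pred_qr are illustrative. Both functions take the design matrix directly, so the intercept column has to be added by hand.

```r
set.seed(0)
response   <- matrix(rnorm(20), 10, 2)   # 2 responses, 10 observations
predictors <- matrix(rnorm(30), 10, 3)   # 3 covariates, 10 observations
test_set   <- matrix(rnorm(6), 2, 3)     # 2 new observations

## design matrix with an explicit intercept column
X <- cbind(1, predictors)

## least-squares coefficients, one column per response
beta_lmfit <- lm.fit(X, response)$coefficients
beta_qr    <- qr.solve(X, response)

## prediction mean: pad the intercept, then multiply
Xp <- cbind(1, test_set)
pred_lmfit <- Xp %*% beta_lmfit
pred_qr    <- Xp %*% beta_qr

## both routes should agree with each other and with lm()
all.equal(unname(pred_lmfit), unname(pred_qr))
all.equal(unname(beta_qr), unname(coef(lm(response ~ predictors))))
```

Neither route builds a model frame or parses a formula, which is why it only makes sense when everything is already in matrix form.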


But for a complete answer, I must demonstrate how to use predict.mlm to get our desired result.

## still using previous matrices
training_dataframe <- data.frame(response = I(response), predictors = I(predictors))
fit <- lm(response ~ predictors, data = training_dataframe)
newdat <- data.frame(predictors = I(test_set))
pred <- predict(fit, newdat)
#          [,1]      [,2]
#[1,] -2.905469  1.702384
#[2,]  1.871755 -1.236240

Note the I() when I use data.frame(). This is a must when we want to obtain a data frame of matrices. You can compare the difference between:

str(data.frame(response = I(response), predictors = I(predictors)))
#'data.frame':  10 obs. of  2 variables:
# $ response  : AsIs [1:10, 1:2] 1.262954.... -0.32623.... 1.329799.... 1.272429.... 0.414641.... ...
# $ predictors: AsIs [1:10, 1:3] -0.22426.... 0.377395.... 0.133336.... 0.804189.... -0.05710.... ...

str(data.frame(response = response, predictors = predictors))
#'data.frame':  10 obs. of  5 variables:
# $ response.1  : num  1.263 -0.326 1.33 1.272 0.415 ...
# $ response.2  : num  0.764 -0.799 -1.148 -0.289 -0.299 ...
# $ predictors.1: num  -0.2243 0.3774 0.1333 0.8042 -0.0571 ...
# $ predictors.2: num  -0.236 -0.543 -0.433 -0.649 0.727 ...
# $ predictors.3: num  1.758 0.561 -0.453 -0.832 -1.167 ...

Without I() to protect the matrix input, the data are messy. Surprisingly, this does not cause a problem for lm, but predict.mlm will have a hard time obtaining the correct matrix for prediction if you don't use I().
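The following sketch illustrates that pitfall, and it appears to be what happened in the question. As far as I understand the lookup rules, when the data frame has no column literally named predictors, the formula falls back to the matrix of that name in the calling environment, so predict() silently returns one row per training observation instead of one per new observation. The names bad_train, bad_fit, bad_pred, good_fit and good_pred are made up for this illustration.

```r
set.seed(0)
response   <- matrix(rnorm(20), 10, 2)
predictors <- matrix(rnorm(30), 10, 3)
test_set   <- matrix(rnorm(6), 2, 3)

## Without I(): columns are predictors.1, predictors.2, ...;
## there is no column called "predictors"
bad_train <- data.frame(response = response, predictors = predictors)
bad_fit   <- lm(response ~ predictors, data = bad_train)
bad_pred  <- predict(bad_fit, data.frame(predictors = test_set))
dim(bad_pred)    # rows match the training data, not test_set

## With I(): the matrices are stored as single columns, so newdata is used
good_fit  <- lm(response ~ predictors,
                data = data.frame(response = I(response),
                                  predictors = I(predictors)))
good_pred <- predict(good_fit, data.frame(predictors = I(test_set)))
dim(good_pred)   # one row per row of test_set
```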

Well, I would recommend using a "list" instead of a "data frame" in this case. The data argument of lm, as well as the newdata argument of predict, accepts list input. A "list" is a more general structure than a data frame and can hold any data structure without difficulty. We can do:

## still using previous matrices
training_list <- list(response = response, predictors = predictors)
fit <- lm(response ~ predictors, data = training_list)
newdat <- list(predictors = test_set)
pred <- predict(fit, newdat)
#          [,1]      [,2]
#[1,] -2.905469  1.702384
#[2,]  1.871755 -1.236240

Perhaps at the very end, I should stress that it is always safe to use the formula interface rather than the matrix interface. I will use R's built-in dataset trees as a reproducible example.

fit <- lm(cbind(Girth, Height) ~ Volume, data = trees)

## use the first two rows as prediction dataset
predict(fit, newdata = trees[1:2, ])
#     Girth   Height
#1 9.579568 71.39192
#2 9.579568 71.39192

Perhaps you still remember my saying that predict.mlm* is too primitive to support se.fit. This is the chance to test it.

predict(fit, newdata = trees[1:2, ], se.fit = TRUE)
#Error in predict.mlm(fit, newdata = trees[1:2, ], se.fit = TRUE) : 
#  the 'se.fit' argument is not yet implemented for "mlm" objects

Oops... How about confidence / prediction intervals (actually, without the ability to compute standard errors it is impossible to produce those intervals)? Well, predict.mlm* will just ignore it.

predict(fit, newdata = trees[1:2, ], interval = "confidence")
#     Girth   Height
#1 9.579568 71.39192
#2 9.579568 71.39192

So this is quite different from predict.lm.
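If intervals are needed anyway, one possible workaround (a sketch, not something predict.mlm offers) is to refit each response as an ordinary "lm" and call predict.lm on it. This relies on the fact that an "mlm" fit gives the same coefficients as separate least-squares fits per response. The names responses, newdat and per_response are illustrative.

```r
fit    <- lm(cbind(Girth, Height) ~ Volume, data = trees)
newdat <- trees[1:2, ]

## one ordinary "lm" per response, so predict.lm and its interval
## argument become available
responses <- c("Girth", "Height")
per_response <- lapply(responses, function(v) {
  f <- lm(reformulate("Volume", response = v), data = trees)
  predict(f, newdata = newdat, interval = "confidence")
})
names(per_response) <- responses

per_response$Girth    # matrix with columns fit, lwr, upr

## the point predictions should match those from the "mlm" fit
all.equal(unname(per_response$Girth[, "fit"]),
          unname(predict(fit, newdat)[, "Girth"]))
```

These intervals treat each response separately; they say nothing about the joint behaviour of Girth and Height.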
