
Bizarre behaviour of lm() and predict.lm() depending on use of explicit namespace accessor

I am interested in some disturbing behaviour of the lm function and the associated predict.lm function in R. The splines base package provides the function bs to generate B-spline expansions, which can then be used to fit a spline model using lm, a versatile linear model fitting function.

The lm and predict.lm functions have a lot of built-in convenience that takes advantage of formulas and terms. If the call to bs() is nested inside the lm call, then the user can provide univariate data to predict, and this data will automatically be expanded into the appropriate B-spline basis. This expanded matrix of data will then be predicted upon as usual.

library(splines)

x <- sort(runif(50, 0, 10))
y <- x^2

splineModel <- lm(y ~ bs(x, y, degree = 3, knots = c(3, 6)))

newData <- data.frame(x = 4)
prediction <- predict(splineModel, newData) # 16

plot(x, y)
lines(x, splineModel$fitted.values, col = 'blue3')
points(newData$x, prediction, pch = 3, cex = 3, col = 'red3')
legend("topleft", legend = c("Data", "Fitted Values", "Predicted Value"),
       pch = c(1, NA, 3), col = c('black', 'blue3', 'red3'), lty = c(NA, 1, NA))

As we see, this works perfectly: the prediction at x = 4 is 16, which lies on the fitted curve.

The strangeness happens when one uses the :: operator to explicitly indicate that the bs function is exported from the namespace of the splines package. The following code snippet is identical except for that change:

library(splines)

x <- sort(runif(50, 0, 10))
y <- x^2

splineModel <- lm(y ~ splines::bs(x, y, degree = 3, knots = c(3, 6)))

newData <- data.frame(x = 4) 
prediction <- predict(splineModel, newData) # 6.40171

plot(x, y)
lines(x, splineModel$fitted.values, col = 'blue3')
points(newData$x, prediction, pch = 3, cex = 3, col = 'red3')
legend("topleft", legend = c("Data", "Fitted Values", "Predicted Value"),
       pch = c(1, NA, 3), col = c('black', 'blue3', 'red3'), lty = c(NA, 1, NA))


The exact same results are produced by the second snippet if the splines package is never attached using library in the first place. I cannot think of another situation in which the use of the :: operator on an already-loaded package changes program behaviour.

The same behaviour arises using other functions from splines, such as the natural spline basis implementation ns. Interestingly, in both cases the "y hat" or fitted values are reasonable and match one another. As far as I can tell, the fitted model objects are identical except for the names of attributes.

I have been unable to pin down the source of this behaviour. While this may read like a bug report, my questions are:

  1. Why does this happen? I have been trying to follow through predict.lm but cannot pin down where the divergence occurs.
  2. Is this somehow intended behaviour, and if so, where can I learn more about it?

So the problem is that the model needs to keep track of the knots that were calculated with the original data and use those values when predicting new data. That typically happens in the model.frame() call inside the lm() call. The bs() function returns an object of class "bs", and when making the model.frame, that column is dispatched to splines:::makepredictcall.bs to try to capture the boundary knots. (You can see the makepredictcall calls in the model.frame.default function.)
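The mechanism described above can be sketched by calling the makepredictcall generic directly on a "bs" object. This is only an illustration: the seed and the names basis and pc are assumptions made for the example, and only the plain bs call is shown because its outcome does not depend on the name check discussed in this answer.

```r
library(splines)

set.seed(1)  # assumption: fixed seed, used only to make the sketch reproducible
x <- sort(runif(50, 0, 10))

# Build the basis the same way the formula would.
basis <- bs(x, degree = 3, knots = c(3, 6))
class(basis)  # includes "bs", so makepredictcall dispatches to the bs method

# Ask the method to rewrite the call so it can be re-evaluated on new data.
pc <- stats::makepredictcall(basis, quote(bs(x, degree = 3, knots = c(3, 6))))
print(pc)
```

With a call named bs, the returned call carries the boundary knots computed from the training data, which is what predict needs later on.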

But if we compare the results:

splineModel1 <- lm(y ~ bs(x, y, degree = 3, knots = c(3, 6)))
attr(terms(splineModel1), "predvars")
# list(y, bs(x, degree = 3L, knots = c(3, 6), Boundary.knots =  c(0.275912734214216, 
# 9.14309860439971), intercept = FALSE))

splineModel2 <- lm(y ~ splines::bs(x, y, degree = 3, knots = c(3, 6)))
attr(terms(splineModel2), "predvars")
# list(y, splines::bs(x, y, degree = 3, knots = c(3, 6)))

Notice how the second one doesn't capture the Boundary.knots. This is because the splines:::makepredictcall.bs function actually looks at the name of the call:

function (var, call) {
    if (as.character(call)[1L] != "bs") 
        return(call)
    ...
}

When you use splines::bs in the formula, as.character(call)[1L] returns "splines::bs", which does not match "bs", so nothing happens. It's unclear to me why this check is there. It seems like the method dispatch should be sufficient to assume it's a bs object.
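The string comparison can be checked in isolation on unevaluated calls, without fitting anything. The variable names plain_call and qualified_call are made up for this sketch.

```r
# Two unevaluated calls that differ only in how the function is named.
plain_call     <- quote(bs(x, degree = 3, knots = c(3, 6)))
qualified_call <- quote(splines::bs(x, degree = 3, knots = c(3, 6)))

# The first element of as.character() is the deparsed function part of the call.
plain_name     <- as.character(plain_call)[1L]
qualified_name <- as.character(qualified_call)[1L]

print(plain_name)
print(qualified_name)

# Only the unqualified call passes a test of the form `!= "bs"`.
print(plain_name != "bs")
print(qualified_name != "bs")
```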

In my opinion this does not seem like desired behaviour and probably should be fixed. But the function bs() should not really be called without loading the package, because methods like makepredictcall.bs would not be imported either, so the custom dispatch for those objects would be broken.

It seems to be related to the boundary knot values in the 'predvars' attribute of the 'terms' part of splineModel.

If we call them splineModel_1 and splineModel_2:

predict(splineModel_1, newData)
16
predict(splineModel_2, newData)
6.969746

attr(splineModel_2[["terms"]], "predvars") <- attr(splineModel_1[["terms"]], "predvars")

predict(splineModel_1, newData)
16
predict(splineModel_2, newData)
16

attr(splineModel_1[["terms"]], "predvars")
list(y, bs(x, degree = 3L, knots = c(3, 6), Boundary.knots = c(0.323248628992587, 9.84225275926292), intercept = FALSE))

attr(splineModel_2[["terms"]], "predvars")
list(y, splines::bs(x, y, degree = 3, knots = c(3, 6)))

As you can see, the difference is in the Boundary.knots. The only other difference is intercept = FALSE, which is the default anyway, so that's probably not relevant. The Boundary.knots are taken from the min and max of x. As for them being set by one version of bs and not the other, I can only assume this is a relic in the code of lm that looks for 'bs' and not 'splines::bs' to set the Boundary.knots correctly.
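Since the difference comes down to the missing Boundary.knots, one possible way to sidestep it is to pass the boundary knots explicitly in the formula. This is a sketch under assumptions, not an established fix: the seed and the variable names bk and fit are invented for the example, and the extra y argument from the original call is omitted. The boundary knots are stored under a name that is not a column of newData, so that predict looks up the training-data values instead of recomputing them from the new x.

```r
library(splines)

set.seed(42)  # assumption: fixed seed, used only to make the sketch reproducible
x <- sort(runif(50, 0, 10))
y <- x^2

# Boundary knots from the training data, kept outside newData on purpose.
bk <- range(x)

fit <- lm(y ~ splines::bs(x, degree = 3, knots = c(3, 6), Boundary.knots = bk))

newData <- data.frame(x = 4)
pred <- predict(fit, newData)
print(pred)
```

Writing Boundary.knots = range(x) directly in the formula would not help here, because x would be taken from newData at prediction time.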
