
Predict function in R's MLR yielding results inconsistent with predict

I'm using the mlr package's framework to build an SVM model to predict landcover classes in an image. I used the raster package's predict function, and I also converted the raster to a data frame and then predicted on that data frame using the "learner.model" as input. Both methods gave me realistic results.

These work well:

> predict(raster, mod$learner.model)

or

> xy <- as.data.frame(raster, xy = T)

> C <- predict(mod$learner.model, xy)

However, if I predict on the data frame derived from the raster without specifying the learner.model, the results are not the same.

> C2 <- predict(mod, newdata=xy)

C2$data$response is not identical to C. Why?


Here is a reproducible example that demonstrates the problem:

> library(mlr)
 > library(kernlab)
 > x1 <- rnorm(50)
 > x2 <- rnorm(50, 3)
 > x3 <- rnorm(50, -20, 3)
 > C <- sample(c("a","b","c"), 50, T)
 > d <-  data.frame(x1, x2, x3, C)
 > classif <- makeClassifTask(id = "example", data = d, target = "C")
 > lrn <- makeLearner("classif.ksvm", predict.type = "prob", fix.factors.prediction = T)
 > t <- train(lrn, classif)

 Using automatic sigma estimation (sigest) for RBF or laplace kernel

 > res1 <- predict(t, newdata = data.frame(x2,x1,x3))
 > res1

 Prediction: 50 observations
 predict.type: prob
 threshold: a=0.33,b=0.33,c=0.33
 time: 0.01
      prob.a    prob.b    prob.c response
 1 0.2110131 0.3817773 0.4072095        c
 2 0.1551583 0.4066868 0.4381549        c
 3 0.4305353 0.3092737 0.2601910        a
 4 0.2160050 0.4142465 0.3697485        b
 5 0.1852491 0.3789849 0.4357659        c
 6 0.5879579 0.2269832 0.1850589        a

 > res2 <- predict(t$learner.model, data.frame(x2,x1,x3))
 > res2
  [1] c c a b c a b a c c b c b a c b c a a b c b c c a b b b a a b a c b a c c c
 [39] c a a b c b b b b a b b
 Levels: a b c
 > res1$data$response == res2
  [1]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE FALSE
 [13]  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE
 [25]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE
 [37]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
 [49]  TRUE FALSE

The predictions are not identical. Following mlr's tutorial page on prediction, I don't see why the results would differ. Thanks for your help.

-----

Update: When I do the same with a random forest model, the two vectors are equal. Is this because SVM is scale-dependent and random forest is not?

 > library(randomForest)

 > classif <- makeClassifTask(id = "example", data = d, target = "C")
 > lrn <- makeLearner("classif.randomForest", predict.type = "prob", fix.factors.prediction = T)
 > t <- train(lrn, classif)
 >
 > res1 <- predict(t, newdata = data.frame(x2,x1,x3))
 > res1
 Prediction: 50 observations
 predict.type: prob
 threshold: a=0.33,b=0.33,c=0.33
 time: 0.00
   prob.a prob.b prob.c response
 1  0.654  0.228  0.118        a
 2  0.742  0.090  0.168        a
 3  0.152  0.094  0.754        c
 4  0.092  0.832  0.076        b
 5  0.748  0.100  0.152        a
 6  0.680  0.098  0.222        a
 >
 > res2 <- predict(t$learner.model, data.frame(x2,x1,x3))
 > res2
  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
  a  a  c  b  a  a  a  c  a  b  b  b  b  c  c  a  b  b  a  c  b  a  c  c  b  c
 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50
  a  a  b  a  c  c  c  b  c  b  c  a  b  c  c  b  c  b  c  a  c  c  b  b
 Levels: a b c
 >
 > res1$data$response == res2
  [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
 [16] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
 [31] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
 [46] TRUE TRUE TRUE TRUE TRUE

-----

Another update: If I change predict.type from "prob" to "response", the two SVM prediction vectors agree with each other. I'm going to look into the difference between these types; I had thought that "prob" gave the same class predictions but also provided probabilities. Maybe that isn't the case?

As you found out, the source of the "error" is that mlr and kernlab have different defaults for the type of prediction.

mlr maintains quite a bit of internal "state" and, for each learner, keeps track of that learner's parameters and how training and prediction are handled. You can get the type of prediction a learner will make with lrn$predict.type, which in your case gives "prob". If you want to know all the gory details, have a look at the implementation of classif.ksvm.
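A minimal sketch of inspecting and changing this state (setPredictType is mlr's helper for switching an existing learner between "response" and "prob"):

```r
library(mlr)

# The wrapped learner records which kind of prediction it will make.
lrn <- makeLearner("classif.ksvm", predict.type = "prob")
lrn$predict.type          # "prob"

# Switch back to plain class-label predictions without rebuilding the learner.
lrn2 <- setPredictType(lrn, "response")
lrn2$predict.type         # "response"
```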

Mixing mlr-wrapped learners with the "raw" underlying learners, as in your example, is not recommended, and it shouldn't be necessary. If you do mix them, inconsistencies like the one you found will happen -- so when using mlr, use only the mlr constructs to train models, make predictions, and so on.
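A self-contained sketch of staying entirely inside mlr (synthetic data standing in for the question's raster-derived data frame; the wrapped model is never unwrapped via $learner.model):

```r
library(mlr)

set.seed(1)
d <- data.frame(x1 = rnorm(50), x2 = rnorm(50),
                C  = factor(sample(c("a", "b"), 50, replace = TRUE)))

task <- makeClassifTask(data = d, target = "C")
lrn  <- makeLearner("classif.ksvm", predict.type = "prob")
mod  <- train(lrn, task)

# Predict through the mlr wrapper only.
pred <- predict(mod, newdata = d)            # an mlr Prediction object
head(pred$data$response)                     # predicted classes
head(getPredictionProbabilities(pred))       # per-class probabilities
```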

mlr does have tests to make sure that the "raw" and the wrapped learner produce the same output; see, for example, the one for classif.ksvm.

The answer lies here:

Why are probabilities and response in ksvm in R not consistent?

In short, ksvm with type = "probabilities" gives different results than type = "response".

If I run

 > res2 <- predict(t$learner.model, data.frame(x2,x1,x3), type = "probabilities")
 > res2

then I get the same result as res1 above (type = "response" is the default).
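The discrepancy can be reproduced with kernlab alone. In a minimal sketch (synthetic data, my own variable names), the class labels from type = "response" come from the SVM decision values, while type = "probabilities" comes from a separately fitted Platt-scaling model, so the argmax of the probabilities need not match the response labels:

```r
library(kernlab)

set.seed(1)
d <- data.frame(x = rnorm(60),
                y = factor(sample(c("a", "b"), 60, replace = TRUE)))

# prob.model = TRUE fits the additional probability (Platt-scaling) model.
m <- ksvm(y ~ x, data = d, prob.model = TRUE)

resp  <- predict(m, d, type = "response")       # labels from decision values
probs <- predict(m, d, type = "probabilities")  # matrix of class probabilities

# Compare the response labels with the argmax of the probabilities;
# agreement is typically high but not guaranteed to be perfect.
lab_from_prob <- colnames(probs)[max.col(probs)]
mean(lab_from_prob == as.character(resp))
```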

Unfortunately, classifying an image based on the probabilities doesn't seem to perform as well as using the "response". Maybe the probabilities are still the best way to estimate the certainty of a classification?
