yardstick::rmse on grouped data returns error and incorrect results

I wanted to evaluate the performance of several regression models and used the yardstick package to calculate the RMSE. Here is some example data:

  model obs pred
1     A   1    1
2     B   1    2
3     C   1    3

When I run the following code:

library(yardstick)
library(dplyr)
dat %>%
 group_by(model) %>%
 summarise(RMSE = yardstick::rmse(truth = obs, estimate = pred))

I get the following error:

Error in summarise_impl(.data, dots) : no applicable method for 'rmse' applied to an object of class "c('double', 'numeric')".

However, when I explicitly supply . as the first argument (which I thought should not be necessary), I get no error, but the results are incorrect:

dat %>%
 group_by(model) %>%
 summarise(RMSE = yardstick::rmse(., truth = obs, estimate = pred))
# A tibble: 3 x 2
  model   RMSE
  <fctr> <dbl>
1 A       1.29
2 B       1.29
3 C       1.29

I was expecting the following:

# A tibble: 3 x 2
  model   RMSE
  <fctr> <dbl>
1 A       0
2 B       1.00
3 C       2.00

I know that there are alternatives to this function, but I still don't understand this behavior.

Data:

dat <- structure(list(model = structure(1:3, .Label = c("A", "B", "C"), class = "factor"), obs = c(1, 1, 1), pred = 1:3), .Names = c("model", "obs", "pred"), row.names = c(NA, -3L), class = "data.frame")

Based on the help page ?yardstick::rmse, it looks like it expects a data frame as its first argument, which explains the error you're getting.
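That would also seem to account for the 1.29 in the question: inside the pipe, . stands for the whole grouped data frame, so the RMSE appears to be computed over all three rows each time rather than once per model. Here is a minimal base-R sketch of that arithmetic. Note that rmse_manual is a hypothetical helper that writes the formula out by hand; it is not yardstick's own code.

```r
# Example data from the question
dat <- data.frame(
  model = factor(c("A", "B", "C")),
  obs   = c(1, 1, 1),
  pred  = 1:3
)

# RMSE written out by hand (hypothetical helper, not yardstick's implementation)
rmse_manual <- function(truth, estimate) {
  sqrt(mean((truth - estimate)^2))
}

# Over all rows at once: sqrt((0 + 1 + 4) / 3), which rounds to 1.29
overall <- rmse_manual(dat$obs, dat$pred)

# Per model: split the rows by group, then apply the same formula to each piece
per_model <- sapply(
  split(dat, dat$model),
  function(d) rmse_manual(d$obs, d$pred)
)
```

Here overall reproduces the repeated value from the question, while per_model gives the 0, 1, 2 that was expected.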

I'm not quite up to speed on that new package, but it seems that the function expects to calculate a summary statistic across a whole data frame, rather than row by row. To force it to run row by row, you'd need to make it treat each row as its own data frame, and apply the function within each of those data frames:

library(tidyverse)
dat %>%
  group_by(model) %>%
  nest() %>% 
  mutate(rmse_res = map(data, rmse, truth = obs, estimate = pred)) %>% 
  unnest(rmse_res)

# A tibble: 3 x 3
  model  data              rmse
  <fctr> <list>           <dbl>
1 A      <tibble [1 x 2]>  0   
2 B      <tibble [1 x 2]>  1.00
3 C      <tibble [1 x 2]>  2.00
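For what it's worth, later yardstick releases should make the nesting unnecessary. As far as I know, rmse() there respects dplyr groups when given a grouped data frame, and a vector version, rmse_vec(), can be called inside summarise(). The sketch below assumes a reasonably recent yardstick and dplyr; it would not have worked with the version used in the question.

```r
library(dplyr)
library(yardstick)

# Example data from the question
dat <- data.frame(
  model = factor(c("A", "B", "C")),
  obs   = c(1, 1, 1),
  pred  = 1:3
)

# Assumption: a recent yardstick, where rmse() computes one result per group
# and returns it in the .estimate column
by_group <- dat %>%
  group_by(model) %>%
  rmse(truth = obs, estimate = pred)

# Assumption: rmse_vec() is available; it takes plain vectors,
# so it fits inside summarise()
by_vec <- dat %>%
  group_by(model) %>%
  summarise(RMSE = rmse_vec(truth = obs, estimate = pred))
```

The first form keeps yardstick's own output layout, while the second matches the shape of the summarise() call in the question.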

We can use the do function to apply the rmse function to every group.

dat %>%
  group_by(model) %>%
  do(data_frame(model = .$model[1], obs = .$obs[1], pred = .$pred[1], 
     RMSE = yardstick::rmse(., truth = obs, estimate = pred)))
# # A tibble: 3 x 4
# # Groups: model [3]
# model    obs  pred  RMSE
#  <fctr> <dbl> <int> <dbl>
# 1 A       1.00     1  0   
# 2 B       1.00     2  1.00
# 3 C       1.00     3  2.00

Or we can split the data frame by model and apply the rmse function to each piece.

dat %>%
  mutate(RMSE = dat %>%
           split(.$model) %>%
           sapply(yardstick::rmse, truth = obs, estimate = pred))
#   model obs pred RMSE
# 1     A   1    1    0
# 2     B   1    2    1
# 3     C   1    3    2

Or we can nest the obs and pred columns into a list column and then apply the rmse function.

library(tidyr)

dat %>%
  nest(obs, pred) %>%
  mutate(RMSE = sapply(data, yardstick::rmse, truth = obs, estimate = pred)) %>%
  unnest()
#   model RMSE obs pred
# 1     A    0   1    1
# 2     B    1   1    2
# 3     C    2   1    3

The outputs of these three methods are a little different, but all contain the correct RMSE values. Here I use the microbenchmark package to compare their performance.

library(microbenchmark)

microbenchmark(m1 = {dat %>%
    group_by(model) %>%
    do(data_frame(model = .$model[1], obs = .$obs[1], pred = .$pred[1], 
                  RMSE = yardstick::rmse(., truth = obs, estimate = pred)))},
    m2 = {dat %>%
        mutate(RMSE = dat %>%
                 split(.$model) %>%
                 sapply(yardstick::rmse, truth = obs, estimate = pred))},
    m3 = {dat %>%
        nest(obs, pred) %>%
        mutate(RMSE = sapply(data, yardstick::rmse, truth = obs, estimate = pred)) %>%
        unnest()})

# Unit: milliseconds
# expr      min       lq     mean   median       uq       max neval
#   m1 43.18746 46.71055 50.23383 48.46554 51.05639 174.46371   100
#   m2 14.08516 14.78093 16.14605 15.74505 16.89936  24.02136   100
#   m3 28.99795 30.90407 32.71092 31.89954 33.94729  44.57953   100

The result shows that m2 is the fastest, while m1 is the slowest. I think the implication is that do is usually slower than the other methods, so we should avoid it where possible. Although m2 is the fastest, I personally like the syntax of m3 best. The nested data frame makes it easy to summarize information across different models or groups.
