yardstick::rmse on grouped data returns error and incorrect results
I wanted to evaluate the performance of several regression models and used the yardstick package to calculate the RMSE. Here is some example data:
  model obs pred
1     A   1    1
2     B   1    2
3     C   1    3
When I run the following code:
library(yardstick)
library(dplyr)

dat %>%
  group_by(model) %>%
  summarise(RMSE = yardstick::rmse(truth = obs, estimate = pred))
I get the following error:
Error in summarise_impl(.data, dots) : no applicable method for 'rmse' applied to an object of class "c('double', 'numeric')"
However, when I explicitly supply . as the first argument (which I thought should not be necessary), I get no error, but the results are incorrect:
dat %>%
  group_by(model) %>%
  summarise(RMSE = yardstick::rmse(., truth = obs, estimate = pred))
# A tibble: 3 x 2
   model  RMSE
  <fctr> <dbl>
1      A  1.29
2      B  1.29
3      C  1.29
I was expecting the following:
# A tibble: 3 x 2
   model  RMSE
  <fctr> <dbl>
1      A  0
2      B  1.00
3      C  2.00
I know that there are alternatives to this function, but I still don't understand this behavior.
Data:
dat <- structure(list(model = structure(1:3, .Label = c("A", "B", "C"), class = "factor"), obs = c(1, 1, 1), pred = 1:3), .Names = c("model", "obs", "pred"), row.names = c(NA, -3L), class = "data.frame")
Based on the help page ?yardstick::rmse, it looks like it expects a data frame as its first argument, which explains the error you're getting. I'm not quite up to speed on that new package, but it seems that the function expects to calculate a summary statistic across a whole data frame, rather than a row-by-row calculation. To force it to run row by row, you'd need to make it think that each row is its own data frame, and apply the function within each of those data frames:
library(tidyverse)
library(yardstick)  # provides rmse()

dat %>%
  group_by(model) %>%
  nest() %>%
  mutate(rmse_res = map(data, rmse, truth = obs, estimate = pred)) %>%
  unnest(rmse_res)
# A tibble: 3 x 3
   model             data  rmse
  <fctr>           <list> <dbl>
1      A <tibble [1 x 2]>  0
2      B <tibble [1 x 2]>  1.00
3      C <tibble [1 x 2]>  2.00
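To see where the 1.29 in the question comes from, here is a rough base-R sketch. As far as I can tell, the . inside summarise() stands for the whole data frame coming down the pipe, not for the current group, so every group receives the RMSE over all three rows. The helper rmse_manual is hypothetical (it is not part of yardstick) and just spells out the textbook formula.

```r
# Example data from the question
dat <- data.frame(model = factor(c("A", "B", "C")),
                  obs   = c(1, 1, 1),
                  pred  = 1:3)

# Hypothetical helper (not part of yardstick): RMSE of two numeric vectors
rmse_manual <- function(truth, estimate) sqrt(mean((truth - estimate)^2))

# RMSE over all three rows at once: sqrt((0 + 1 + 4) / 3), roughly 1.29
overall <- rmse_manual(dat$obs, dat$pred)

# RMSE within each model, via base R split() and sapply()
per_model <- sapply(split(dat, dat$model),
                    function(d) rmse_manual(d$obs, d$pred))

print(round(overall, 2))
print(per_model)
```

The per-model values (0, 1, 2) match what the question expected, while the overall value matches the 1.29 that was repeated for every group.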
We can use the do function to apply the rmse function to every group.
dat %>%
  group_by(model) %>%
  do(data_frame(model = .$model[1], obs = .$obs[1], pred = .$pred[1],
                RMSE = yardstick::rmse(., truth = obs, estimate = pred)))
# # A tibble: 3 x 4
# # Groups: model [3]
#    model   obs  pred  RMSE
#   <fctr> <dbl> <int> <dbl>
# 1      A  1.00     1  0
# 2      B  1.00     2  1.00
# 3      C  1.00     3  2.00
Or we can split the data frame and apply the rmse function.
dat %>%
  mutate(RMSE = dat %>%
           split(.$model) %>%
           sapply(yardstick::rmse, truth = obs, estimate = pred))
#   model obs pred RMSE
# 1     A   1    1    0
# 2     B   1    2    1
# 3     C   1    3    2
Or we can nest the obs and pred columns into a list column and then apply the rmse function.
library(tidyr)

dat %>%
  nest(obs, pred) %>%
  mutate(RMSE = sapply(data, yardstick::rmse, truth = obs, estimate = pred)) %>%
  unnest()
#   model RMSE obs pred
# 1     A    0   1    1
# 2     B    1   1    2
# 3     C    2   1    3
The outputs of these three methods are a little different, but all contain the right RMSE calculation. Here I use the microbenchmark package to conduct a performance evaluation.
library(microbenchmark)

microbenchmark(
  m1 = {
    dat %>%
      group_by(model) %>%
      do(data_frame(model = .$model[1], obs = .$obs[1], pred = .$pred[1],
                    RMSE = yardstick::rmse(., truth = obs, estimate = pred)))
  },
  m2 = {
    dat %>%
      mutate(RMSE = dat %>%
               split(.$model) %>%
               sapply(yardstick::rmse, truth = obs, estimate = pred))
  },
  m3 = {
    dat %>%
      nest(obs, pred) %>%
      mutate(RMSE = sapply(data, yardstick::rmse, truth = obs, estimate = pred)) %>%
      unnest()
  }
)
# Unit: milliseconds
#  expr      min       lq     mean   median       uq       max neval
#    m1 43.18746 46.71055 50.23383 48.46554 51.05639 174.46371   100
#    m2 14.08516 14.78093 16.14605 15.74505 16.89936  24.02136   100
#    m3 28.99795 30.90407 32.71092 31.89954 33.94729  44.57953   100
The result shows that m2 is the fastest, while m1 is the slowest. I think the implication is that the do operation is usually slower than other methods, so if possible, we should avoid it. Although m2 is the fastest, personally I like the syntax of m3 the best. The nested data frame will allow us to easily summarize information between different models or different groups.
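For completeness, here is a minimal sketch of what the original summarise() attempt was probably after. It assumes a yardstick version newer than the one used in this thread, one that exports the vector function rmse_vec() and whose data frame method rmse() respects dplyr groups. If your installed version lacks either, treat this as an illustration rather than a drop-in fix.

```r
library(dplyr)
library(yardstick)

# Example data from the question
dat <- data.frame(model = factor(c("A", "B", "C")),
                  obs   = c(1, 1, 1),
                  pred  = 1:3)

# Vector interface: works inside summarise() because it takes plain vectors
# (assumes rmse_vec() is available in the installed yardstick)
res_vec <- dat %>%
  group_by(model) %>%
  summarise(RMSE = rmse_vec(truth = obs, estimate = pred))

# Data frame interface on grouped data: one row per group, with the value
# in the .estimate column (assumes group-aware rmse())
res_df <- dat %>%
  group_by(model) %>%
  rmse(truth = obs, estimate = pred)

print(res_vec)
print(res_df)
```

Both calls skip do(), split() and nest() entirely, so they are worth timing against m1 to m3 on your own data before drawing conclusions about speed.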