Tidyverse syntax for calculating precision and recall

Question

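I am trying to calculate AUC, precision, recall, and accuracy for each group in my data frame (I have a single data frame that holds the concatenated predictions of three different models).

What is the tidyverse syntax for doing this? I would like to use Max Kuhn's yardstick package to calculate these metrics.

Here is a sample df, and this is where I have got to so far: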

> library(tidyverse)
> library(yardstick)
> 
> sample_df <- data_frame(
+     group_type = rep(c('a', 'b', 'c'), each = 5),  # repeats each element 5 times
+     true_label = as.factor(rbinom(15, 1, 0.3)),    # generates 1 with 30% prob
+     pred_prob = runif(15, 0, 1)                    # generates 15 decimals between 0 and 1 from uniform dist
+ ) %>%
+     mutate(pred_label = as.factor(if_else(pred_prob > 0.5, 1, 0)))
> 
> sample_df
# A tibble: 15 x 4
   group_type true_label pred_prob pred_label
   <chr>      <fct>          <dbl> <fct>     
 1 a          1             0.327  0         
 2 a          1             0.286  0         
 3 a          0             0.0662 0         
 4 a          0             0.993  1         
 5 a          0             0.835  1         
 6 b          0             0.975  1         
 7 b          0             0.436  0         
 8 b          0             0.585  1         
 9 b          0             0.478  0         
10 b          1             0.541  1         
11 c          1             0.247  0         
12 c          0             0.608  1         
13 c          0             0.215  0         
14 c          0             0.937  1         
15 c          0             0.819  1         
>

Metrics:

> # metrics for the full data
> precision(sample_df, truth = true_label, estimate = pred_label)
[1] 0.5714286
> recall(sample_df, truth = true_label, estimate = pred_label)
[1] 0.3636364
> accuracy(sample_df, truth = true_label, estimate = pred_label)
[1] 0.3333333
> roc_auc(sample_df, truth = true_label, pred_prob)
[1] 0.7727273
>

Now, how do I get these metrics for each group in my dataset?

sample_df %>%
    group_by(group_type) %>%
    summarize(???)

Answer 1

An example using unnest:

   sample_df %>% 
     group_by(group_type) %>% 
     do(auc = roc_auc(., true_label, pred_prob),
         acc = accuracy(., true_label, pred_label),
         recall = recall(., true_label, pred_label),
         precision = precision(., true_label, pred_label)) %>% unnest

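However,

I would actually recommend against using yardstick, because it does not play well with dplyr's summarize. In fact, it just uses the ROCR package under the hood. I would simply write my own functions that each accept two variables.

yardstick is flawed in that it requires a data.frame as its first input; it tries to be too clever. Within the dplyr framework, functions called inside summarize and mutate can already see the data frame's variables without an explicit data argument.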
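
A minimal sketch of what such two-vector functions could look like. The helper names (`precision_2v`, `recall_2v`, `accuracy_2v`), the toy data, and the choice of "1" as the positive class are assumptions for illustration; none of them come from yardstick:

```r
library(dplyr)

# Hypothetical helpers: each takes two vectors, so they can be called
# directly inside summarise(). `positive` names the level counted as a hit.
precision_2v <- function(truth, estimate, positive = "1") {
  predicted_pos <- estimate == positive
  # NaN when a group has no positive predictions (0 / 0)
  sum(predicted_pos & truth == positive) / sum(predicted_pos)
}

recall_2v <- function(truth, estimate, positive = "1") {
  actual_pos <- truth == positive
  # NaN when a group has no positive cases (0 / 0)
  sum(actual_pos & estimate == positive) / sum(actual_pos)
}

accuracy_2v <- function(truth, estimate) {
  # compare as character so factors with different level sets still work
  mean(as.character(truth) == as.character(estimate))
}

# Invented toy data with the same column names as the question
toy <- tibble(
  group_type = rep(c("a", "b"), each = 4),
  true_label = factor(c(1, 0, 1, 0, 1, 1, 0, 0)),
  pred_label = factor(c(1, 1, 0, 0, 1, 1, 1, 0))
)

per_group <- toy %>%
  group_by(group_type) %>%
  summarise(
    precision = precision_2v(true_label, pred_label),
    recall    = recall_2v(true_label, pred_label),
    accuracy  = accuracy_2v(true_label, pred_label)
  )
```

Because the helpers only see vectors, they work unchanged with `group_by()` and `summarise()`; the trade-off is that edge cases such as empty classes have to be handled by hand.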

Answer 2

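As others have pointed out, yardstick's functions do not work well with grouped data frames (at least so far). A workaround is to use nested data.

To reduce duplication, it may also be a good idea to write a simple wrapper function that computes all the desired summary metrics in a single call. Here is an example of how to do that: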

reprex::reprex_info()
#> Created by the reprex package v0.1.1.9000 on 2018-02-09

Setup first:

library(tidyverse)
library(yardstick)
set.seed(1)

# Given sample data
sample_df <- data_frame(
    group_type = rep(c('a', 'b', 'c'), each = 5),  # repeats each element 5 times
    true_label = as.factor(rbinom(15, 1, 0.3)),    # generates 1 with 30% prob
    pred_prob = runif(15, 0, 1)                    # generates 15 decimals between 0 and 1 from uniform dist
) %>%
    mutate(pred_label = as.factor(if_else(pred_prob > 0.5, 1, 0)))
#> Warning: package 'bindrcpp' was built under R version 3.3.3

Here is the wrapper:

# Wrapper to calculate several metrics from same data
performance_metrics <- function(data, truth, estimate, prob) {
  metrics <- lst(precision, recall, accuracy)  # these all share arguments
  values <- invoke_map_df(metrics, list(list(data)), truth, estimate)

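  # Pass `data`, not the global sample_df. Referencing sample_df here returns
  # the full-data AUC for every group, which is what the repeated roc_auc
  # value in the printed output reflects.
  roc <- roc_auc(data, truth, prob)  # bit different here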
  bind_cols(values, roc_auc = roc)
}

# Wrap the wrapper with default arguments
metrics <- partial(performance_metrics,
                   truth    = "true_label",
                   estimate = "pred_label",
                   prob     = "pred_prob")

And apply it to the groups via nested data:

sample_df %>% 
  nest(-group_type) %>% 
  mutate(metrics = map(data, metrics)) %>% 
  unnest(metrics)
#> # A tibble: 3 x 6
#>   group_type             data precision    recall accuracy   roc_auc
#>        <chr>           <list>     <dbl>     <dbl>    <dbl>     <dbl>
#> 1          a <tibble [5 x 3]> 0.5000000 0.2500000      0.2 0.5909091
#> 2          b <tibble [5 x 3]> 0.6666667 0.6666667      0.6 0.5909091
#> 3          c <tibble [5 x 3]> 0.7500000 0.7500000      0.6 0.5909091

Answer 3

I managed to do it by splitting the data frame into a list and mapping the function over each list element:

library(tidyverse)
library(yardstick)
sample_df %>%
  split(.$group_type) %>%
  map_dfr(precision, true_label, pred_label) 
#output
## A tibble: 1 x 3
      a     b     c
  <dbl> <dbl> <dbl>
1 0.500 0.667  1.00

It seems that group_by is not yet supported by the yardstick functions.

This also works:

sample_df %>%
  split(.$group_type) %>%
  map_dfr(function(x){
    prec = precision(x, true_label, pred_label)
    rec = recall(x, true_label, pred_label)
    return(data.frame(prec, rec))
  })
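
A variant of the same split-and-map idea that keeps the group name as a column by passing `.id` to `map_dfr`. To stay independent of any particular yardstick version, this sketch computes the metrics by hand; the helper `summarise_group`, the toy data, and the choice of "1" as the positive class are assumptions for illustration:

```r
library(dplyr)
library(purrr)

# Invented toy data with the same column names as the question
toy <- tibble(
  group_type = rep(c("a", "b"), each = 4),
  true_label = factor(c(1, 0, 1, 0, 1, 1, 0, 0)),
  pred_label = factor(c(1, 1, 0, 0, 1, 1, 1, 0))
)

# Hypothetical per-group summary: returns a one-row tibble of metrics
summarise_group <- function(x, positive = "1") {
  true_pos <- sum(x$true_label == positive & x$pred_label == positive)
  tibble(
    precision = true_pos / sum(x$pred_label == positive),
    recall    = true_pos / sum(x$true_label == positive)
  )
}

per_group <- toy %>%
  split(.$group_type) %>%                       # named list, one tibble per group
  map_dfr(summarise_group, .id = "group_type")  # list names become a column
```

The result has one row per group, with the list names from `split()` stored in `group_type`.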

Answer 4

I followed the example at http://r4ds.had.co.nz/many-models.html. It uses nest, and it also uses precision as you asked.

library(tidyverse)
library(yardstick)
sample_df <- data_frame(group_type = rep(c('a', 'b', 'c'), each = 5),  # repeats each element 5 times 
                        true_label = as.factor(rbinom(15, 1, 0.3)),    # generates 1 with 30% prob 
                        pred_prob = runif(15, 0, 1)                    # generates 15 decimals between 0 and 1 from uniform dist 
                        ) %>% 
  mutate(pred_label = as.factor(if_else(pred_prob > 0.5, 1, 0)))

by_group_type <- sample_df %>% group_by(group_type) %>% nest()
stick_m_1 <- function(df){
  precision(df,truth = true_label, estimate = pred_label)
}
models <- map(by_group_type$data,stick_m_1)
models
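
On the group_by limitation mentioned in the answers above: as far as I know, more recent yardstick releases accept grouped data frames directly and return one row per group and metric, with `.metric`, `.estimator`, and `.estimate` columns. A sketch assuming a reasonably recent yardstick; the toy data is invented for illustration:

```r
library(dplyr)
library(yardstick)

# Invented toy data with the same column names as the question
toy <- tibble(
  group_type = rep(c("a", "b"), each = 4),
  true_label = factor(c(1, 0, 1, 0, 1, 1, 0, 0)),
  pred_label = factor(c(1, 1, 0, 0, 1, 1, 1, 0))
)

# Bundle the class metrics that share the truth/estimate interface
class_metrics <- metric_set(precision, recall, accuracy)

per_group <- toy %>%
  group_by(group_type) %>%
  class_metrics(truth = true_label, estimate = pred_label)
```

I believe these versions treat the first factor level (here "0") as the event by default, so check which level is counted as positive before comparing against hand-computed values.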

4 answers

Answer 1
1 accepted 2018-02-09 19:38:15

Answer 2
1 2018-02-09 21:01:38

Answer 3
0 2018-02-09 19:29:48

Answer 4
0 2018-02-09 20:00:44
