How to use a custom function in a dplyr group_by

Question

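I am trying to use a custom function (which returns a scalar) inside summarise, or more generally in a dplyr pipeline that includes group_by. The function works when I call it directly, but as soon as it follows a group_by it fails.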

See my code below for what I tried. I managed to get it working, but in a way that feels hacky. I would like to understand why it does not work the way I expected:

Load the data and define the Gini coefficient function

## Load required libraries
library(dplyr)
library(tidyr)
library(ROCR)

set.seed(0)
## Generate fake data
df1 <- data.frame(predictions = seq(0,1,.01), date = seq(as.Date("2009-01-01"), by = "month", length.out = 101), labels = sample(c(0,1), replace=TRUE, size=101), grouping = rep('a',101))
df2 <- data.frame(predictions = seq(0,1,.01), date = seq(as.Date("2010-01-01"), by = "month", length.out = 101), labels = sample(c(0,1), replace=TRUE, size=101), grouping = rep('b',101))

df <- rbind(df1,df2)

## Gini coefficient calculation function
dplyr_Gini <- function(df, predictions, labels, label.ordering = NULL,...){
  predictions = enquo(predictions)
  labels = enquo(labels)

  predictions <- df %>% select(!!predictions) %>% .[[1]]
  labels <- df %>% select(!!labels) %>% .[[1]]

  if(length(unique(labels)) != 2){
    return(NA)
  }

  pred <- prediction(predictions, labels, label.ordering)
  auc.perf = performance(pred, measure = "auc")
  gini =  2*auc.perf@y.values[[1]] - 1
  return(gini)
}

## test dplyr_Gini - works as expected
dplyr_Gini(df1,predictions, labels)
> [1] -0.05494505
dplyr_Gini(df2,predictions, labels)
> [1] 0.09456265

Not working: using dplyr_Gini after group_by.

## Wrapper function for using dplyr_Gini in group_by
calc_Gini <- function(df, group, predictions, labels){
  predictions <- enquo(predictions)
  labels = enquo(labels)

  df %>% filter(grouping %in% group) %>%
    group_by(grouping) %>% 
    summarise(min.date = min(date),
              max.date = max(date),
              Gini = dplyr_Gini(.,!!predictions, !!labels)) %>% 
    ungroup()  
}

calc_Gini(df,group = c('a','b'),predictions, labels)
> # Adding missing grouping variables: `grouping`
> # Adding missing grouping variables: `grouping`
> # Error in prediction(predictions, labels, label.ordering) : 
> # Format of predictions is invalid.

Working: using dplyr_Gini with do and unnest after group_by

## Wrapper function that works for using dplyr_Gini in group_by - but is kind of hacky.
calc_Gini_working <- function(df, group, predictions, labels){
  predictions <- enquo(predictions)
  labels = enquo(labels)

  df %>% filter(grouping %in% group) %>%
    group_by(grouping) %>% 
    mutate(min.date = min(date),
              max.date = max(date)) %>% 
    group_by(grouping, min.date, max.date) %>% 
    do(Gini = dplyr_Gini(.,!!predictions, !!labels)) %>% 
    unnest() %>% 
    ungroup()

}

calc_Gini_working(df,group = c('a','b'),predictions, labels)
>
# A tibble: 2 x 4
  grouping min.date   max.date      Gini
  <fct>    <date>     <date>       <dbl>
1 a        2009-01-01 2017-05-01 -0.0549
2 b        2010-01-01 2018-05-01  0.0946

I was under the impression that the calc_Gini function should work, because all I do is call a custom function (dplyr_Gini) inside a summarise after group_by.

As you can see, it works if I wrap dplyr_Gini in a do and then unnest the result, but I do not know why.

Answer 1

Given how dplyr_Gini is constructed, one option is to group_split and then use map

library(tidyverse)
calc_Gini <- function(df, group, predictions, labels){
  predictions <- enquo(predictions)
  labels = enquo(labels)

  df %>% filter(grouping %in% group) %>%
    group_split(grouping, remove = FALSE) %>% 
    map_dfr(., ~               
              tibble(grouping = first(.x$grouping), min.date = min(.x$date), 
                     max.date = max(.x$date), 
                     Gini = dplyr_Gini(.x, !!predictions, !!labels)))

}


calc_Gini(df,group = c('a','b'),predictions, labels)
# A tibble: 2 x 4
#  grouping min.date   max.date      Gini
#  <fct>    <date>     <date>       <dbl>
#1 a        2009-01-01 2017-05-01 -0.0549
#2 b        2010-01-01 2018-05-01  0.0946
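
As for why the original calc_Gini fails: inside summarise, the `.` supplied by the pipe is the whole grouped data frame, not the current group's slice. When dplyr_Gini then calls select on that grouped data frame, dplyr keeps the grouping column and puts it first (hence the "Adding missing grouping variables" message), so `.[[1]]` returns `grouping` rather than the predictions, and prediction() rejects it. The sketch below illustrates this on made-up data and shows one possible alternative: a helper that takes plain vectors, which summarise can evaluate per group. The names `toy`, `gini_vec` and `calc_Gini_summarise` are hypothetical and not part of the original post.

```r
library(dplyr)
library(ROCR)

# Made-up toy data in the same shape as the question's df
toy <- data.frame(
  predictions = rep((1:20) / 20, times = 2),
  labels      = c(rep(c(0, 1), times = 10), rep(c(0, 1), each = 10)),
  date        = rep(seq(as.Date("2009-01-01"), by = "month", length.out = 20), times = 2),
  grouping    = rep(c("a", "b"), each = 20)
)

# 1) The symptom: select() on a grouped data frame re-adds the grouping
#    column in front, so the first column is no longer `predictions`.
first_col <- names(toy %>% group_by(grouping) %>% select(predictions))[1]

# 2) A vector-based variant of the Gini function (hypothetical helper).
#    It never sees the data frame, only the columns for the current group.
gini_vec <- function(predictions, labels, label.ordering = NULL) {
  if (length(unique(labels)) != 2) {
    return(NA_real_)
  }
  pred <- prediction(predictions, labels, label.ordering)
  auc  <- performance(pred, measure = "auc")@y.values[[1]]
  2 * auc - 1
}

# 3) Wrapper in the style of the question: the quoted column names are
#    unquoted inside summarise, where they resolve to per-group vectors.
calc_Gini_summarise <- function(df, group, predictions, labels) {
  predictions <- enquo(predictions)
  labels      <- enquo(labels)

  df %>%
    filter(grouping %in% group) %>%
    group_by(grouping) %>%
    summarise(min.date = min(date),
              max.date = max(date),
              Gini     = gini_vec(!!predictions, !!labels)) %>%
    ungroup()
}

res <- calc_Gini_summarise(toy, group = c("a", "b"), predictions, labels)
```

The design choice here is to let summarise do the per-group slicing and keep the custom function ignorant of grouping, which avoids do and unnest. If the function has to keep taking a data frame, the group_split and map approach above stays the more direct fit.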

1 solution

Solution 1
0 2019-04-30 03:00:07
