`purrr` alternative to a row-wise function for determining event dates from a complex set of rules

Question

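I'm working with a client who wants to provide an input spreadsheet containing text descriptions of when certain events should occur in a given year. Each event (there are at least 200) is a separate row with a complex rule about when it occurs, e.g. "the first Saturday before October 1" or "the Friday closest to December 1". A few events simply occur on a specific date, but that is rare. The actual spreadsheet, however, has about 15 columns that control each event's start date, so the logic I need to calculate the start date runs quite deep.

I've come up with a way to calculate the start dates using a function and a loop that iterates over each row of my data.frame, but I'm wondering whether there is a more efficient tidyverse or purrr solution. Is it possible (or advisable) to vectorize the solution to this problem?

Here is my current (working) solution for the smallest, most compact example I could come up with. Can I make it more efficient and more readable for the more complex real-world input?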

library(tidyverse)
library(lubridate)

# Bring in demo data that describes 3 events, and when they should each start.

demo <- structure(list(Event = c("Gala", "Celebration", "Wrap-up"), date_start
= structure(c(18871, NA, NA), class = "Date"), weekday_near = c(NA,
"Saturday", "Friday" ), near_description = c(NA, "before", "closest to"),
near_date = structure(c(NA, 18901, 18962), class = "Date")), row.names = c(NA,
-3L), class = c("tbl_df", "tbl", "data.frame"))

Here is what the demo data looks like:

Event       date_start weekday_near near_description near_date 
Gala        2021-09-01 NA           NA               NA        
Celebration NA         Saturday     before           2021-10-01
Wrap-up     NA         Friday       closest to       2021-12-01

Now, determine the start date for each event: Gala, Celebration, and Wrap-up.

# Create a tibble that contains all possible dates for these events this year.

datedb <- tibble(date = seq(make_date(2021, 9, 1), make_date(2021, 12, 31), by = 1),
                 wday = wday(date, label = TRUE, abbr = FALSE))


# Write function meant to determine event date for each row of the dataframe.

determine_date <- function(df){
  
  # define variables that are easier to read
  # this part makes me squeamish - 
  # there's gotta be a better way to do this with the tidyverse
  event_date_exact <- df[["date_start"]]
  event_near_wday <- df[["weekday_near"]]
  event_near_desc <- df[["near_description"]]
  event_near_date <- df[["near_date"]]
  
  # Event date - if there is an exact date for the event, choose it as the event date.
  if (!is.na(event_date_exact)) {
    event_date <- event_date_exact
  
  # Otherwise, if the date is dependent on another date, figure out when it should be:
  } else {
    event_date_vec <- datedb %>% filter(wday == event_near_wday) %>% pull(date)
    event_date <- 
      case_when(
        # If you're looking for the closest weekday to a particular date:
        event_near_desc == "closest to" ~ event_date_vec[which(abs(event_date_vec - event_near_date) == 
                         min(abs(event_date_vec - event_near_date), na.rm = TRUE))],
        # If you're looking for the first weekday before that weekday
        event_near_desc == "before" ~ rev(event_date_vec[which(event_date_vec - event_near_date < 0)])[1],
        # If neither of these worked, output NA to check why 
        TRUE ~ NA_Date_
      )
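  }

  # Return the result explicitly; an assignment as the last
  # expression is only returned invisibly.
  event_date
}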

# create empty vector to store results
start_dates <- lubridate::ymd()

for (i in seq_len(nrow(demo))) {
  start_dates[i] <- determine_date(demo[i,])
}

# add start dates back to original demo dataframe
demo$start_date <- start_dates

Desired output:

Note the new start_date column.

demo

Event       date_start weekday_near near_description near_date    start_date
Gala        2021-09-01 NA           NA               NA           2021-09-01
Celebration NA         Saturday     before           2021-10-01   2021-09-25
Wrap-up     NA         Friday       closest to       2021-12-01   2021-12-03

Answer 1

If you `Vectorize` a function, all that really happens is a call to mapply. So, if you want purrr-style code, you probably just need to change your function's arguments as follows:

Setup:

library(tidyverse)
library(lubridate)
#> 
#> Attaching package: 'lubridate'
#> The following objects are masked from 'package:base':
#> 
#>     date, intersect, setdiff, union

# Bring in demo data that describes 3 events, and when they should each start.

demo <- structure(list(Event = c("Gala", "Celebration", "Wrap-up"), date_start
= structure(c(18871, NA, NA), class = "Date"), weekday_near = c(NA,
"Saturday", "Friday" ), near_description = c(NA, "before", "closest to"),
near_date = structure(c(NA, 18901, 18962), class = "Date")), row.names = c(NA,
-3L), class = c("tbl_df", "tbl", "data.frame"))

datedb <- tibble(date = seq(make_date(2021, 9, 1), make_date(2021, 12, 31), by = 1),
                 wday = wday(date, label = TRUE, abbr = FALSE))

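Here is a refactored version of the function.

Whether you use case_when or a switch statement is really up to you. I chose switch because this function is meant to be called inside a pmap call, i.e. we expect it to check only a single value at a time.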

#write a function that expects 4 input values
#vectorize/pmap over each.
determine_date2 <- function(date_start, weekday_near, near_desc, near_date){
  event_vec <- datedb %>% filter(wday == weekday_near) %>% pull(date)
  event_date <-
    if(!is.na(date_start)){
      date_start
    } else if(!is.na(near_desc)){
      switch(
        near_desc,
        `closest to` = event_vec[which(abs(event_vec - near_date) == min(abs(event_vec - near_date), na.rm = TRUE))],
        before = rev(event_vec[which(event_vec - near_date < 0)])[1],
        NA_Date_
      )
    } else {
      NA_Date_
    }
  event_date
}
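To illustrate the single-value point about switch, here is a minimal base R sketch. The helper `describe_rule` and its labels are hypothetical and exist only for this illustration; they are not part of the original code.

```r
# Hypothetical helper, for illustration only: map a rule description to a label.
describe_rule <- function(desc) {
  switch(desc,
         `closest to` = "nearest matching weekday",
         before       = "last matching weekday before the date",
         "unknown rule")  # the unnamed final argument acts as the default
}

describe_rule("before")
describe_rule("after")  # no matching name, so the default is used

# switch() only accepts a length-1 value, so passing a vector is an error ...
rules  <- c("before", "closest to")
failed <- inherits(try(describe_rule(rules), silent = TRUE), "try-error")

# ... which is why a vector of rules has to be mapped element by element.
labels <- vapply(rules, describe_rule, character(1), USE.NAMES = FALSE)
```

This is the reason the function is paired with a pmap-style call: each invocation sees exactly one row's values.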

Actually, I just discovered that there is no pmap_date variant, but what I wrote below should be a sufficient substitute.

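pmap_date <- function(.l, .f, ...){
  res <- pmap(.l, .f, ...)
  # every result must be a single Date value
  check_res <- map_lgl(res, ~is.Date(.x) && length(.x) == 1)
  if(!all(check_res)){
    # report the positions of the entries that failed the check
    offending <- paste(which(!check_res), collapse = ", ")
    rlang::abort(paste0("all results must return a scalar date. offending entries: ", offending))
  }
  
  # combine the list of single Dates into one Date vector
  return(reduce(res, c))
}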

Now we should be able to use pmap_date inside a mutate call:

demo %>%
  mutate(
    start_dates = pmap_date(list(date_start, weekday_near, near_description, near_date), determine_date2)
  )
#> # A tibble: 3 x 6
#>   Event       date_start weekday_near near_description near_date  start_dates
#>   <chr>       <date>     <chr>        <chr>            <date>     <date>     
#> 1 Gala        2021-09-01 <NA>         <NA>             NA         2021-09-01 
#> 2 Celebration NA         Saturday     before           2021-10-01 2021-09-25 
#> 3 Wrap-up     NA         Friday       closest to       2021-12-01 2021-12-03

If you like, you can write a "vectorized" wrapper function, much as if you had called the Vectorize function yourself:

v_determine_date2 <- function(date_start, weekday_near, near_desc, near_date) {
  pmap_date(list(date_start, weekday_near, near_desc, near_date), determine_date2)
}

demo %>%
  mutate(
    start_dates = v_determine_date2(date_start, weekday_near, near_description, near_date)
  )
#> # A tibble: 3 x 6
#>   Event       date_start weekday_near near_description near_date  start_dates
#>   <chr>       <date>     <chr>        <chr>            <date>     <date>     
#> 1 Gala        2021-09-01 <NA>         <NA>             NA         2021-09-01 
#> 2 Celebration NA         Saturday     before           2021-10-01 2021-09-25 
#> 3 Wrap-up     NA         Friday       closest to       2021-12-01 2021-12-03
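For comparison, here is a small base R sketch of what Vectorize does with a scalar-only function. The function `clamp_label` is a made-up toy, not part of the original code.

```r
# Toy scalar function (hypothetical): `if` only works on a single value.
clamp_label <- function(x, limit) if (x > limit) "over" else "ok"

# Vectorize() returns a wrapper that loops over the arguments via mapply().
v_clamp_label <- Vectorize(clamp_label)

via_vectorize <- v_clamp_label(c(1, 5, 10), 4)
via_mapply    <- mapply(clamp_label, c(1, 5, 10), 4)

via_vectorize
via_mapply
```

Both calls loop over the elements one at a time, so this is a convenience rather than a speed-up.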

Created on 2021-05-11 by the reprex package (v1.0.0)
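On the question of whether the logic itself can be vectorized: for the two rules in the demo, one possible sketch replaces the datedb lookup with modular weekday arithmetic, so whole columns are processed at once. This is not part of the answer above. The function name `determine_date_vec` is hypothetical, it assumes English weekday names, and unlike the datedb approach it does not restrict results to a fixed date range.

```r
# Hypothetical helper (not from the original post): a vectorised take on the
# "before" and "closest to" rules using weekday arithmetic.
weekday_names <- c("Sunday", "Monday", "Tuesday", "Wednesday",
                   "Thursday", "Friday", "Saturday")

determine_date_vec <- function(date_start, weekday_near, near_desc, near_date) {
  target  <- match(weekday_near, weekday_names) - 1L  # 0 = Sunday ... 6 = Saturday
  current <- as.POSIXlt(near_date)$wday               # same 0-6 coding
  back    <- (current - target) %% 7                  # days back to the target weekday
  forward <- (target - current) %% 7                  # days forward to the target weekday

  # Signed day offset per row; stays NA when the rule is missing or unknown.
  offset     <- rep(NA_real_, length(near_date))
  is_before  <- !is.na(near_desc) & near_desc == "before"
  is_closest <- !is.na(near_desc) & near_desc == "closest to"
  # "before" is strict: if near_date already falls on the weekday, go back a week.
  offset[is_before]  <- -ifelse(back[is_before] == 0, 7, back[is_before])
  offset[is_closest] <- ifelse(back[is_closest] <= forward[is_closest],
                               -back[is_closest], forward[is_closest])

  result <- near_date + offset
  # An exact date, when present, takes precedence over any rule.
  has_exact <- !is.na(date_start)
  result[has_exact] <- date_start[has_exact]
  result
}

# The same three demo events, rebuilt with base R only.
demo_vec <- data.frame(
  Event            = c("Gala", "Celebration", "Wrap-up"),
  date_start       = as.Date(c("2021-09-01", NA, NA)),
  weekday_near     = c(NA, "Saturday", "Friday"),
  near_description = c(NA, "before", "closest to"),
  near_date        = as.Date(c(NA, "2021-10-01", "2021-12-01"))
)

demo_vec$start_date <- determine_date_vec(
  demo_vec$date_start, demo_vec$weekday_near,
  demo_vec$near_description, demo_vec$near_date
)
demo_vec
```

Whether this scales to a spreadsheet with about 15 controlling columns depends on how regular the real rules are; the pmap approach is easier to extend rule by rule.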
