
User defined function with mutate and case_when in R

I would like to know whether, and how, I can turn the call below into a function for a task I perform fairly often with my data. I can't figure out how to design a function from a call that involves mutate and case_when; both functions come from the dplyr package and take a number of additional arguments.

The call itself also seems redundant to me, with case_when repeated so many times; perhaps the number of repetitions can be reduced.

Any help or information about alternative approaches is welcome.

The call looks like this:

library(dplyr)
library(stringr)

test_data %>%
  mutate(
    multipleoptions_o1 = case_when(
      str_detect(multipleoptions, "option1") ~ 1,
      is.na(multipleoptions) ~ NA_real_,
      TRUE ~ 0),
    multipleoptions_o2 = case_when(
      str_detect(multipleoptions, "option2") ~ 1,
      is.na(multipleoptions) ~ NA_real_,
      TRUE ~ 0),
    multipleoptions_o3 = case_when(
      str_detect(multipleoptions, "option3") ~ 1,
      is.na(multipleoptions) ~ NA_real_,
      TRUE ~ 0),
    multipleoptions_o4 = case_when(
      str_detect(multipleoptions, "option4") ~ 1,
      is.na(multipleoptions) ~ NA_real_,
      TRUE ~ 0)
  )

Sample data:

structure(list(multipleoptions = c("option1", "option2", "option3", 
NA, "option2,option3", "option4")), row.names = c(NA, -6L), class = c("tbl_df", 
"tbl", "data.frame"))

Desired output of the function:

structure(list(multipleoptions = c("option1", "option2", "option3", 
NA, "option2,option3", "option4"), multipleoptions_o1 = c(1, 
0, 0, NA, 0, 0), multipleoptions_o2 = c(0, 1, 0, NA, 1, 0), multipleoptions_o3 = c(0, 
0, 1, NA, 1, 0), multipleoptions_o4 = c(0, 0, 0, NA, 0, 1)), class = c("tbl_df", 
"tbl", "data.frame"), row.names = c(NA, -6L))

The arguments of the function should probably be: data (the input dataset); multipleoptions (the column of data containing the answer options, always a single column); patterns_to_look_for (the str_detect patterns to look up in multipleoptions); number_of_options (ideally the number of options can be more or fewer than 4, though I am not sure that is achievable); and output_columns (the names of the new columns, always the name of the original column followed by the option number or option name).

You can avoid the lengthy case_when code by splitting the options into separate elements, using nesting/unnesting to get a single column of options, and then spreading to get a separate column for each option.
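As a rough illustration of the split, unnest, and spread idea, here is a minimal sketch written with tidyr's separate_rows and pivot_wider. These verbs are my substitution for the nest/unnest and spread steps described above, and the row_id column and the object name wide are assumptions made for the example. Note that rows with a missing answer are dropped here rather than kept as NA.

```r
library(dplyr)
library(tidyr)

# Sample data from the question
test_data <- tibble(
  multipleoptions = c("option1", "option2", "option3", NA,
                      "option2,option3", "option4")
)

wide <- test_data %>%
  # row_id (assumed name) keeps rows with identical answers apart
  mutate(row_id = row_number(), option = multipleoptions) %>%
  # one row per chosen option
  separate_rows(option, sep = ",") %>%
  # rows with a missing answer are dropped in this sketch
  filter(!is.na(option)) %>%
  mutate(value = 1) %>%
  # one 0/1 column per option
  pivot_wider(names_from = option, values_from = value, values_fill = 0) %>%
  arrange(row_id)

wide
```

The number of options does not have to be known in advance: pivot_wider creates one column for each distinct option it finds.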

Updated Answer

library(tidyverse)

# Arguments
# data     A data frame
# patterns Regular expression giving the pattern(s) at which to split the options strings
# ...      Grouping columns, the first of which must be the "options" column.
#           If options has repeated values, then there must be a second grouping 
#           column (an "ID" column) to differentiate these repeated values.
fnc = function(data, patterns, ...) {
  col = quos(...)

  data %>% 
    mutate(option=str_split(!!col[[1]], patterns)) %>% 
    unnest(option) %>% 
    mutate(value=1) %>% 
    group_by(!!!col) %>% 
    mutate(num_chosen = ifelse(is.na(!!col[[1]]), 0, sum(value))) %>% 
    spread(option, value, fill=0) %>%
    select_at(vars(-matches("NA")))
}

fnc(test_data, ",", multipleoptions)
  multipleoptions num_chosen option1 option2 option3 option4
1 option1                  1       1       0       0       0
2 option2                  1       0       1       0       0
3 option2,option3          2       0       1       1       0
4 option3                  1       0       0       1       0
5 option4                  1       0       0       0       1
6 <NA>                     0       0       0       0       0

# Fake data
ops = paste0("option",1:4)

set.seed(2)
d = tibble(var=replicate(20, paste(sample(ops, sample(1:4,1, prob=c(10,8,5,1))), collapse=","))) 
# Add missing values
d = bind_rows(d[1:5,], tibble(var=rep(NA_character_,3)), d[6:nrow(d),])

fnc(d %>% mutate(ID=1:n()), ",", var, ID)
   var                             ID num_chosen option1 option2 option3 option4
 1 option1                         17          1       1       0       0       0
 2 option1,option2                 12          2       1       1       0       0
 3 option1,option2,option3          5          3       1       1       1       0
 4 option1,option2,option4,option3  9          4       1       1       1       1
 5 option1,option3                  2          2       1       0       1       0
 6 option1,option3,option4          3          3       1       0       1       1
 7 option1,option4,option2         20          3       1       1       0       1
 8 option1,option4,option3,option2 13          4       1       1       1       1
 9 option2                         11          1       0       1       0       0
10 option2,option3                 23          2       0       1       1       0
11 option2,option3,option4         21          3       0       1       1       1
12 option3                          1          1       0       0       1       0
13 option3                         15          1       0       0       1       0
14 option3,option1                  4          2       1       0       1       0
15 option3,option2,option4         14          3       0       1       1       1
16 option3,option4,option2,option1 22          4       1       1       1       1
17 option4                         10          1       0       0       0       1
18 option4                         16          1       0       0       0       1
19 option4                         18          1       0       0       0       1
20 option4,option2,option3         19          3       0       1       1       1
21 <NA>                             6          0       0       0       0       0
22 <NA>                             7          0       0       0       0       0
23 <NA>                             8          0       0       0       0       0
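If the exact layout asked for in the question is needed (columns named multipleoptions_o1, multipleoptions_o2, and so on, with NA preserved for missing answers), the original case_when logic can be wrapped in a loop over the patterns. The following is only a sketch: add_option_flags and its suffixes argument are hypothetical names, and matching uses fixed(), so "option1" would also match a string such as "option10".

```r
library(dplyr)
library(stringr)

# Hypothetical helper (not from the answer above): adds one 0/1/NA
# indicator column per pattern, named <column>_<suffix>.
add_option_flags <- function(data, col, patterns,
                             suffixes = paste0("o", seq_along(patterns))) {
  col_name <- rlang::as_name(rlang::ensym(col))  # column name as a string
  answers <- data[[col_name]]
  for (i in seq_along(patterns)) {
    data[[paste0(col_name, "_", suffixes[i])]] <- case_when(
      is.na(answers) ~ NA_real_,                        # keep missing answers as NA
      str_detect(answers, fixed(patterns[i])) ~ 1,      # literal substring match
      TRUE ~ 0
    )
  }
  data
}

# Sample data from the question
test_data <- tibble(
  multipleoptions = c("option1", "option2", "option3", NA,
                      "option2,option3", "option4")
)

result <- add_option_flags(test_data, multipleoptions, paste0("option", 1:4))
result
```

Because the loop runs over patterns, the number of options can be more or fewer than four without changing the function.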

Original Answer

test_data %>% 
  filter(!is.na(multipleoptions)) %>% 
  mutate(option=str_split(multipleoptions, ",")) %>% 
  unnest(option) %>% 
  mutate(value=1) %>% 
  spread(option, value)
  multipleoptions option1 option2 option3 option4
  <chr>             <dbl>   <dbl>   <dbl>   <dbl>
1 option1               1      NA      NA      NA
2 option2              NA       1      NA      NA
3 option2,option3      NA       1       1      NA
4 option3              NA      NA       1      NA
5 option4              NA      NA      NA       1

Packaging this into a function:

fnc = function(data, col, patterns) {
  col = enquo(col)

  data %>% 
    filter(!is.na(!!col)) %>% 
    mutate(option=str_split(!!col, patterns)) %>% 
    unnest(option) %>% 
    mutate(value=1) %>% 
    spread(option, value)
}


fnc(test_data, multipleoptions, ",")

If your real data has more than one row with the same value of multipleoptions, then this code will work only if there is also an ID column that distinguishes the rows sharing that value. For example:

# Fake data
ops = paste0("option",1:4)

set.seed(2)
d = data.frame(var=replicate(20, paste(sample(ops, sample(1:4,1, prob=c(10,8,5,1))), collapse=",")))

fnc(d, var, ",")

Error: Duplicate identifiers for rows (1, 27), (16, 28, 30)

# Add unique row identifier
fnc(d %>% mutate(ID = 1:n()), var, ",")
   var                             ID option1 option2 option3 option4
 1 option1                         14       1      NA      NA      NA
 2 option1,option2                  9       1       1      NA      NA
 3 option1,option2,option3          5       1       1       1      NA
 4 option1,option2,option4,option3  6       1       1       1       1
 5 option1,option3                  2       1      NA       1      NA
 6 option1,option3,option4          3       1      NA       1       1
 7 option1,option4,option2         17       1       1      NA       1
 8 option1,option4,option3,option2 10       1       1       1       1
 9 option2                          8      NA       1      NA      NA
10 option2,option3                 20      NA       1       1      NA
11 option2,option3,option4         18      NA       1       1       1
12 option3                          1      NA      NA       1      NA
13 option3                         12      NA      NA       1      NA
14 option3,option1                  4       1      NA       1      NA
15 option3,option2,option4         11      NA       1       1       1
16 option3,option4,option2,option1 19       1       1       1       1
17 option4                          7      NA      NA      NA       1
18 option4                         13      NA      NA      NA       1
19 option4                         15      NA      NA      NA       1
20 option4,option2,option3         16      NA       1       1       1
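One way to spare the caller from adding the ID column by hand is to create a row key inside the function. The sketch below is a hypothetical variant, not the function above: fnc_with_id and the .row_id column are assumed names, and pivot_wider stands in for spread.

```r
library(dplyr)
library(tidyr)
library(stringr)

# Hypothetical variant of fnc() that builds its own row key (.row_id),
# so repeated answer strings no longer collide when reshaping.
fnc_with_id <- function(data, col, patterns) {
  data %>%
    mutate(.row_id = row_number()) %>%
    filter(!is.na({{ col }})) %>%
    mutate(option = str_split({{ col }}, patterns)) %>%
    unnest(option) %>%
    mutate(value = 1) %>%
    pivot_wider(names_from = option, values_from = value) %>%
    arrange(.row_id)
}

# Small made-up example with a repeated answer and a missing one
d_small <- tibble(var = c("option1", "option1", "option1,option2", NA))
out <- fnc_with_id(d_small, var, ",")
out
```

As in the original answer, options that were not chosen are left as NA and rows with a missing answer are filtered out.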
