Group and mutate with function and conditional function arguments in R
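Please consider the following:
A custom function CustomFun takes multiple numeric arguments. The argument names are stored in resp and correspond to the function's argument names. The argument values are stored in the column val.
The data.frame contains information on several patients (id), so the data needs to be grouped by id.
Question:
How can we apply a custom function to a grouped data.frame or data.table when the function takes its arguments from columns in the same data structure?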
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(data.table)
#>
#> Attaching package: 'data.table'
#> The following objects are masked from 'package:dplyr':
#>
#> between, first, last
# The data
df.x <- data.frame(id = rep(c(1:2), each = 5),
resp = c("val.a", "val.b", "val.c", "val.d", "val.e"),
val = c(10, 15, NA, NA, NA,
1, 5, NA, NA, NA))
df.x
#> id resp val
#> 1 1 val.a 10
#> 2 1 val.b 15
#> 3 1 val.c NA
#> 4 1 val.d NA
#> 5 1 val.e NA
#> 6 2 val.a 1
#> 7 2 val.b 5
#> 8 2 val.c NA
#> 9 2 val.d NA
#> 10 2 val.e NA
# A simple function (minimal replicable example)
CustomFun <- function(a,b){
a+b
}
Desired output:
# Desired output
df.x %>% mutate(res = c(25, 25, NA, NA, NA, 6, 6, NA, NA, NA))
#> id resp val res
#> 1 1 val.a 10 25
#> 2 1 val.b 15 25
#> 3 1 val.c NA NA
#> 4 1 val.d NA NA
#> 5 1 val.e NA NA
#> 6 2 val.a 1 6
#> 7 2 val.b 5 6
#> 8 2 val.c NA NA
#> 9 2 val.d NA NA
#> 10 2 val.e NA NA
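Own approach:
This approach works when there is no grouping (id). That the rows other than val.a and val.b do not end up as NA is not a problem, because they can be filtered out in a second step.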
# Approach without the need of grouping: one id only, problem: NA also assigned to val in df.z[3:5, ]
# dplyr
df.z <- df.x %>% slice(1:5)
df.z
#> id resp val
#> 1 1 val.a 10
#> 2 1 val.b 15
#> 3 1 val.c NA
#> 4 1 val.d NA
#> 5 1 val.e NA
df.z %>% mutate(test = CustomFun(a = df.z %>% filter(resp == "val.a") %>% pull(val),
b = df.z %>% filter(resp == "val.b") %>% pull(val))
)
#> id resp val test
#> 1 1 val.a 10 25
#> 2 1 val.b 15 25
#> 3 1 val.c NA 25
#> 4 1 val.d NA 25
#> 5 1 val.e NA 25
# data.table
setDT(df.z)[, .(test= CustomFun(a = setDT(df.z)[resp == "val.a", val],
b = setDT(df.z)[resp == "val.b", val])),
by = .(id, val, resp)]
#> id val resp test
#> 1: 1 10 val.a 25
#> 2: 1 15 val.b 25
#> 3: 1 NA val.c 25
#> 4: 1 NA val.d 25
#> 5: 1 NA val.e 25
# NOT working for groups =====================================
# data.frame
df.x %>%
group_by(id) %>%
mutate(test = CustomFun(a = df.x %>% filter(resp == "val.a") %>% pull(val),
b = df.x %>% filter(resp == "val.b") %>% pull(val))
)
#> Error in mutate_impl(.data, dots): Column `test` must be length 5 (the group size) or one, not 2
# data.table
setDT(df.x)[, .(test= CustomFun(a = setDT(df.x)[resp == "val.a", val],
b = setDT(df.x)[resp == "val.b", val])),
by = .(id, val, resp)]
#> id val resp test
#> 1: 1 10 val.a 25
#> 2: 1 10 val.a 6
#> 3: 1 15 val.b 25
#> 4: 1 15 val.b 6
#> 5: 1 NA val.c 25
#> 6: 1 NA val.c 6
#> 7: 1 NA val.d 25
#> 8: 1 NA val.d 6
#> 9: 1 NA val.e 25
#> 10: 1 NA val.e 6
#> 11: 2 1 val.a 25
#> 12: 2 1 val.a 6
#> 13: 2 5 val.b 25
#> 14: 2 5 val.b 6
#> 15: 2 NA val.c 25
#> 16: 2 NA val.c 6
#> 17: 2 NA val.d 25
#> 18: 2 NA val.d 6
#> 19: 2 NA val.e 25
#> 20: 2 NA val.e 6
Created on 2018-11-13 by the reprex package (v0.2.1)
Thanks a lot!
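There are two separate problems: in the data.table version you added extra grouping variables, and in both versions you subset the data incorrectly. Inside a grouped expression, refer to the group's own columns (val, resp) rather than to the whole df.x; subsetting df.x returns one value per id, which is where the length-2 result comes from.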
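To make the subsetting problem concrete, here is a minimal base R sketch using the data and CustomFun from the question. The intermediate names (a_all, b_all, res_all, g1, res_g1) are introduced here for illustration only. It contrasts subsetting the whole data frame with subsetting a single group.

```r
# Same data and function as in the question
df.x <- data.frame(id   = rep(1:2, each = 5),
                   resp = rep(c("val.a", "val.b", "val.c", "val.d", "val.e"), 2),
                   val  = c(10, 15, NA, NA, NA, 1, 5, NA, NA, NA),
                   stringsAsFactors = FALSE)

CustomFun <- function(a, b) a + b

# Subsetting the whole data frame ignores the grouping:
# one value per id comes back, so the result has length 2.
a_all   <- df.x$val[df.x$resp == "val.a"]
b_all   <- df.x$val[df.x$resp == "val.b"]
res_all <- CustomFun(a_all, b_all)

# Subsetting within one group (here id == 1) yields a single value,
# which can then be recycled over the rows of that group.
g1     <- df.x[df.x$id == 1, ]
res_g1 <- CustomFun(a = g1$val[g1$resp == "val.a"],
                    b = g1$val[g1$resp == "val.b"])

length(res_all)  # 2: cannot fill a group of 5 rows
length(res_g1)   # 1: can be recycled
```

The length-2 result is what dplyr rejects in the grouped mutate, and it is also why the data.table call grouped by id, val and resp returns 20 rows: each of the 10 single-row groups receives both values.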
data.table adjustment:
setDT(df.x)[!is.na(val), test := CustomFun(a = val[resp == "val.a"],
b = val[resp == "val.b"]), by = id]
There is no need to group by resp and val; grouping by id is enough.
For dplyr, you can do this:
df.x %>%
group_by(id) %>%
mutate(test = if_else(!is.na(val), CustomFun(a = val[resp == "val.a"],
b = val[resp == "val.b"]), NA_real_)
)
Output in both cases:
id resp val test
1: 1 val.a 10 25
2: 1 val.b 15 25
3: 1 val.c NA NA
4: 1 val.d NA NA
5: 1 val.e NA NA
6: 2 val.a 1 6
7: 2 val.b 5 6
8: 2 val.c NA NA
9: 2 val.d NA NA
10: 2 val.e NA NA
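The question states that the values in resp correspond to the function's argument names. Under that assumption, the arguments do not have to be written out one by one: a named list can be built per group and passed with do.call. The helper call_by_name below is hypothetical (it is not part of dplyr or of the original answers), and it assumes that every resp value has the form "val.<argument name>" and that each argument appears once per id.

```r
library(dplyr)

# Same data and function as in the question
df.x <- data.frame(id   = rep(1:2, each = 5),
                   resp = rep(c("val.a", "val.b", "val.c", "val.d", "val.e"), 2),
                   val  = c(10, 15, NA, NA, NA, 1, 5, NA, NA, NA),
                   stringsAsFactors = FALSE)

CustomFun <- function(a, b) a + b

# Hypothetical helper: build a named argument list from one group's rows
# and call `fun` with it. Assumes resp looks like "val.<argument name>".
call_by_name <- function(fun, resp, val) {
  arg_names <- sub("^val\\.", "", resp)        # "val.a" -> "a"
  keep <- arg_names %in% names(formals(fun))   # keep only arguments fun accepts
  do.call(fun, setNames(as.list(val[keep]), arg_names[keep]))
}

res <- df.x %>%
  group_by(id) %>%
  mutate(test = if_else(!is.na(val),
                        call_by_name(CustomFun, resp, val),
                        NA_real_)) %>%
  ungroup()

res
```

With this layout, adding a third argument to CustomFun would only require the matching resp row to carry a value; the mutate call itself would stay the same.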
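We can subset the values by group (assuming there is only one 'val.a' and one 'val.b' per 'id') and then add them. Multiplying by NA^(is.na(val)) keeps the sum where val is present (NA^0 is 1) and gives NA where val is missing (NA^1 is NA).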
library(dplyr)
df.x %>%
group_by(id) %>%
mutate(res = (val[resp == 'val.a'] + val[resp == 'val.b']) * NA^(is.na(val)))
# A tibble: 10 x 4
# Groups: id [2]
# id resp val res
# <int> <fct> <dbl> <dbl>
# 1 1 val.a 10 25
# 2 1 val.b 15 25
# 3 1 val.c NA NA
# 4 1 val.d NA NA
# 5 1 val.e NA NA
# 6 2 val.a 1 6
# 7 2 val.b 5 6
# 8 2 val.c NA NA
# 9 2 val.d NA NA
#10 2 val.e NA NA
Another option is to filter, summarise by group, and then join the result back to the original dataset:
df.x %>%
filter(resp %in% c('val.a', 'val.b')) %>%
group_by(id) %>%
summarise(res = sum(val)) %>%
right_join(df.x, by = "id") %>%
mutate(res = replace(res, is.na(val), NA))