Apply different functions to different columns programmatically in data.table R
I need to programmatically apply different functions to different columns, with grouping, using data.table.
If the columns and functions were known in advance, I would do it like this:
library(data.table)
DT = data.table(id = rep(letters[1:3], each=3),
v1 = rep(c(2, 3, 4), each=3),
v2 = rep(c(5, 10, 15), each=3))
DT
#> id v1 v2
#> 1: a 2 5
#> 2: a 2 5
#> 3: a 2 5
#> 4: b 3 10
#> 5: b 3 10
#> 6: b 3 10
#> 7: c 4 15
#> 8: c 4 15
#> 9: c 4 15
DT[, .(v1=mean(v1), v2=sum(v2)), keyby=.(id)]
#> id v1 v2
#> 1: a 2 15
#> 2: b 3 30
#> 3: c 4 45
But I want to do this by passing the column names and the specific function for each:
aggregate_functions = list(v1=mean, v2=sum)
col_selection = c('v1', 'v2')
I wrote something like this, but I can't figure out a way of passing the column name to lapply:
DT[, lapply(.SD,
aggregate_functions[[col_name]] # some way of selecting the right function from aggregate_functions
),
.SDcols = col_selection,
by=id]
I have also tried melt and dcast, but the latter applies all the functions to all the columns:
library(data.table)
DT = data.table(id = rep(letters[1:3], each=3),
v1 = rep(c(2, 3, 4), each=3),
v2 = rep(c(5, 10, 15), each=3))
DTm = melt(DT, measure.vars=col_selection, id.vars='id')
DTm
#> id variable value
#> 1: a v1 2
#> 2: a v1 2
#> 3: a v1 2
#> 4: b v1 3
#> 5: b v1 3
#> 6: b v1 3
#> 7: c v1 4
#> 8: c v1 4
#> 9: c v1 4
#> 10: a v2 5
#> 11: a v2 5
#> 12: a v2 5
#> 13: b v2 10
#> 14: b v2 10
#> 15: b v2 10
#> 16: c v2 15
#> 17: c v2 15
#> 18: c v2 15
DTc = dcast(DTm, id~variable, fun.aggregate=list(sum, mean))
DTc
#> id value_sum_v1 value_sum_v2 value_mean_v1 value_mean_v2
#> 1: a 6 15 2 5
#> 2: b 9 30 3 10
#> 3: c 12 45 4 15
I could programmatically select and rename the relevant columns (3 and 4 in this case), but that doesn't look like an efficient approach.
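For reference, here is a rough sketch of what that select-and-rename step could look like. It assumes dcast keeps naming the wide columns value_<function>_<column>, as in the output above. The lookup vector wanted_fun is a hypothetical helper, not something from the original code.

```r
library(data.table)

DT <- data.table(id = rep(letters[1:3], each = 3),
                 v1 = rep(c(2, 3, 4), each = 3),
                 v2 = rep(c(5, 10, 15), each = 3))

DTm <- melt(DT, id.vars = "id", measure.vars = c("v1", "v2"))
DTc <- dcast(DTm, id ~ variable, value.var = "value",
             fun.aggregate = list(sum, mean))

# Hypothetical lookup: the function name wanted for each column
wanted_fun <- c(v1 = "mean", v2 = "sum")

# Assumption: wide columns are named value_<function>_<column>
wide_names <- sprintf("value_%s_%s", wanted_fun, names(wanted_fun))

# Keep only the wanted aggregates and give them back their short names
res <- DTc[, c("id", wide_names), with = FALSE]
setnames(res, wide_names, names(wanted_fun))
res
```

This still computes every function on every column before discarding half of the result, which is why it does not scale well.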
Of course I could have a loop do the job and merge the results, but I am looking for a data.table way.
Thank you for your answer, and thank you to the team at data.table.
Created on 2019-11-26 by the reprex package (v0.3.0)
After I posted the question, a link to this answer by @Uwe appeared in the right-hand sidebar, and it gives the result I am looking for. I tweaked it to match my example:
library(magrittr)
library(data.table)
DT = data.table(id = rep(letters[1:3], each=3),
v1 = rep(c(2, 3, 4), each=3),
v2 = rep(c(5, 10, 15), each=3))
aggregate_functions = list(v1='mean', v2='sum')
col_selection = c('v1', 'v2')
aggregate_functions %>%
names() %>%
lapply(
function(col_selection) lapply(
aggregate_functions[[col_selection]],
function(.fct) sprintf("%s = %s(%s)", col_selection, .fct, col_selection))) %>%
unlist() %>%
paste(collapse = ", ") %>%
sprintf("DT[, .(%s), by = id]", .) %>%
parse(text = .) %>%
eval()
#> id v1 v2
#> 1: a 2 15
#> 2: b 3 30
#> 3: c 4 45
I would still be interested in 'all in data.table' solutions.
Created on 2019-11-26 by the reprex package (v0.3.0)
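As a possible variation on the string-building pipeline above, the same expression can be assembled from language objects instead of pasted text, which avoids parse(). This is only a sketch under the same assumption that the functions are given by name as character strings. The names calls and j_expr are hypothetical.

```r
library(data.table)

DT <- data.table(id = rep(letters[1:3], each = 3),
                 v1 = rep(c(2, 3, 4), each = 3),
                 v2 = rep(c(5, 10, 15), each = 3))

aggregate_functions <- list(v1 = "mean", v2 = "sum")

# Build one unevaluated call per column, e.g. mean(v1) and sum(v2).
# Map names the result after the column names passed as first argument.
calls <- Map(function(col, fct) call(fct, as.name(col)),
             names(aggregate_functions), aggregate_functions)

# Wrap the calls in list(...) so that j returns one named column per entry
j_expr <- as.call(c(as.name("list"), calls))

res <- DT[, eval(j_expr), by = id]
res
```

Printing j_expr before evaluating it is a quick way to check that the generated expression matches what you would have typed by hand.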
One option is to use mapply:
DT[, mapply(function(f,x) as.list(f(x)), aggregate_functions, .SD), id,
.SDcols=col_selection]
You need to be careful with the ordering of col_selection and aggregate_functions so that the right function is applied to the right column.
Output:
id v1 v2
1: a 2 15
2: b 3 30
3: c 4 45
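To illustrate the ordering caveat, here is a minimal sketch that reorders the function list by name before calling mapply, so that position i of the functions always lines up with column i of .SD. The list is deliberately declared in the "wrong" order. The name funs_ordered is a hypothetical helper.

```r
library(data.table)

DT <- data.table(id = rep(letters[1:3], each = 3),
                 v1 = rep(c(2, 3, 4), each = 3),
                 v2 = rep(c(5, 10, 15), each = 3))

# Deliberately listed in a different order than col_selection
aggregate_functions <- list(v2 = sum, v1 = mean)
col_selection <- c("v1", "v2")

# Fail early if a selected column has no function assigned
stopifnot(all(col_selection %in% names(aggregate_functions)))

# Index by name so the functions follow the order of col_selection
funs_ordered <- aggregate_functions[col_selection]

res <- DT[, mapply(function(f, x) as.list(f(x)), funs_ordered, .SD),
          by = id, .SDcols = col_selection]
res
```

Without the reordering step, mapply would pair sum with v1 and mean with v2, because it matches its arguments purely by position.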
Edit from the OP:
Just to complete this brilliant solution: it works very well, and if we replace col_selection with names(aggregate_functions), there is no issue with the ordering. It also automatically discards all the columns that are not in the list:
library(data.table)
DT = data.table(id = rep(letters[1:3], each=3),
v1 = rep(c(2, 3, 4), each=3),
v2 = rep(c(5, 10, 15), each=3),
id2 = c(rep(c('cc', 'dd'), 4), 'dd')
)
aggregate_functions = list(v1=mean, v2=sum)
DT[, mapply(function(f,x) as.list(f(x)), aggregate_functions, .SD), id,
.SDcols=names(aggregate_functions)]
#> id v1 v2
#> 1: a 2 15
#> 2: b 3 30
#> 3: c 4 45
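As a side note, the same idea can probably be written with Map, which always returns a list and therefore does not need the as.list() wrapper used in the mapply call. This is a sketch rather than a tested replacement for the answer above, and it assumes each function returns a single value per group.

```r
library(data.table)

DT <- data.table(id = rep(letters[1:3], each = 3),
                 v1 = rep(c(2, 3, 4), each = 3),
                 v2 = rep(c(5, 10, 15), each = 3),
                 id2 = c(rep(c("cc", "dd"), 4), "dd"))

aggregate_functions <- list(v1 = mean, v2 = sum)

# Map pairs each function with the matching column of .SD by position
# and names the result after the function list (v1, v2)
res <- DT[, Map(function(f, x) f(x), aggregate_functions, .SD),
          by = id, .SDcols = names(aggregate_functions)]
res
```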
It is also possible to group by multiple variables by passing a list:
DT[, mapply(function(f,x) as.list(f(x)), aggregate_functions, .SD), list(id, id2),
.SDcols=names(aggregate_functions)]
#> id id2 v1 v2
#> 1: a cc 2 10
#> 2: a dd 2 5
#> 3: b dd 3 20
#> 4: b cc 3 10
#> 5: c cc 4 15
#> 6: c dd 4 30
Created on 2019-11-27 by the reprex package (v0.3.0)
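If the grouping columns also need to be chosen programmatically, by accepts a character vector of column names as well. Here is a small sketch. The variable group_cols is a hypothetical name, picked so that it does not clash with any column of DT.

```r
library(data.table)

DT <- data.table(id = rep(letters[1:3], each = 3),
                 v1 = rep(c(2, 3, 4), each = 3),
                 v2 = rep(c(5, 10, 15), each = 3),
                 id2 = c(rep(c("cc", "dd"), 4), "dd"))

aggregate_functions <- list(v1 = mean, v2 = sum)

# Grouping columns supplied as strings instead of list(id, id2)
group_cols <- c("id", "id2")

res <- DT[, mapply(function(f, x) as.list(f(x)), aggregate_functions, .SD),
          by = group_cols, .SDcols = names(aggregate_functions)]
res
```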