Apply different functions to different columns programmatically in data.table R

I need to programmatically apply different functions to different columns, grouped by a key column, using data.table.

If the columns and functions were known in advance, I would do it like this:

library(data.table)
DT = data.table(id = rep(letters[1:3], each=3),
                v1 = rep(c(2, 3, 4), each=3),
                v2 = rep(c(5, 10, 15), each=3))
DT
#>    id v1 v2
#> 1:  a  2  5
#> 2:  a  2  5
#> 3:  a  2  5
#> 4:  b  3 10
#> 5:  b  3 10
#> 6:  b  3 10
#> 7:  c  4 15
#> 8:  c  4 15
#> 9:  c  4 15
DT[, .(v1=mean(v1), v2=sum(v2)), keyby=.(id)]
#>    id v1 v2
#> 1:  a  2 15
#> 2:  b  3 30
#> 3:  c  4 45

But I want to do this by passing the column names and the function specific to each column:

aggregate_functions = list(v1=mean, v2=sum)
col_selection = c('v1', 'v2')

I wrote something like this, but I can't figure out a way of passing the column name to lapply:

DT[, lapply(.SD, 
            aggregate_functions[[col_name]] # some way of selecting the right function from aggregate_functions
            ), 
   .SDcols = col_selection, 
   by=id]

I have also tried melt and dcast, but the latter applies all the functions to all the columns:

library(data.table)
DT = data.table(id = rep(letters[1:3], each=3),
                v1 = rep(c(2, 3, 4), each=3),
                v2 = rep(c(5, 10, 15), each=3))
DTm = melt(DT, measure.vars=col_selection, id.vars='id')
DTm
#>     id variable value
#>  1:  a       v1     2
#>  2:  a       v1     2
#>  3:  a       v1     2
#>  4:  b       v1     3
#>  5:  b       v1     3
#>  6:  b       v1     3
#>  7:  c       v1     4
#>  8:  c       v1     4
#>  9:  c       v1     4
#> 10:  a       v2     5
#> 11:  a       v2     5
#> 12:  a       v2     5
#> 13:  b       v2    10
#> 14:  b       v2    10
#> 15:  b       v2    10
#> 16:  c       v2    15
#> 17:  c       v2    15
#> 18:  c       v2    15
DTc = dcast(DTm, id~variable, fun.aggregate=list(sum, mean))
DTc
#>    id value_sum_v1 value_sum_v2 value_mean_v1 value_mean_v2
#> 1:  a            6           15             2             5
#> 2:  b            9           30             3            10
#> 3:  c           12           45             4            15

I could programmatically select and rename the relevant columns (3 and 4 in this case), but that doesn't look like an efficient approach.

Of course I could have a loop doing the job and merging the results, but I am looking for a data.table way.

Thank you for your answer, and thank you to the team at data.table.

Created on 2019-11-26 by the reprex package (v0.3.0)

After I posted the question, a link to this answer by @Uwe appeared in the sidebar on the right; it produces the result I am looking for. I tweaked it to match my example:

library(magrittr)
library(data.table)
DT = data.table(id = rep(letters[1:3], each=3),
                v1 = rep(c(2, 3, 4), each=3),
                v2 = rep(c(5, 10, 15), each=3))
aggregate_functions = list(v1='mean', v2='sum')
col_selection = c('v1', 'v2')
aggregate_functions %>%
  names() %>% 
  lapply(
    function(col_selection) lapply(
      aggregate_functions[[col_selection]], 
      function(.fct) sprintf("%s = %s(%s)", col_selection, .fct, col_selection))) %>% 
  unlist() %>% 
  paste(collapse = ", ") %>% 
  sprintf("DT[, .(%s), by = id]", .) %>% 
  parse(text = .) %>% 
  eval()
#>    id v1 v2
#> 1:  a  2 15
#> 2:  b  3 30
#> 3:  c  4 45
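
The string pasting and parse() step above can probably be avoided by building the call object directly. The following is only a sketch under the same assumptions as the example (function names stored as strings in aggregate_functions); the names calls and j_expr are introduced here purely for illustration:

```r
library(data.table)

DT <- data.table(id = rep(letters[1:3], each = 3),
                 v1 = rep(c(2, 3, 4), each = 3),
                 v2 = rep(c(5, 10, 15), each = 3))
aggregate_functions <- list(v1 = "mean", v2 = "sum")

# One unevaluated call per column: mean(v1), sum(v2).
# Map() names the result after the column names passed as first argument.
calls <- Map(function(col, fct) call(fct, as.name(col)),
             names(aggregate_functions), aggregate_functions)

# Wrap the calls into list(v1 = mean(v1), v2 = sum(v2))
j_expr <- as.call(c(list(as.name("list")), calls))

# data.table evaluates the constructed expression within each group
res <- DT[, eval(j_expr), by = id]
res
```

Printing j_expr before running the last line is a convenient way to check that the generated expression matches the hand-written .(v1=mean(v1), v2=sum(v2)).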

I would still be interested in 'all in data.table' solutions.

Created on 2019-11-26 by the reprex package (v0.3.0)

An option is to use mapply:

DT[, mapply(function(f,x) as.list(f(x)), aggregate_functions, .SD), id, 
    .SDcols=col_selection]

You need to be careful about the ordering of col_selection and aggregate_functions: mapply pairs them by position, so they must be in the same order for the right function to be applied to the right column.

Output:

   id v1 v2
1:  a  2 15
2:  b  3 30
3:  c  4 45
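
One way to guard against the ordering caveat is to subset aggregate_functions by col_selection, so that the functions are reordered by name before they are paired with the columns. This is a minimal sketch, not part of the original answer: it uses Map instead of mapply, and the list is deliberately written in the "wrong" order to illustrate the point.

```r
library(data.table)

DT <- data.table(id = rep(letters[1:3], each = 3),
                 v1 = rep(c(2, 3, 4), each = 3),
                 v2 = rep(c(5, 10, 15), each = 3))

# Deliberately listed in a different order from col_selection
aggregate_functions <- list(v2 = sum, v1 = mean)
col_selection <- c("v1", "v2")

# aggregate_functions[col_selection] reorders the functions by name,
# so each one lines up with the matching column of .SD
res <- DT[, Map(function(f, x) f(x), aggregate_functions[col_selection], .SD),
          by = id, .SDcols = col_selection]
res
```

With this indexing, mean is applied to v1 and sum to v2 regardless of the order in which the list was written.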

Edit from the OP:

Just to complete this brilliant solution: it works very well, and if we replace col_selection with names(aggregate_functions) there is no issue with the ordering. It also automatically discards all the columns that are not in the list:

library(data.table)
DT = data.table(id = rep(letters[1:3], each=3),
                v1 = rep(c(2, 3, 4), each=3),
                v2 = rep(c(5, 10, 15), each=3),
                id2 = c(rep(c('cc', 'dd'), 4), 'dd')
                )
aggregate_functions = list(v1=mean, v2=sum)
DT[, mapply(function(f,x) as.list(f(x)), aggregate_functions, .SD), id, 
   .SDcols=names(aggregate_functions)]
#>    id v1 v2
#> 1:  a  2 15
#> 2:  b  3 30
#> 3:  c  4 45

It is also possible to aggregate by multiple variables, by passing a list:

DT[, mapply(function(f,x) as.list(f(x)), aggregate_functions, .SD), list(id, id2), 
   .SDcols=names(aggregate_functions)]
#>    id id2 v1 v2
#> 1:  a  cc  2 10
#> 2:  a  dd  2  5
#> 3:  b  dd  3 20
#> 4:  b  cc  3 10
#> 5:  c  cc  4 15
#> 6:  c  dd  4 30

Created on 2019-11-27 by the reprex package (v0.3.0)
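
Since the goal is to do everything programmatically, the grouping columns can also be supplied as a character vector instead of list(id, id2). The sketch below assumes the same DT and aggregate_functions as in the edit above; group_cols is a name introduced here for illustration only.

```r
library(data.table)

DT <- data.table(id = rep(letters[1:3], each = 3),
                 v1 = rep(c(2, 3, 4), each = 3),
                 v2 = rep(c(5, 10, 15), each = 3),
                 id2 = c(rep(c("cc", "dd"), 4), "dd"))
aggregate_functions <- list(v1 = mean, v2 = sum)

# Grouping columns given by name, so they can come from a variable
group_cols <- c("id", "id2")

res <- DT[, mapply(function(f, x) as.list(f(x)), aggregate_functions, .SD),
          by = group_cols, .SDcols = names(aggregate_functions)]
res
```

Using keyby instead of by would additionally sort the result by the grouping columns.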
