How to speed up cross-column calculations in a grouped data frame

Question

--- Crossposting from the RStudio Community forum for potential solutions outside of the tidyverse. ---

The basic situation is that the calculations are independent between groups, but each group needs to be fed some arguments calculated from itself. The trivial example is to find the index of the first element that is less than half of the column maximum. The only twist is that one column, X, needs to use a maximum calculated from the other columns A, B, C (in the code below, the mean of their maxima).
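To make the task concrete, here is a minimal base R sketch for a single group. The vectors and the names cap_X, first_A and first_X are made up for illustration and are not part of the real data.

```r
# Toy values for one group (illustrative only).
A <- c(8, 3, 6)
B <- c(4, 10, 1)
C <- c(6, 2, 5)
X <- c(7, 5, 3)

# Ordinary column: compare against half of its own maximum.
# max(A) is 8, so the threshold is 4 and the first value below it is A[2].
first_A <- match(TRUE, A < 0.5 * max(A))

# Column X: the cap is derived from the other columns,
# here the mean of their maxima: mean(c(8, 10, 6)) = 8.
cap_X <- mean(c(max(A), max(B), max(C)))
first_X <- match(TRUE, X < 0.5 * cap_X)

first_A  # 2
first_X  # 3
```

The grouped problem is this same computation repeated once per value of grp.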

I have a solution using group_map (similar to do) from my earlier question on grouped calculation, but the performance does not appear to be optimal. It seems that summarise_at takes much longer when used inside group_map than when it is called directly on the grouped data frame (compare the timings below).

library(tidyverse)

times <- 1e5
cols <- 4
df3 <- as.data.frame(x = matrix(rnorm(times * cols, mean = 5), ncol = cols)) %>% 
   rename(A = V1, B = V2, C = V3, X = V4)

df3 <- cbind(grp = rep(seq_len(1e3), each = 100), df3) %>% 
   group_by(grp)

system.time(
  df3 %>% 
    group_map(~
    { 
      all_max <- summarise_at(., vars(A:C), max) %>% mutate(X = rowMeans(.))
      map2_df(., all_max, ~match(TRUE, .x < 0.5 * .y))
    }
    )
)
#>    user  system elapsed 
#>    3.87    0.00    3.98

system.time(
  df3 %>% summarise_at(vars(A:C), max) %>% mutate(X = rowMeans(.))
)
#>    user  system elapsed 
#>    0.02    0.00    0.01

system.time(
  df3 %>% summarise_at(vars(A:X), ~match(TRUE, . < 0.5 * max(.)))
)  
#>    user  system elapsed 
#>    0.25    0.02    0.26

Created on 2019-04-05 by the reprex package (v0.2.1)

Any ideas for improving the performance? It seems that most functions are column-based, and I have not yet found a way to do this simple task efficiently.

Answer 1

From what I can tell, this accomplishes the same as your code in less than half a second on my machine:

library(data.table)
DT = as.data.table(matrix(rnorm(times * cols, mean = 5), times, cols))
setnames(DT, c('A', 'B', 'C', 'X'))
DT[ , grp := rep(seq_len(1e3), each = 100)]

setkey(DT, grp)

DT[DT[ , lapply(.SD, max), keyby = grp, .SDcols = !'X'
       ][ , X := Reduce(`+`, .SD)/ncol(.SD), .SDcols = !'grp'], {
  i.A; i.B; i.C; i.X
  lapply(names(.SD), function(j) 
    which.max(eval(as.name(j)) < .5 * eval(as.name(paste0('i.', j)))))
}, on = 'grp', by = .EACHI, .SDcols = !'grp']
#        grp V1 V2 V3 V4
#    1:    1  3 30  1  4
#    2:    2  6 15  4 10
#    3:    3  2  5  7  2
#    4:    4  8 16  5  8
#    5:    5 10  3  1  7
#   ---                 
#  996:  996 11  5  3 13
#  997:  997  3  3  3 11
#  998:  998 14 21  2 10
#  999:  999 18  2  1 41
# 1000: 1000  8  7  3  3

Essentially, you are creating a look-up table of the relevant caps (the per-group maxima, plus their mean for X) and joining it back to the original table.
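The look-up-and-join idea can be sketched on a toy table. This is a simplified variant rather than the code above: it materialises the joined table instead of using by = .EACHI, it spells out each column by hand, and it uses match(TRUE, ...) as in the question. The data and the names cap_A, cap_B, cap_X, joined and first_below are made up for illustration.

```r
library(data.table)

# Two small groups with two ordinary columns and the special column X.
DT <- data.table(
  grp = rep(1:2, each = 3),
  A = c(8, 3, 6, 2, 9, 4),
  B = c(4, 10, 1, 6, 3, 8),
  X = c(7, 5, 3, 1, 6, 2)
)

# Step 1: one row of caps per group; X's cap is the mean of the other caps.
lookup <- DT[, .(cap_A = max(A), cap_B = max(B)), by = grp]
lookup[, cap_X := (cap_A + cap_B) / 2]

# Step 2: join the caps back, so every row carries its group's caps.
joined <- lookup[DT, on = "grp"]

# Step 3: per group, index of the first value below half of its cap.
first_below <- joined[, .(
  A = match(TRUE, A < 0.5 * cap_A),
  B = match(TRUE, B < 0.5 * cap_B),
  X = match(TRUE, X < 0.5 * cap_X)
), by = grp]

first_below
```

For real data with many columns, writing each column out this way does not scale, which is why the code above loops over names(.SD) instead.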

You could separate the two steps by writing:

lookup = 
  DT[ , lapply(.SD, max), keyby = grp, .SDcols = !'X'
     ][ , X := Reduce(`+`, .SD)/ncol(.SD), .SDcols = !'grp']
DT[lookup, on = 'grp', {
  i.A; i.B; i.C; i.X
  lapply(names(.SD), function(j) 
    which.max(eval(as.name(j)) < .5 * eval(as.name(paste0('i.', j)))))
}, by = .EACHI, .SDcols = !'grp']

Once it's separated, you also gain the flexibility of using get (which in my experience is slower than eval(as.name())):

DT[lookup, on = 'grp', {
  lapply(names(.SD), function(j) 
    which.max(eval(as.name(j)) < .5 * get(paste0('i.', j))))
}, by = .EACHI, .SDcols = !'grp']
#        grp V1 V2 V3 V4
#    1:    1  1  5 26  3
#    2:    2  6  7  3  4
#    3:    3  2  6  1 13
#    4:    4  5  2 12  5
#    5:    5  9 12  2  4
#   ---                 
#  996:  996  1  3  4  1
#  997:  997  1  6  3 13
#  998:  998 10 13  9  8
#  999:  999  2  4 13  4
# 1000: 1000  7 30 19 14
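One caveat worth checking on your own data: match(TRUE, cond), as used in the question, and which.max(cond), as used here, agree only when at least one element satisfies the condition. A small base R sketch with made-up logical vectors shows the difference.

```r
# At least one TRUE: both approaches return the first TRUE position.
some_below <- c(FALSE, TRUE, TRUE)
match(TRUE, some_below)   # 2
which.max(some_below)     # 2

# No TRUE at all: match() signals "not found" with NA,
# while which.max() returns 1 because all values tie at the maximum.
none_below <- c(FALSE, FALSE, FALSE)
match(TRUE, none_below)   # NA
which.max(none_below)     # 1
```

If a group may contain no value below the threshold, swapping which.max(...) for match(TRUE, ...) in the lapply call should keep the NA behaviour of the original code.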

Answer 1: 2 votes, accepted, 2019-04-08 11:16:50
