Perform cumulative group operations with R and dplyr

I'm trying to process data based on a sequential group id. There are J groups, and for each j = 1..J I want to run the data processing function on all groups i <= j.

The most trivial case is when each row is its own group and you calculate the cumulative sum. However, I have multiple rows in each group and the processing is more complicated than summation.

Here is a minimal example of my data format:

row | group | value
----|-------|------
  1 |     1 |  2065
  2 |     1 |  2075
  3 |     2 | 18008
  4 |     2 | 17655
  : |     : |     :
N-1 |   J-1 |  2345
  N |     J |  5432

One solution I've thought of is to replicate my data, stack the copies, and reassign the groups in each copy so that every group i <= j is relabelled j. This would result in a very long data frame like this:

row | group | value
----|-------|------
  1 |     1 |  2065
  2 |     1 |  2075
  3 |     2 |  2065
  4 |     2 |  2075
  5 |     2 | 18008
  6 |     2 | 17655
  : |     : |     :

However, this seems tedious and inefficient, as my data will be copied many times.

Does anyone know of a more efficient way of processing the data in a cumulative group-by fashion?

Here are three examples, one with aggregate, one with data.table, and the last one with dplyr, as you asked.

First create the data

library(data.table)
library(dplyr)

group <- c(1,1,2,2,3)
value <- c(2065, 2075, 18008, 17655, 561)

With data.table you can use this

dat <- data.table(group, value)
recap <- dat[, list(somma = sum(value)), by = group]

With aggregate from the stats package

dat <- data.frame(group, value)
aggregate(dat$value, by=list(Group=dat$group), FUN=sum)

Then with dplyr

dat %>%
    group_by(group) %>%
    summarise(result = sum(value))

These will give you

group | result
------|-------
    1 |   4140
    2 |  35663
    3 |    561
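Note that these are per-group totals, not cumulative ones. If the processing really is a plain sum, the running total across groups can be derived from them with cumsum, without copying any rows. Below is a minimal sketch in base R, assuming the groups should be accumulated in increasing numeric order; the column name cum_result is made up for the example.

```r
group <- c(1, 1, 2, 2, 3)
value <- c(2065, 2075, 18008, 17655, 561)
dat <- data.frame(group, value)

# per-group totals; aggregate() returns the groups in sorted order
totals <- aggregate(value ~ group, data = dat, FUN = sum)

# running total over groups 1..j (cum_result is a made-up column name)
totals$cum_result <- cumsum(totals$value)
totals
```

This shortcut only applies to operations that can be updated incrementally, such as sums or counts; more complicated processing needs one of the approaches that rebuild the accumulated data.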

One methodology that should work here is to split the data.frame by group id, and then run a for loop (or lapply) over the cumulative groups. Below is an example using a for loop, as I think it will be more straightforward to implement.

# split data.frame by group ID
myList <- split(df, df$group)
# initialize empty output list
myOutputList <- list()

# loop over the group IDs, adding one more group on each iteration
for(i in seq_along(unique(df$group))) {
  # create temporary df holding the first i groups
  myTempDf <- do.call(rbind, myList[seq_len(i)])

  ## perform analysis on myTempDf here ##

  # save results (placeholder: put a list of your analysis output here)
  myOutputList[[i]] <- list()
}

The output would be a nested list. I'd recommend naming each item in the nested list to make it easier to access, like myOutputList[[i]][["regression.1"]].

Note that this assumes that the groups are sorted in the original data.frame and that the group ids are the counting numbers 1,2,3,4,... as in your example.
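For comparison, the lapply variant mentioned above might look like the sketch below. The toy data and the analyse function are placeholders invented for the example; the real processing would go in their place. Filtering on the sorted group ids, rather than indexing the split list, means this version only needs the ids to be orderable, not to be exactly 1,2,3,...

```r
# toy data in the same shape as the question
df <- data.frame(group = c(1, 1, 2, 2, 3),
                 value = c(2065, 2075, 18008, 17655, 561))

# hypothetical analysis function: replace with the real processing
analyse <- function(d) list(n = nrow(d), mean.value = mean(d$value))

# sorted group ids, so the accumulation does not depend on row order
ids <- sort(unique(df$group))

# for each j, analyse all rows whose group id is <= j
myOutputList <- lapply(ids, function(j) analyse(df[df$group <= j, ]))

# name the results by group id for easier access
names(myOutputList) <- ids
```

Results can then be accessed by group id, for example myOutputList[["2"]]$n.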

Here are several approaches:

1) sqldf This is being transferred from the comments. I had originally put it there since it is not a dplyr solution, but it seems you are considering others. We join the unique group values with the data frame on the indicated condition. A single SQL statement will do it:

DF <- data.frame(group = c(1, 1, 2, 2), value = 1:4) # test data

library(sqldf)
outDF <- sqldf("select a.[group], b.value 
                from 
                     (select distinct [group] from DF) a 
                     join DF b on a.[group] >= b.[group]")

giving:

> outDF
  group value
1     1     1
2     1     2
3     2     1
4     2     2
5     2     3
6     2     4

and now we can process over the groups. Depending on what fun looks like, one of these might do it:

aggregate(value ~ group, outDF, fun)

tapply(outDF$value, outDF$group, fun)

by(outDF, outDF$group, fun)

ave(outDF$value, outDF$group, FUN = fun)
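As a concrete illustration, here is a sketch with fun taken to be median, which is my choice for the example and not something from the question. To keep it free of package dependencies, outDF is rebuilt here in base R instead of with sqldf; it should contain the same six rows as shown above.

```r
DF <- data.frame(group = c(1, 1, 2, 2), value = 1:4) # test data

# for each group j, take all rows with group <= j and relabel them as j
outDF <- do.call(rbind, lapply(sort(unique(DF$group)), function(j) {
  transform(DF[DF$group <= j, ], group = j)
}))

fun <- median  # example choice, swap in the real processing

# named vector with one entry per group
res_tapply <- tapply(outDF$value, outDF$group, fun)

# data frame with one row per group
res_agg <- aggregate(value ~ group, outDF, fun)
```

tapply and aggregate suit functions that return a single value per group, ave suits functions whose result should be repeated for every row, and by is the more general choice when fun needs the whole data frame.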

If the operation were sum, say, then rather than running a separate aggregation it could be combined with the above like this:

sqldf("select a.[group], sum(b.value) cumsum
       from (select distinct [group] from DF) a join DF b on a.[group] >= b.[group] 
       group by a.[group]")

giving:

  group cumsum
1     1      3
2     2     10

Note that

  • group is an SQL keyword, which is why we escaped it using [group]

  • we have assumed that it is desired to accumulate groups that are numerically less than or equal to the current group, which is the case in the example in the question. If a different order were desired we could create another grouping variable whose ordering reflected that.

2) base This does not use any packages. We have assumed that it is desired to accumulate the current group and the groups that appear before it in the split, so that groups are accumulated in numerical order; however, if we want a different order we could make group into a factor and order the levels as desired, since split output will be in the order of the grouping factor's levels.

L <- Reduce(rbind, split(DF, DF$group), acc = TRUE)
do.call("rbind", lapply(L, transform, group = tail(group, 1)))

giving:

  group value
1     1     1
2     1     2
3     2     1
4     2     2
5     2     3
6     2     4
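To illustrate the remark about ordering, here is a small sketch in which the groups are accumulated in a custom order by turning group into a factor. The data frame DF2 and the order b, a, c are made up for the example.

```r
# made-up data with non-numeric group labels
DF2 <- data.frame(group = c("a", "a", "b", "c"), value = 1:4)

# the level order defines the accumulation order: b first, then a, then c
DF2$group <- factor(DF2$group, levels = c("b", "a", "c"))

# split() returns the pieces in level order; Reduce() stacks them cumulatively
L2 <- Reduce(rbind, split(DF2, DF2$group), accumulate = TRUE)

# number of rows accumulated at each step
sapply(L2, nrow)
```

Each element of L2 can then be processed in the same way as the elements of L above.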

3) magrittr (2) can be rewritten using magrittr like this:

library(magrittr)

DF %>%
  split(.$group) %>%
  Reduce(f = rbind, acc = TRUE) %>%
  lapply(transform, group = tail(group, 1)) %>%
  do.call(what = "rbind")

giving the same result as in (2).
