Perform cumulative group operations with R and dplyr
I'm trying to process data based on a sequential group id. There are J groups, and for each j = 1..J I want to run the data processing function on group j together with all earlier groups i < j.
The most trivial case is when each row is its own group and you calculate the cumulative sum. However, I have multiple rows in each group and the processing is more complicated than summation.
Here is a minimal example of my data format:
row | group | value
----|-------|------
1 | 1 | 2065
2 | 1 | 2075
3 | 2 | 18008
4 | 2 | 17655
: | : | :
N-1 | J-1 | 2345
N | J | 5432
One solution I've thought of is to replicate my data, stacking the copies and reassigning the groups in each copy so that the rows of every group i < j are also assigned to group j. This would result in a very long data frame like this:
row | group | value
----|-------|------
1 | 1 | 2065
2 | 1 | 2075
3 | 2 | 2065
4 | 2 | 2075
5 | 2 | 18008
6 | 2 | 17655
: | : | :
However, this seems tedious and inefficient, as my data will be copied many times.
Does anyone know of a more efficient way of processing the data in a cumulative group-by fashion?
Here are three examples, one with aggregate, one with data.table and the last one with dplyr, as you asked.
First create the data frame:
library(data.table)
library(dplyr)
group <- c(1,1,2,2,3)
value <- c(2065, 2075, 18008, 17655, 561)
With data.table you can use this:
dat <- data.table(group, value)
recap <- dat[, list(somma = sum(value)), by = group]
With aggregate from the stats package:
dat <- data.frame(group, value)
aggregate(dat$value, by=list(Group=dat$group), FUN=sum)
Then with dplyr:
dat %>%
group_by(group) %>%
summarise(result = sum(value))
These will give you:
group | result
---------------
1 | 4140
2 | 35663
3 | 561
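If the per-group operation really is a sum, the cumulative total across groups follows from these per-group results with cumsum. The following is only a minimal base R sketch, assuming the same group and value vectors as above and that groups should accumulate in increasing numeric order; the names per_group and cumulative are illustrative and not part of the original answer.

```r
group <- c(1, 1, 2, 2, 3)
value <- c(2065, 2075, 18008, 17655, 561)
dat <- data.frame(group, value)

# per-group totals; aggregate() returns the rows ordered by group
per_group <- aggregate(value ~ group, data = dat, FUN = sum)

# running total over groups 1..j (assumes increasing numeric group order)
per_group$cumulative <- cumsum(per_group$value)
per_group
```

This shortcut only applies to operations like sum, where the result for groups 1..j can be built from the per-group results; a more complicated analysis needs the rows themselves.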
One methodology that should work here is to split the data.frame by group id, and then run a for loop (or lapply) with the cumulative groups. Below is an example using a for loop, as I think it will be more straightforward to implement.
# split data.frame by group ID (df is your data.frame with a group column)
myList <- split(df, df$group)
# initialize empty output list
myOutputList <- list()
# loop through group IDs, adding the next group on each iteration
for (i in seq_along(myList)) {
  # create temporary df holding groups 1..i for analysis
  myTempDf <- do.call(rbind, myList[seq_len(i)])
  ## perform analysis on myTempDf here ##
  # save results; the row count is only a placeholder,
  # replace it with a named list of your analysis output
  myOutputList[[i]] <- list(n.rows = nrow(myTempDf))
}
The output would be a nested list. I'd recommend naming each item in the nested list to make it easier to access, like myOutputList[[i]][["regression.1"]].
Note that this assumes that the groups are properly sorted in the original data.frame and that the group ids are the counting numbers 1, 2, 3, 4, ... as in your example.
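The lapply alternative mentioned above could look roughly like this. It is a sketch under stated assumptions: df_example is a hypothetical stand-in for your data, and the row count and mean are placeholder analyses to be replaced with your own processing.

```r
# small test data; df_example is a hypothetical stand-in for your data
df_example <- data.frame(group = c(1, 1, 2, 2, 3),
                         value = c(2065, 2075, 18008, 17655, 561))

# one data.frame per group, in sorted group order
pieces <- split(df_example, df_example$group)

# for each i, bind groups 1..i and run the analysis on the combined rows
results <- lapply(seq_along(pieces), function(i) {
  cumulative_df <- do.call(rbind, pieces[seq_len(i)])
  # placeholder analysis: replace with your own processing
  list(n_rows = nrow(cumulative_df),
       mean_value = mean(cumulative_df$value))
})

# name the results by group id so they can be accessed by name
names(results) <- names(pieces)
results[["2"]]$mean_value
```

Naming the list by group id means the result for a cumulative group can be looked up directly, and lapply avoids having to pre-allocate and index the output list by hand.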
Here are several approaches:
1) sqldf This is being transferred from the comments. I had originally put it there since it is not a dplyr solution, but it seems you are considering others. We join the unique group values with the data frame on the indicated condition. A single SQL statement will do it:
DF <- data.frame(group = c(1, 1, 2, 2), value = 1:4) # test data
library(sqldf)
outDF <- sqldf("select a.[group], b.value
from
(select distinct [group] from DF) a
join DF b on a.[group] >= b.[group]")
giving:
> outDF
group value
1 1 1
2 1 2
3 2 1
4 2 2
5 2 3
6 2 4
and now we can process over the groups. Depending on what fun looks like, one of these might do it:
aggregate(value ~ group, outDF, fun)
tapply(outDF$value, outDF$group, fun)
by(outDF, outDF$group, fun)
ave(outDF$value, outDF$group, FUN = fun)
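As a concrete illustration, the sketch below takes fun to be mean, which is an arbitrary choice made only for this example, and rebuilds the outDF shown above by hand so that the snippet does not depend on sqldf.

```r
# the stacked data frame from above, typed in directly
outDF <- data.frame(group = c(1, 1, 2, 2, 2, 2),
                    value = c(1, 2, 1, 2, 3, 4))

# assumption for illustration only: the analysis is a simple mean
fun <- mean

# data frame with one row per cumulative group
agg <- aggregate(value ~ group, outDF, fun)

# the same numbers as a vector named by group
tap <- tapply(outDF$value, outDF$group, fun)

agg
tap
```

aggregate and tapply suit functions that reduce a vector to a single value, while by passes the whole data frame for each group to the function, which fits analyses that need several columns.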
If the operation were sum, say, then rather than performing a separate aggregation it could be combined with the above like this:
sqldf("select a.[group], sum(b.value) cumsum
from (select distinct [group] from DF) a join DF b on a.[group] >= b.[group]
group by a.[group]")
giving:
group cumsum
1 1 3
2 2 10
Note that group is an SQL keyword, which is why we escaped it using [group]. We have also assumed that it is desired to accumulate groups that are numerically less than or equal to the current group, which is the case in the example in the question. If a different order were desired, we could create another grouping variable whose ordering reflected it.
2) base This does not use any packages. We have assumed that it is desired to accumulate the current group and the groups that appear prior to it in the split, so that groups are accumulated in numerical order; however, if we want a different order we could make group into a factor and order the levels as desired, since split output will be in the order of the grouping factor's levels.
L <- Reduce(rbind, split(DF, DF$group), accumulate = TRUE)
do.call("rbind", lapply(L, transform, group = tail(group, 1)))
giving:
group value
1 1 1
2 1 2
3 2 1
4 2 2
5 2 3
6 2 4
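The remark about factor levels can be sketched as follows. Purely for illustration, the desired order is assumed to be group 2 first and then group 1; DF2, pieces, acc and running_sums are hypothetical names, and the sum is a placeholder for the real analysis.

```r
DF2 <- data.frame(group = c(1, 1, 2, 2), value = 1:4)

# make group a factor whose levels are in the desired accumulation order
DF2$group <- factor(DF2$group, levels = c(2, 1))

# split() returns the pieces in the order of the factor levels
pieces <- split(DF2, DF2$group)
names(pieces)

# accumulate the pieces in that order: group 2 alone, then groups 2 and 1
acc <- Reduce(rbind, pieces, accumulate = TRUE)

# placeholder analysis: sum of value over each accumulated data frame
running_sums <- sapply(acc, function(d) sum(d$value))
running_sums
```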
3) magrittr (2) can be rewritten using magrittr like this:
library(magrittr)
DF %>%
split(.$group) %>%
Reduce(f = rbind, accumulate = TRUE) %>%
lapply(transform, group = tail(group, 1)) %>%
do.call(what = "rbind")
giving the same result as in (2).