
vectorize cumsum by factor in R

I am trying to create a column in a very large data frame (~2.2 million rows) that calculates the cumulative sum of 1's for each factor level, and resets when a new factor level is reached. Below is some basic data that resembles my own.

itemcode <- c('a1', 'a1', 'a1', 'a1', 'a1', 'a2', 'a2', 'a3', 'a4', 'a4', 'a5', 'a6', 'a6', 'a6', 'a6')
goodp <- c(0, 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1)
df <- data.frame(itemcode, goodp)

I would like the output variable, cum.goodp, to look like this:

cum.goodp <- c(0, 1, 2, 0, 1, 1, 2, 0, 0, 1, 1, 1, 2, 0, 1)

I know there is a lot out there using the canonical split-apply-combine approach, which is conceptually intuitive, but I tried the following:

k <- transform(df, cum.goodp = goodp * ave(goodp, c(0L, cumsum(diff(goodp != 0))), FUN = seq_along, by = itemcode))

When I try to run this code it's very, very slow. I get that transform is part of the reason why (the 'by' doesn't help either). There are over 70K different values for the itemcode variable, so it should probably be vectorized. Is there a way to vectorize this, using cumsum? If not, any help whatsoever would be truly appreciated. Thanks so much.

A base R approach is to calculate cumsum over the whole vector, and capture the geometry of the sub-lists using run-length encoding. Figure out the start of each group, and create new groups

start <- c(TRUE, itemcode[-1] != itemcode[-length(itemcode)]) | !goodp
f <- cumsum(start)
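For the sample data above, these two lines produce the following grouping (each distinct value of f marks one run over which the cumulative sum should count up):

```r
itemcode <- c('a1', 'a1', 'a1', 'a1', 'a1', 'a2', 'a2', 'a3', 'a4', 'a4',
              'a5', 'a6', 'a6', 'a6', 'a6')
goodp <- c(0, 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1)

# TRUE at the first element, wherever itemcode changes, and wherever goodp is 0
start <- c(TRUE, itemcode[-1] != itemcode[-length(itemcode)]) | !goodp
f <- cumsum(start)
f
# [1] 1 1 1 2 2 3 3 4 5 5 6 7 7 8 8
```

Note that a 0 opens a new group even in the middle of an itemcode run (positions 4 and 14), which is what makes the count reset.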

Summarize these as a run-length encoding, and calculate the overall sum

r <- rle(f)
x <- cumsum(goodp)

Then use the geometry to get the offset that each embedded sum needs to be corrected by

offset <- c(0, x[cumsum(r$lengths)])

and calculate the updated value

x - rep(offset[-length(offset)], r$lengths)
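On the sample data, the intermediate values look like this (a quick check that the offsets line up with the group boundaries):

```r
itemcode <- c('a1', 'a1', 'a1', 'a1', 'a1', 'a2', 'a2', 'a3', 'a4', 'a4',
              'a5', 'a6', 'a6', 'a6', 'a6')
goodp <- c(0, 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1)

start <- c(TRUE, itemcode[-1] != itemcode[-length(itemcode)]) | !goodp
r <- rle(cumsum(start))
x <- cumsum(goodp)

# cumulative total at the end of each group, shifted right by one group
offset <- c(0, x[cumsum(r$lengths)])
offset
# [1]  0  2  3  5  5  6  7  9 10

x - rep(offset[-length(offset)], r$lengths)
# [1] 0 1 2 0 1 1 2 0 0 1 1 1 2 0 1
```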

Here's a function

cumsumByGroup <- function(x, f) {
    start <- c(TRUE, f[-1] != f[-length(f)]) | !x
    r <- rle(cumsum(start))
    x <- cumsum(x)
    offset <- c(0, x[cumsum(r$lengths)])
    x - rep(offset[-length(offset)], r$lengths)
}

Here's the result applied to the sample data

> cumsumByGroup(goodp, itemcode)
 [1] 0 1 2 0 1 1 2 0 0 1 1 1 2 0 1

and its performance

> n <- 1 + rpois(1000000, 1)
> goodp <- sample(c(0, 1), sum(n), TRUE)
> itemcode <- rep(seq_along(n), n)
> system.time(cumsumByGroup(goodp, itemcode))
   user  system elapsed 
   0.55    0.00    0.55 

The dplyr solution takes about 70s.

@alexis_laz's solution is both elegant and about 2 times faster than mine

cumsumByGroup1 <- function(x, f) {
    start <- c(TRUE, f[-1] != f[-length(f)]) | !x
    cs = cumsum(x)
    cs - cummax((cs - x) * start)
}
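To see why this works: (cs - x) * start records the running total just before each reset point, and cummax() carries that baseline forward through the rest of the run, so subtracting it from cs restarts the count at every group boundary. (This relies on goodp being non-negative, so the pre-reset totals are nondecreasing.) A quick check on the sample data:

```r
itemcode <- c('a1', 'a1', 'a1', 'a1', 'a1', 'a2', 'a2', 'a3', 'a4', 'a4',
              'a5', 'a6', 'a6', 'a6', 'a6')
goodp <- c(0, 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1)

start <- c(TRUE, itemcode[-1] != itemcode[-length(itemcode)]) | !goodp
cs <- cumsum(goodp)

# running total just before each group starts, carried forward within the group
baseline <- cummax((cs - goodp) * start)
baseline
# [1] 0 0 0 2 2 3 3 5 5 5 6 7 7 9 9
cs - baseline
# [1] 0 1 2 0 1 1 2 0 0 1 1 1 2 0 1
```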

With the modified example input/output you could use the following base R approach (among others):

transform(df, cum.goodpX = ave(goodp, itemcode, cumsum(goodp == 0), FUN = cumsum))
#   itemcode goodp cum.goodp cum.goodpX
#1        a1     0         0          0
#2        a1     1         1          1
#3        a1     1         2          2
#4        a1     0         0          0
#5        a1     1         1          1
#6        a2     1         1          1
#7        a2     1         2          2
#8        a3     0         0          0
#9        a4     0         0          0
#10       a4     1         1          1
#11       a5     1         1          1
#12       a6     1         1          1
#13       a6     1         2          2
#14       a6     0         0          0
#15       a6     1         1          1

Note: I added column cum.goodp to the input df and created a new column cum.goodpX so you can easily compare the two.

But of course you can use many other approaches with packages, either what @MartinMorgan suggested or, for example, dplyr or data.table, to name just two options. Those may be a lot faster than base R approaches for large data sets.

Here's how it would be done in dplyr:

library(dplyr)
df %>% 
   group_by(itemcode, grp = cumsum(goodp == 0)) %>% 
   mutate(cum.goodpX = cumsum(goodp))

A data.table option was already provided in the comments to your question.
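For completeness, here is a data.table sketch along the same lines as the dplyr version above (this mirrors the grp construction; the exact code suggested in the comments may differ):

```r
library(data.table)

itemcode <- c('a1', 'a1', 'a1', 'a1', 'a1', 'a2', 'a2', 'a3', 'a4', 'a4',
              'a5', 'a6', 'a6', 'a6', 'a6')
goodp <- c(0, 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1)
dt <- data.table(itemcode, goodp)

# group by itemcode and by the running count of zeros, then cumsum within each group
dt[, cum.goodpX := cumsum(goodp), by = .(itemcode, grp = cumsum(goodp == 0))]
dt$cum.goodpX
# [1] 0 1 2 0 1 1 2 0 0 1 1 1 2 0 1
```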
