
efficiently calculate annual cumulative sum

I have a dataset with quarterly transactions. PERIOD represents the quarter of the transaction and INCREM represents the incremental amounts.

tbl <- data.frame(PERIOD = c(2,3,6,10,11),
                  INCREM = c(10,50,-30,-10,-20))

I want to get annual cumulative sums (so the cumulative sum at periods 4, 8, 12).

library(dplyr)
library(tidyr)

tbl %>%
  mutate(CUMSUM = cumsum(INCREM)) %>%
  select(-INCREM) %>%
  mutate(PERIOD = factor(PERIOD, 1:12)) %>%
  complete(PERIOD) %>%
  fill(CUMSUM) %>%
  mutate(PERIOD = as.numeric(PERIOD)) %>%
  filter(PERIOD %% 4 == 0)

Result:

  PERIOD CUMSUM
1      4     60
2      8     30
3     12      0

This works, but it's not very efficient. The original dataset has 5 rows and the final dataset has 3 rows, but in the middle of the dplyr chain (after complete() and fill()) the dataset has 12 rows.

Is there a more efficient way to get the annual cumulative sums?

Also, my actual data is coming from a database query. Do you think it would be better for me to take care of this cumulative summing in the SQL query before manipulating in R?

cut is definitely the way to go. You can also just calculate the cumulative sum and then keep the last row of each annual group. This avoids the aggregate step.

# breaks start at 0 so that PERIOD 1 falls in (0,4]; with a first break of 1 it would become NA
tbl$prd <- cut(tbl$PERIOD, c(0,4,8,Inf), labels=c(4,8,12))
tbl$cumsum <- cumsum(tbl$INCREM)
tbl[!duplicated(tbl$prd, fromLast=TRUE),c("prd","cumsum")]
#   prd cumsum
# 2   4     60
# 3   8     30
# 5  12      0
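
The hard-coded breaks cover exactly three years, and a year with no transactions would get no row at all. A minimal sketch of a variant that handles both, assuming PERIOD is sorted in ascending order and that the reporting periods are every fourth period up to the year containing the last transaction (the names running, year_ends, idx and annual are illustrative, not from the original):

```r
tbl <- data.frame(PERIOD = c(2, 3, 6, 10, 11),
                  INCREM = c(10, 50, -30, -10, -20))

# Running total at each transaction (PERIOD is assumed sorted ascending)
running <- cumsum(tbl$INCREM)

# Year-end periods to report: 4, 8, ... up to the year containing the last PERIOD
year_ends <- seq(4, 4 * ceiling(max(tbl$PERIOD) / 4), by = 4)

# For each year end, the index of the last transaction at or before it (0 if none)
idx <- findInterval(year_ends, tbl$PERIOD)

# Prepend 0 so that idx == 0 (no transactions yet) maps to a cumulative sum of 0
annual <- data.frame(PERIOD = year_ends,
                     CUMSUM = c(0, running)[idx + 1])
annual
#   PERIOD CUMSUM
# 1      4     60
# 2      8     30
# 3     12      0
```

The intermediate objects never grow beyond the number of input rows plus the number of years, so nothing is expanded to one row per quarter.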

As @thelatemail suggested, you can use cut to create the groups, sum the values within each group, and finally take the cumsum over the group totals.

library(dplyr)
tbl %>%
  group_by(quarter = cut(PERIOD, c(0,4,8,Inf), labels=c(4,8,12))) %>%
  summarise(CUMSUM = sum(INCREM)) %>%
  ungroup() %>%
  mutate(CUMSUM = cumsum(CUMSUM))

#  quarter CUMSUM
#  <fct>    <dbl>
#1   4       60
#2   8       30
#3  12        0

Using the same logic, an overly complicated base R approach that fits in one expression is

transform(aggregate(INCREM~PERIOD, 
  transform(tbl, PERIOD = cut(PERIOD, c(0,4,8,Inf), labels=c(4,8,12))), sum), 
    INCREM = cumsum(INCREM))


#  PERIOD INCREM
#1      4     60
#2      8     30
#3     12      0

which is equivalent to these three steps:

# breaks start at 0 so that PERIOD 1 is not turned into NA
tbl$PERIOD <- cut(tbl$PERIOD, c(0,4,8,Inf), labels=c(4,8,12))
tbl1 <- aggregate(INCREM~PERIOD, tbl, sum)
tbl1$INCREM <- cumsum(tbl1$INCREM)
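
The same sum-then-cumsum idea can be written without hard-coding the breaks. This is only a sketch, assuming a year always spans four periods; year_end and annual are illustrative names, not from the original. Like the aggregate version, it returns no entry for a year that has no transactions.

```r
tbl <- data.frame(PERIOD = c(2, 3, 6, 10, 11),
                  INCREM = c(10, 50, -30, -10, -20))

# Map each quarter to the last period of its year: 1-4 -> 4, 5-8 -> 8, ...
year_end <- 4 * ceiling(tbl$PERIOD / 4)

# Sum the increments per year, then accumulate across years
annual <- cumsum(tapply(tbl$INCREM, year_end, sum))
annual
# named vector: 60, 30, 0 for periods "4", "8", "12"
```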
