Cumulative summing between groups using dplyr

I have a tibble structured as follows:

   day  theta
1   1    2.1
2   1    2.1
3   2    3.2
4   2    3.2
5   5    9.5
6   5    9.5
7   5    9.5

Note that the tibble contains multiple rows for each day, and for each day the same value of theta is repeated an arbitrary number of times. (The tibble contains other arbitrary columns that necessitate this repeating structure.)

I'd like to use dplyr to cumulatively sum the values of theta across days such that, in the example above, 2.1 is added only once to 3.2, and so on. The tibble would be mutated to append the new cumulative sum (c.theta) as follows:

   day  theta  c.theta
1   1    2.1     2.1
2   1    2.1     2.1
3   2    3.2     5.3
4   2    3.2     5.3
5   5    9.5     14.8
6   5    9.5     14.8
7   5    9.5     14.8 
...

My initial efforts to group_by day and then cumsum over theta resulted only in cumulative summing over the full set of data (e.g., 2.1 + 2.1 + 3.2 ...), which is not what I want. In my Stack Overflow searches I found many examples of cumulative summing within groups, but none of summing between groups as described above. Nudges in the right direction would be much appreciated.

Working in dplyr, I came up with a solution very similar to PoGibas's: use distinct to keep just one row per day, compute the cumulative sum, and join it back in:

library(dplyr)

df = read.table(text="day  theta
1   1    2.1
2   1    2.1
3   2    3.2
4   2    3.2
5   5    9.5
6   5    9.5
7   5    9.5", header = TRUE)

cumsums = df %>%
    distinct(day, theta) %>%
    mutate(ctheta = cumsum(theta))

df %>%
    left_join(cumsums %>% select(day, ctheta), by = 'day')
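
As a variation on the idea above, the join can be avoided by letting only the first row of each day contribute to the running total. This is a minimal sketch, not part of the original answer; it assumes that all rows for a given day are contiguous (as in the example data), and the column name c.theta is taken from the question.

```r
library(dplyr)

df <- data.frame(
  day   = c(1, 1, 2, 2, 5, 5, 5),
  theta = c(2.1, 2.1, 3.2, 3.2, 9.5, 9.5, 9.5)
)

# !duplicated(day) is TRUE only on the first row of each day, so repeated
# rows contribute 0 to the cumulative sum.
# Assumption: rows belonging to the same day are adjacent.
result <- df %>%
  mutate(c.theta = cumsum(theta * !duplicated(day)))

print(result)
```

If the rows for a day can be scattered through the table, the distinct-and-join approach is the safer choice, since it does not depend on row order within the full table.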

Not dplyr, but an alternative data.table solution:

library(data.table)
# Original table is called d
setDT(d)
merge(d, unique(d)[, .(c.theta = cumsum(theta), day)])

   day theta c.theta
1:   1   2.1     2.1
2:   1   2.1     2.1
3:   2   3.2     5.3
4:   2   3.2     5.3
5:   5   9.5    14.8
6:   5   9.5    14.8
7:   5   9.5    14.8

PS: If the table has other columns that you want to preserve, replace unique(d) with unique(d[, .(day, theta)]) so that duplicates are dropped on day and theta only.
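
To illustrate the PS, here is a small sketch with one extra column. The column name other and its values are hypothetical, added only to show why the deduplication has to be restricted to day and theta.

```r
library(data.table)

d <- data.table(
  day   = c(1, 1, 2, 2, 5, 5, 5),
  theta = c(2.1, 2.1, 3.2, 3.2, 9.5, 9.5, 9.5),
  other = letters[1:7]  # hypothetical extra column; makes every row unique
)

# Deduplicate on day and theta only, then take the running total per day.
totals <- unique(d[, .(day, theta)])[, .(day, c.theta = cumsum(theta))]

# Join the per-day totals back onto the full table.
res <- merge(d, totals, by = "day")
print(res)
```

With the extra column present, unique(d) would keep all seven rows, so each repeated theta would be counted again in the cumulative sum.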

In base R you could use split<- and tapply to return the desired result.

# construct 0 vector to fill in
dat$temp <- 0
# fill in with cumulative sum for each day
split(dat$temp, dat$day) <- cumsum(tapply(dat$theta, dat$day, head, 1))

Here, tapply returns the first element of theta for each day, which is fed to cumsum. The elements of the cumulative sum are then assigned to the rows of each day using split<-.
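
The same steps can be spelled out one at a time to make the intermediate values visible. This is a sketch under the assumption that dat holds the example data from the question; it uses match in place of split<- for the final assignment, which is a substitution made for readability rather than the method shown above.

```r
dat <- data.frame(
  day   = c(1, 1, 2, 2, 5, 5, 5),
  theta = c(2.1, 2.1, 3.2, 3.2, 9.5, 9.5, 9.5)
)

# Step 1: one theta per day (tapply orders its result by sorted day).
days        <- sort(unique(dat$day))
first_theta <- as.vector(tapply(dat$theta, dat$day, head, 1))

# Step 2: running total across days.
running <- cumsum(first_theta)

# Step 3: look up each row's day and copy that day's total to the row.
dat$c.theta <- running[match(dat$day, days)]
print(dat)
```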

This returns

dat
  day theta temp
1   1   2.1  2.1
2   1   2.1  2.1
3   2   3.2  5.3
4   2   3.2  5.3
5   5   9.5 14.8
6   5   9.5 14.8
7   5   9.5 14.8

