How to group and summarise by two variables

I'm having trouble using group_by() on multiple columns. An example dataset is the following:

dput(test)
structure(list(timestamp = structure(c(1506676980, 1506676980, 
1506676980, 1506677040, 1506677280, 1506677340, 1506677460), class = c("POSIXct", 
"POSIXt"), tzone = "UTC"), plusminus = c(-1, 1, 1, 1, 1, 1, -1
), AP = structure(c(1L, 2L, 2L, 2L, 2L, 1L, 2L), .Label = c("A", 
"B"), class = "factor")), .Names = c("timestamp", "plusminus", 
"AP"), row.names = c(NA, -7L), class = "data.frame")

It looks as follows:

            timestamp plusminus AP
1 2017-09-29 09:23:00        -1  A
2 2017-09-29 09:23:00         1  B
3 2017-09-29 09:23:00         1  B
4 2017-09-29 09:24:00         1  B
5 2017-09-29 09:28:00         1  B
6 2017-09-29 09:29:00         1  A
7 2017-09-29 09:31:00        -1  B

I would like to do the following:

  1. compute a running total for each level of the 'AP' variable
  2. aggregate, for each minute, the maximum value of that running total.

In other words, I want this output:

            timestamp total AP
1 2017-09-29 09:23:00    -1  A
2 2017-09-29 09:23:00     2  B
3 2017-09-29 09:24:00     3  B
4 2017-09-29 09:28:00     4  B
5 2017-09-29 09:29:00     0  A
6 2017-09-29 09:31:00     3  B

It's easy to do part 1 via:

library(dplyr)

test %>% group_by(AP) %>% mutate(total = cumsum(plusminus))

which gives:

# A tibble: 7 x 4
# Groups:   AP [2]
            timestamp plusminus     AP total
               <dttm>     <dbl> <fctr> <dbl>
1 2017-09-29 09:23:00        -1      A    -1
2 2017-09-29 09:23:00         1      B     1
3 2017-09-29 09:23:00         1      B     2
4 2017-09-29 09:24:00         1      B     3
5 2017-09-29 09:28:00         1      B     4
6 2017-09-29 09:29:00         1      A     0
7 2017-09-29 09:31:00        -1      B     3

but I'm not sure how to do part 2. That is, I would like to know how to perform the aggregation so that the second row of the data frame above is suppressed, giving the desired output.

After you calculate the running totals, you need to re-group by both timestamp and AP so that the rows of each timestamp-AP pair are together, then summarise to keep the maximum value. If you want the last value instead of the max, take the last element of each group, as lastTotal does in the code that follows (you could also keep the last row with slice(n())). Here the two answers are the same, but check whether that holds for your data.

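library(dplyr)

test %>%
  group_by(AP) %>%
  mutate(total = cumsum(plusminus)) %>%   # running total within each AP
  group_by(timestamp, AP) %>%             # one group per timestamp-AP pair
  summarise(maxTotal = max(total),        # largest running total in the group
            lastTotal = total[n()])       # running total on the group's last row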

gives

            timestamp     AP maxTotal lastTotal
               <dttm> <fctr>    <dbl>     <dbl>
1 2017-09-29 09:23:00      A       -1        -1
2 2017-09-29 09:23:00      B        2         2
3 2017-09-29 09:24:00      B        3         3
4 2017-09-29 09:28:00      B        4         4
5 2017-09-29 09:29:00      A        0         0
6 2017-09-29 09:31:00      B        3         3
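
The slice(n()) alternative mentioned above could look roughly like the following. This is only a sketch: the data frame is rebuilt from the question so the snippet runs on its own, and the name last_per_minute is made up for illustration. It keeps the last row of each timestamp-AP group, which matches the maximum for this data but not necessarily for yours.

```r
library(dplyr)

# Same data as in the question, rebuilt here so the snippet is self-contained
test <- data.frame(
  timestamp = as.POSIXct(
    c(1506676980, 1506676980, 1506676980, 1506677040,
      1506677280, 1506677340, 1506677460),
    origin = "1970-01-01", tz = "UTC"
  ),
  plusminus = c(-1, 1, 1, 1, 1, 1, -1),
  AP = factor(c("A", "B", "B", "B", "B", "A", "B"))
)

# last_per_minute is a hypothetical name, not from the original answer
last_per_minute <- test %>%
  group_by(AP) %>%
  mutate(total = cumsum(plusminus)) %>%   # running total within each AP
  group_by(timestamp, AP) %>%
  slice(n()) %>%                          # keep only the last row of each group
  ungroup() %>%
  select(timestamp, total, AP)

last_per_minute
```

Because the running total depends on row order, this assumes the rows are already sorted by time within each AP, as they are in the example data.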

Here is a data.table approach:

DATA

p <- structure(list(timestamp = structure(c(1506676980, 1506676980, 
1506676980, 1506677040, 1506677280, 1506677340, 1506677460), class = c("POSIXct", 
"POSIXt"), tzone = "UTC"), plusminus = c(-1, 1, 1, 1, 1, 1, -1
), AP = structure(c(1L, 2L, 2L, 2L, 2L, 1L, 2L), .Label = c("A", 
"B"), class = "factor")), .Names = c("timestamp", "plusminus", 
"AP"), row.names = c(NA, -7L), class = "data.frame")

CODE

library(data.table)
p <- as.data.table(p)
p[, total := cumsum(plusminus), by = AP][
  , max(total), by = .(AP, lubridate::round_date(timestamp, unit = "min"))]

OUTPUT

   AP           lubridate V1
1:  A 2017-09-29 09:23:00 -1
2:  B 2017-09-29 09:23:00  2
3:  B 2017-09-29 09:24:00  3
4:  B 2017-09-29 09:28:00  4
5:  A 2017-09-29 09:29:00  0
6:  B 2017-09-29 09:31:00  3

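The snippet above uses "chaining" (similar in spirit to the %>% pipe) to get the desired output. The first step computes a cumulative sum within each AP and saves it by reference as total. The second step groups by AP and by timestamp rounded to the nearest minute, and takes the max of total in each group. Note that round_date() rounds to the nearest minute rather than truncating; the example timestamps already fall on whole minutes, so the rounding leaves them unchanged.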
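
If your timestamps carry seconds and "each minute" should mean the minute a row falls in, truncating with lubridate::floor_date() may be closer to what you want than rounding. The following is a minimal sketch under that assumption; the names dt, res and minute_start are made up for illustration, and the grouping columns are named explicitly so the result does not depend on data.table's automatic column names.

```r
library(data.table)

# Same data as in the question, rebuilt as a data.table
dt <- data.table(
  timestamp = as.POSIXct(
    c(1506676980, 1506676980, 1506676980, 1506677040,
      1506677280, 1506677340, 1506677460),
    origin = "1970-01-01", tz = "UTC"
  ),
  plusminus = c(-1, 1, 1, 1, 1, 1, -1),
  AP = factor(c("A", "B", "B", "B", "B", "A", "B"))
)

# Step 1: running total within each AP, added by reference
dt[, total := cumsum(plusminus), by = AP]

# Step 2: maximum running total per AP and per minute,
# truncating each timestamp to the start of its minute
res <- dt[, .(total = max(total)),
          by = .(AP, minute_start = lubridate::floor_date(timestamp, unit = "minute"))]

res
```

For the example data this should give the same six rows as the rounding version, since every timestamp is already on a whole minute.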

I find data.table has a very clean approach that works very well for large datasets.
