简体   繁体   中英

R - Finding max number of consecutive values by key

I have a dataset of messages sent to users, some have succeeded and some have failed:

> df.messages <- data.frame(date = c("2018-01-01 12:00","2018-01-01 12:00","2018-01-01 12:00","2018-01-02 12:00","2018-01-02 12:00","2018-01-02 12:00","2018-01-03 12:00","2018-01-03 12:00","2018-01-03 12:00","2018-01-04 12:00","2018-01-04 12:00","2018-01-04 12:00"), id = c(1,2,3,1,2,3,1,2,3,1,2,3), status = c("S","S","S","S","S","F","S","F","F","F","F","S"))
> df.messages
               date id status
1  2018-01-01 12:00  1      S
2  2018-01-01 12:00  2      S
3  2018-01-01 12:00  3      S
4  2018-01-02 12:00  1      S
5  2018-01-02 12:00  2      S
6  2018-01-02 12:00  3      F
7  2018-01-03 12:00  1      S
8  2018-01-03 12:00  2      F
9  2018-01-03 12:00  3      F
10 2018-01-04 12:00  1      F
11 2018-01-04 12:00  2      F
12 2018-01-04 12:00  3      S

Here's what to note:

  • One message is sent each day, over four days
  • id 1 succeeds (S) three times, then fails (F)
  • id 2 succeeds twice, then fails twice
  • id 3 succeeds once, then fails twice, then succeeds

I would like to break the users into four groups

  1. those that always succeeded
  2. those that failed, then succeeded later
  3. those that succeeded, then failed without ever succeeding again
  4. those that always failed

And then understand

  • the max number of times a user in group 2 failed before succeeding again
  • the max amount of time a user in group 2 failed before succeeding again
  • the max number of times a user in group 3 failed
  • the max amount of time a user in group 3 failed

The ideal output would be

  id group num_f_messages date_f_messages
1  1     3              1               1
2  2     3              2               2
3  3     2              2               2

I know I need to use rle() and diff() , but it's getting complicated and I haven't had to do this type of analysis before. I'm pretty lost.

I have 9MM rows, so I'm trying to accomplish this with data.table, but any solutions are welcome.

Edit:

I'm trying to extend this function to a larger dataset. So in a scenario where id 3's messages were "S,F,F,S,F,F,F,S", I need to reflect a maximum of 3 Fs before the final S.

You can try this:

require(plyr); require(dplyr)

df.messages %>% 
  group_by(id) %>% 
  summarise(group = ifelse(sum(status == "S") == n(), 1, 
                           ifelse(sum(status == "F") == n(), 4,
                                  ifelse(n_distinct(status) > 1 & 
                                           status[1] == "S" & status[n()] == "S", 2, 3))),
            num_f_messages = sum(status == "F"),
            date_f_messages = n_distinct(date[status == "F"]))

gives you:

# A tibble: 3 x 4
     id group num_f_messages date_f_messages
  <dbl> <dbl>          <int>           <int>
1     1     3              1               1
2     2     3              2               2
3     3     2              2               2

Here is a data.table solution.

library(data.table)
library(magrittr)

df.messages <- data.frame(date = c("2018-01-01 12:00","2018-01-01 12:00","2018-01-01 12:00","2018-01-02 12:00","2018-01-02 12:00","2018-01-02 12:00","2018-01-03 12:00","2018-01-03 12:00","2018-01-03 12:00","2018-01-04 12:00","2018-01-04 12:00","2018-01-04 12:00"), id = c(1,2,3,1,2,3,1,2,3,1,2,3), status = c("S","S","S","S","S","F","S","F","F","F","F","S"))
df.messages$status <- as.character(df.messages$status)
setDT(df.messages)


ans <- df.messages[,
            .(
                by_rle = paste0(rle(status)$value, collapse = ""),
                num_f_message = sum(status == "F"),
                date_f_message = length(unique(date[status == "F"]))
            ), 
            by = id] %>%
    # define groups and remove the by_rle columns
    .[by_rle == "S", group := 1] %>%
    .[by_rle == c("SFS"), group := 2] %>%
    .[by_rle == c("SF"), group := 3] %>%
    .[by_rle == "F", group := 4] %>%
    .[, by_rle := NULL] %>%
    setcolorder(c("id", "group", "num_f_message", "date_f_message"))

#    id group num_f_message date_f_message
# 1:  1     3             1              1
# 2:  2     3             2              2
# 3:  3     2             2              2

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM