R - Finding max number of consecutive values by key

Question

I have a dataset of messages sent to users, some have succeeded and some have failed:

> df.messages <- data.frame(date = c("2018-01-01 12:00","2018-01-01 12:00","2018-01-01 12:00","2018-01-02 12:00","2018-01-02 12:00","2018-01-02 12:00","2018-01-03 12:00","2018-01-03 12:00","2018-01-03 12:00","2018-01-04 12:00","2018-01-04 12:00","2018-01-04 12:00"), id = c(1,2,3,1,2,3,1,2,3,1,2,3), status = c("S","S","S","S","S","F","S","F","F","F","F","S"))
> df.messages
               date id status
1  2018-01-01 12:00  1      S
2  2018-01-01 12:00  2      S
3  2018-01-01 12:00  3      S
4  2018-01-02 12:00  1      S
5  2018-01-02 12:00  2      S
6  2018-01-02 12:00  3      F
7  2018-01-03 12:00  1      S
8  2018-01-03 12:00  2      F
9  2018-01-03 12:00  3      F
10 2018-01-04 12:00  1      F
11 2018-01-04 12:00  2      F
12 2018-01-04 12:00  3      S

Here's what to note:

One message is sent each day, over four days
id 1 succeeds (S) three times, then fails (F)
id 2 succeeds twice, then fails twice
id 3 succeeds once, then fails twice, then succeeds

I would like to break the users into four groups

those that always succeeded
those that failed, then succeeded later
those that succeeded, then failed without ever succeeding again
those that always failed

And then understand

the max number of times a user in group 2 failed before succeeding again
the max amount of time a user in group 2 failed before succeeding again
the max number of times a user in group 3 failed
the max amount of time a user in group 3 failed

The ideal output would be

  id group num_f_messages date_f_messages
1  1     3              1               1
2  2     3              2               2
3  3     2              2               2

I know I need to use rle() and diff() , but it's getting complicated and I haven't had to do this type of analysis before. I'm pretty lost.

I have 9MM rows, so I'm trying to accomplish this with data.table, but any solutions are welcome.

Edit:

I'm trying to extend this function to a larger dataset. So in a scenario where id 3's messages were "S,F,F,S,F,F,F,S", I need to reflect a maximum of 3 Fs before the final S.

Answer 1

You can try this:

require(plyr); require(dplyr)

df.messages %>% 
  group_by(id) %>% 
  summarise(group = ifelse(sum(status == "S") == n(), 1, 
                           ifelse(sum(status == "F") == n(), 4,
                                  ifelse(n_distinct(status) > 1 & 
                                           status[1] == "S" & status[n()] == "S", 2, 3))),
            num_f_messages = sum(status == "F"),
            date_f_messages = n_distinct(date[status == "F"]))

gives you:

# A tibble: 3 x 4
     id group num_f_messages date_f_messages
  <dbl> <dbl>          <int>           <int>
1     1     3              1               1
2     2     3              2               2
3     3     2              2               2

Answer 2

Here is a data.table solution.

library(data.table)
library(magrittr)

df.messages <- data.frame(date = c("2018-01-01 12:00","2018-01-01 12:00","2018-01-01 12:00","2018-01-02 12:00","2018-01-02 12:00","2018-01-02 12:00","2018-01-03 12:00","2018-01-03 12:00","2018-01-03 12:00","2018-01-04 12:00","2018-01-04 12:00","2018-01-04 12:00"), id = c(1,2,3,1,2,3,1,2,3,1,2,3), status = c("S","S","S","S","S","F","S","F","F","F","F","S"))
df.messages$status <- as.character(df.messages$status)
setDT(df.messages)


ans <- df.messages[,
            .(
                by_rle = paste0(rle(status)$value, collapse = ""),
                num_f_message = sum(status == "F"),
                date_f_message = length(unique(date[status == "F"]))
            ), 
            by = id] %>%
    # define groups and remove the by_rle columns
    .[by_rle == "S", group := 1] %>%
    .[by_rle == c("SFS"), group := 2] %>%
    .[by_rle == c("SF"), group := 3] %>%
    .[by_rle == "F", group := 4] %>%
    .[, by_rle := NULL] %>%
    setcolorder(c("id", "group", "num_f_message", "date_f_message"))

#    id group num_f_message date_f_message
# 1:  1     3             1              1
# 2:  2     3             2              2
# 3:  3     2             2              2

R - Finding max number of consecutive values by key

Question

2 answers

solution1
1 2018-01-23 14:02:48

solution2
1 ACCPTED 2018-01-23 17:20:55

R - Finding max number of consecutive values by key

Question

2 answers

solution1 1 2018-01-23 14:02:48

solution2 1 ACCPTED 2018-01-23 17:20:55

solution1
1 2018-01-23 14:02:48

solution2
1 ACCPTED 2018-01-23 17:20:55