Identify patterns based on start and end points

I would like to identify the duration of an activity recorded between t1 and t7. Each time point is split into sub-intervals (t1_1, t1_2, t1_3, and so on), and each column records whether the activity occurred (1) or not (0). For example, for id 12 the activity occurred from t1_2 until t3_1 (I would like to keep all occurrences). For every id, I would like to identify the start and end of each sequence in which the activity occurred consecutively 4 or more times (e.g. the value 1 occurring 4 times in a row), its duration, and the most frequent one. Zeros define the boundaries of a sequence (i.e. a sequence starts and ends with a one and is preceded by a zero).

Input:

id   t1_1 t1_2 t1_3 t2_1 t2_2 t2_3 t3_1 t3_2 t3_3 t4_1 t4_2 t4_3 t5_1 t5_2 t5_3 t6_1 t6_2 t6_3 t7_1 t7_2 t7_3
12   0    1    1    1    1    1    1    0    0    0    1    0    0    1    0    1    1    1    1    0    1
123  0    0    0    1    1    1    0    0    0    1    1    1    1    1    1    0    0    0    1    1    1
10   1    1    1    1    1    1    1    1    1    1    1    1    1    1    1    1    1    1    1    1    1

Output for id 12

Id   Start/End                             Duration  Frequency
12   t1_2, t1_3, t2_1, t2_2, t2_3, t3_1    6         1
12   t6_1, t6_2, t6_3, t7_1                4         1

Sample data

 df1 <- structure(list(serial = c(12L, 123L, 10L), t1_1 = c(0L, 0L, 1L), 
                t1_2 = c(1L, 0L, 1L), t1_3 = c(1L, 0L, 1L), t2_1 = c(0L, 
                1L, 1L), t2_2 = c(1L, 1L, 1L), t2_3 = c(0L, 1L, 1L), t3_1 = c(1L, 
                0L, 1L), t3_2 = c(0L, 0L, 1L), t3_3 = c(1L, 0L, 1L), t4_1 = c(0L, 
                1L, 1L), t4_2 = c(1L, 1L, 1L), t4_3 = c(0L, 1L, 1L), t5_1 = c(0L, 
                1L, 1L), t5_2 = c(1L, 1L, 1L), t5_3 = c(0L, 1L, 1L), t6_1 = c(1L, 
                0L, 1L), t6_2 = c(1L, 0L, 1L), t6_3 = c(1L, 0L, 1L), t7_1 = c(0L, 
                1L, 1L), t7_2 = c(0L, 1L, 1L), t7_3 = c(1L, 1L, 1L)), 
                class = "data.frame", row.names = c(NA, 
            -3L))

Code so far

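library(data.table)

# Reshape to long format: one row per serial and time column
df1 <- melt(setDT(df1), id.vars = 'serial')
# Split e.g. "t1_2" into time = "t1" and subtime = "2"
df1[, c('time', 'subtime') := tstrsplit(as.character(variable), "_", fixed = TRUE)]
# Run-length encode within each serial/time block; keep runs of 1s longer than 1
df2 <- df1[, rle(value), by = .(serial, time)][lengths > 1 & values == 1, ]
# Join the long data onto the blocks that contain such a run
df3 <- df1[df2, on = c('serial', 'time')]
# Label each block by its smallest and largest subtime and keep the run length
df3 <- df3[, .(`Start/End` = paste0(time, '_', c(min(subtime), max(subtime)), collapse = " - "), 
               Duration = unique(lengths)), 
           by = .(serial, time)]
# Count how often each label occurs per serial
df3[, Frequency := .N, by = .(serial, `Start/End`)]
df3[, time := NULL]
df3[order(serial), ]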

I would suggest the following approach using tidyverse functions. Since you want to identify sequences, the code below could be useful. The main idea is to reshape the data to long format and split the time variables (t) so that you can create ids for the sequences and then aggregate:

library(tidyverse)

df1 %>% arrange(serial) %>% pivot_longer(cols = -serial) %>%
  # Duplicate the time variable so the original name is kept
  mutate(name2=name) %>%
  # Split the copy so that var1 holds the category (t1, t2, ...) and var2 the sub-interval
  separate(name2,into = c('var1','var2'),sep = '_') %>%
  # Group by main id, category and value
  group_by(serial,var1,value) %>%
  # Create a unique id for each group
  mutate(id=cur_group_id()) %>%
  # Drop zeros, which are not part of any sequence
  ungroup() %>% filter(value!=0) %>%
  # Aggregate with the new id
  group_by(serial,id) %>%
  # Compute outputs: the chain of column names and its length
  summarise(chain=paste0(name,collapse = ','),Duration=n()) %>%
  select(-id) -> dfprime

The output (I include only serial 12):

# A tibble: 7 x 3
# Groups:   serial [1]
  serial chain          Duration
   <int> <chr>             <int>
1     12 t1_2,t1_3             2
2     12 t2_2                  1
3     12 t3_1,t3_3             2
4     12 t4_2                  1
5     12 t5_2                  1
6     12 t6_1,t6_2,t6_3        3
7     12 t7_3                  1
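
Note that the grouping above is done within each t block and by value, so chains never cross from one block to the next (say, t1 to t2), and ones separated by a zero inside the same block end up in one chain (for example t3_1,t3_3). If runs should instead span the whole row, as in the expected output of the question, a minimal base R sketch using rle() might look like the following. The helper name find_runs and the min_len threshold of 4 are assumptions and not part of the original code. The vector x12 is copied from the id 12 row of the Input table, which differs from the dput sample data. The Frequency column is left out because its definition is not clear from the question.

```r
# Hypothetical helper (not from the original post): find runs of 1s of at
# least `min_len` consecutive columns in a named 0/1 vector.
find_runs <- function(x, min_len = 4) {
  r <- rle(as.integer(x))
  ends <- cumsum(r$lengths)          # last position of each run
  starts <- ends - r$lengths + 1     # first position of each run
  keep <- which(r$values == 1 & r$lengths >= min_len)
  data.frame(
    # Column names covered by each kept run, joined into one string
    chain = vapply(keep,
                   function(i) paste(names(x)[starts[i]:ends[i]], collapse = ", "),
                   character(1)),
    Duration = r$lengths[keep],
    stringsAsFactors = FALSE
  )
}

# Column names t1_1, t1_2, t1_3, t2_1, ..., t7_3
nms <- paste0("t", rep(1:7, each = 3), "_", rep(1:3, times = 7))

# Row for id 12 as given in the Input table (assumption: this is the intended data)
x12 <- c(0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 1, 1, 0, 1)
names(x12) <- nms

find_runs(x12)
```

For this row the helper should return two runs, t1_2 to t3_1 with Duration 6 and t6_1 to t7_1 with Duration 4, matching the expected output for id 12. The same function could be applied to each row of the wide data frame, before it is melted or pivoted.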

If you need further aggregations, you can compute them from the final data frame (dfprime).
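
As one possible illustration of such a follow-up aggregation, the sketch below picks the longest chain per serial. To keep it self-contained, dfprime is rebuilt here as a stand-in from the serial 12 rows shown in the output. The name longest is hypothetical, and slice_max() is only one of several ways to do this.

```r
library(dplyr)

# Stand-in for dfprime: the serial 12 rows copied from the output shown above
dfprime <- tibble(
  serial = 12L,
  chain = c("t1_2,t1_3", "t2_2", "t3_1,t3_3", "t4_2", "t5_2",
            "t6_1,t6_2,t6_3", "t7_3"),
  Duration = c(2L, 1L, 2L, 1L, 1L, 3L, 1L)
)

# Keep the single longest chain for each serial
longest <- dfprime %>%
  group_by(serial) %>%
  slice_max(Duration, n = 1, with_ties = FALSE) %>%
  ungroup()

longest
```

Setting with_ties = FALSE keeps one row per serial even when several chains share the maximum duration. Drop that argument if all tied chains should be returned.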
