简体   繁体   中英

Identify the start and end of a trip in R

I would like to identify the duration of an activity that start at t1 and end at t7. The starting point is t1 that records the occurrence of an activity at t1_1, t1_2, t1_3 and so on. For example in the case of id 12 activity occurred at t1_2 and t1_3 (i would like to save this) t2_2 (as there is no activity before and after I am not intrested in this activity), t3_1 (same as t2_2), t3_3, t4_2, t5_2, t6_1, t6_2, t6_3 and t7_3. I would like to identify to the start and end all id's in which an activity occured, the duration and the most frequent one.

Input:

id t1_1 t1_2 t1_3 t2_1 t2_2 t2_3 t3_1 t3_2 t3_3 t4_1 t4_2 t4_3 t5_1 t5_2 t5_3 t6_1 t6_2 t6_3 t7_1 t7_2 t7_3
12  0    1     1    0     1   0    1    0    1    0    1    0    0    1    0     1   1     1   0      0  1
123 0    0     0    1     1   1    0    0    0    1    1    1    1    1    1     0   0     0    1     1  1
 10  1   1     1    1     1    1    1   1    1    1    1    1    1    1    1     1   1     1    1     1  1   

Output for id 12

Id    Start/End    Duration  Frequency
12   t1_1, t1_3     2         1
12   t6_1, t6_3     3         1

One way to dot this is using the bioconductor library but are there any better solution?

Sample data

df1 <- structure(list(id = c(12L, 123L, 10L), t1_1 = c(0L, 0L, 1L), 
            t1_2 = c(1L, 0L, 1L), t1_3 = c(1L, 0L, 1L), t2_1 = c(0L, 
            1L, 1L), t2_2 = c(1L, 1L, 1L), t2_3 = c(0L, 1L, 1L), t3_1 = c(1L, 
            0L, 1L), t3_2 = c(0L, 0L, 1L), t3_3 = c(1L, 0L, 1L), t4_1 = c(0L, 
            1L, 1L), t4_2 = c(1L, 1L, 1L), t4_3 = c(0L, 1L, 1L), t5_1 = c(0L, 
            1L, 1L), t5_2 = c(1L, 1L, 1L), t5_3 = c(0L, 1L, 1L), t6_1 = c(1L, 
            0L, 1L), t6_2 = c(1L, 0L, 1L), t6_3 = c(1L, 0L, 1L), t7_1 = c(0L, 
            1L, 1L), t7_2 = c(0L, 1L, 1L), t7_3 = c(1L, 1L, 1L)), 
            class = "data.frame", row.names = c(NA, 
        -3L))

We convert to 'long' format with pivot_longer , then create a grouping variable with rleid (from data.table ) based on the occurrence of similar adjacent elements in 'value', filter the rows where the 'value' is 1, grouped by 'id', 'grp', we keep only rows where the frequency count is greater than 1, summarise by paste ing ( str_c ) the first and last elements of 'name' as well get the count ( n() ) and arrange if necessary

library(dplyr)
library(tidyr)
library(stringr)
library(data.table)
df1 %>% 
   pivot_longer(cols = -id) %>% 
   mutate(grp = rleid(value)) %>%
   filter(as.logical(value)) %>%
   group_by(id, grp) %>% 
   filter(n() > 1) %>%
   summarise(Start_End = str_c(first(name), last(name), sep=", "),
       Duration = n()) %>%
   arrange(id, grp)
library('data.table')
df1 <- melt(setDT(df1), id.var = 'id')
df1[, c('time', 'subtime') := tstrsplit(as.character(variable), "_", fixed = TRUE)]
df2 <- df1[, rle(value), by = .(id, time)][lengths > 1 & values == 1, ]
df3 <- df1[df2, on = c('id', 'time')]
df3 <- df3[, .(`Start/End` = paste0(time, '_', c(min(subtime), max(subtime)), collapse = " - "), 
                           Duration = unique(lengths)), 
           by = .(id, time)]
df3[, Frequency := .N, by = .(id, `Start/End`)]
df3[, time := NULL]
df3[order(id), ]

#     id   Start/End Duration Frequency
# 1:  10 t1_1 - t1_3        3         1
# 2:  10 t2_1 - t2_3        3         1
# 3:  10 t3_1 - t3_3        3         1
# 4:  10 t4_1 - t4_3        3         1
# 5:  10 t5_1 - t5_3        3         1
# 6:  10 t6_1 - t6_3        3         1
# 7:  10 t7_1 - t7_3        3         1
# 8:  12 t1_1 - t1_3        2         1
# 9:  12 t6_1 - t6_3        3         1
# 10: 123 t2_1 - t2_3        3         1
# 11: 123 t4_1 - t4_3        3         1
# 12: 123 t5_1 - t5_3        3         1
# 13: 123 t7_1 - t7_3        3         1

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM