R - Optimizing loop operation on long table format

Question

I have a long format table data containing around 1.2 million rows in the below format

+-----------+-------+
| ProcessId | Event |
+-----------+-------+
|       111 | A     |
|       111 | B     |
|       111 | C     |
|       111 | D     |
|       111 | E     |
|       112 | A     |
|       112 | B     |
|       112 | D     |
|       112 | E     |
|       113 | A     |
|       113 | B     |
|       113 | C     |
|       113 | D     |
|       113 | E     |
+-----------+-------+

The events are already sorted by their time-stamp and I want to generate the process flow string for each process id from the long format table into a separate column to another table like below

processflow <- ('A->B->C->D->E' , 'A->B->D->E' ,'A->B->C->D->E' ,..etc)

I tried the below lines of code. but each of them is taking almost 850 seconds in my core i3, 4GB RAM laptop. what are the alternative ways to do this faster ? Any guidance will be helpful

Version 1 using dplyr package

state_flow_identifier <- function(x){
y <-filter(data_longformat,process_id==x)
y_v <- as.vector(y$Event)
as.character(paste(y_v,collapse = '->'))
}

processflow_detection <- mutate(processflow_detection,processflow = apply(processflow_detection['process_id'],1,function(x)state_flow_identifier(x)))

Version 2 using foreach package

processflow <- foreach(i=processflow_detection$process_id,.combine='c') %do% state_flow_identifier(i)

Thanks.

Answer 1

You should also check out data.table for efficiency, (Using @Gopala's data frame)

library(data.table)
setDT(df)[, .(event_flow = paste(Event, collapse = '->')), by = ProcessId]
#   ProcessId    event_flow
#1:       111 A->B->C->D->E
#2:       112 A->B->C->D->E
#3:       113 A->B->C->D->E

Answer 2

You can try something like this with dplyr and it's group_by function:

library(dplyr)
df %>%
  group_by(ProcessId) %>%
  summarise(eventFlow = paste(Event, collapse = ' -> '))

Output will be as follows:

Source: local data frame [3 x 2]

  ProcessId             eventFlow
      (dbl)                 (chr)
1       111 A -> B -> C -> D -> E
2       112 A -> B -> C -> D -> E
3       113 A -> B -> C -> D -> E

Note, I used this data frame as input since your's was not in a format to input into R:

df <- data.frame(ProcessId = c(rep(111, 5), rep(112, 5), rep(113, 5)), Event = rep(c('A', 'B', 'C', 'D', 'E'), 3))
df
   ProcessId Event
1        111     A
2        111     B
3        111     C
4        111     D
5        111     E
6        112     A
7        112     B
8        112     C
9        112     D
10       112     E
11       113     A
12       113     B
13       113     C
14       113     D
15       113     E

Answer 3

You don't need any libraries.

df <- data.frame(ProcessId=c(rep(111,5),rep(112,4),rep(113,5)),Event=unlist(strsplit("ABCDEABDEABCDE","")))
len <- nrow(df)
new_process_id <- c(TRUE, df$ProcessId[-len] != df$ProcessId[-1])
sym <- rep(" -> ", len)
sym[new_process_id] <- ","
str <- paste(sym, df$Event, sep="", collapse="")
data.frame(ProcessId=rle(df$ProcessId)$val,EventFlow=unlist(strsplit(str,","))[-1])

Output:

  ProcessId             EventFlow
1       111 A -> B -> C -> D -> E
2       112      A -> B -> D -> E
3       113 A -> B -> C -> D -> E

For 1.25 mln lines, the execution time was about 2 seconds on my laptop

R - Optimizing loop operation on long table format

Question

3 answers

solution1
2 2016-05-07 16:23:31

solution2
1 ACCPTED 2016-05-07 15:49:44

solution3
0 2016-05-07 16:13:55

R - Optimizing loop operation on long table format

Question

3 answers

solution1 2 2016-05-07 16:23:31

solution2 1 ACCPTED 2016-05-07 15:49:44

solution3 0 2016-05-07 16:13:55

solution1
2 2016-05-07 16:23:31

solution2
1 ACCPTED 2016-05-07 15:49:44

solution3
0 2016-05-07 16:13:55