I have a long format table data containing around 1.2 million rows in the below format
+-----------+-------+ | ProcessId | Event | +-----------+-------+ | 111 | A | | 111 | B | | 111 | C | | 111 | D | | 111 | E | | 112 | A | | 112 | B | | 112 | D | | 112 | E | | 113 | A | | 113 | B | | 113 | C | | 113 | D | | 113 | E | +-----------+-------+
The events are already sorted by their time-stamp and I want to generate the process flow string for each process id from the long format table into a separate column to another table like below
processflow <- ('A->B->C->D->E' , 'A->B->D->E' ,'A->B->C->D->E' ,..etc)
I tried the below lines of code. but each of them is taking almost 850 seconds in my core i3, 4GB RAM laptop. what are the alternative ways to do this faster ? Any guidance will be helpful
Version 1 using dplyr package
state_flow_identifier <- function(x){
y <-filter(data_longformat,process_id==x)
y_v <- as.vector(y$Event)
as.character(paste(y_v,collapse = '->'))
}
processflow_detection <- mutate(processflow_detection,processflow = apply(processflow_detection['process_id'],1,function(x)state_flow_identifier(x)))
Version 2 using foreach package
processflow <- foreach(i=processflow_detection$process_id,.combine='c') %do% state_flow_identifier(i)
Thanks.
You should also check out data.table
for efficiency, (Using @Gopala's data frame)
library(data.table)
setDT(df)[, .(event_flow = paste(Event, collapse = '->')), by = ProcessId]
# ProcessId event_flow
#1: 111 A->B->C->D->E
#2: 112 A->B->C->D->E
#3: 113 A->B->C->D->E
You can try something like this with dplyr
and it's group_by
function:
library(dplyr)
df %>%
group_by(ProcessId) %>%
summarise(eventFlow = paste(Event, collapse = ' -> '))
Output will be as follows:
Source: local data frame [3 x 2]
ProcessId eventFlow
(dbl) (chr)
1 111 A -> B -> C -> D -> E
2 112 A -> B -> C -> D -> E
3 113 A -> B -> C -> D -> E
Note, I used this data frame as input since your's was not in a format to input into R:
df <- data.frame(ProcessId = c(rep(111, 5), rep(112, 5), rep(113, 5)), Event = rep(c('A', 'B', 'C', 'D', 'E'), 3))
df
ProcessId Event
1 111 A
2 111 B
3 111 C
4 111 D
5 111 E
6 112 A
7 112 B
8 112 C
9 112 D
10 112 E
11 113 A
12 113 B
13 113 C
14 113 D
15 113 E
You don't need any libraries.
df <- data.frame(ProcessId=c(rep(111,5),rep(112,4),rep(113,5)),Event=unlist(strsplit("ABCDEABDEABCDE","")))
len <- nrow(df)
new_process_id <- c(TRUE, df$ProcessId[-len] != df$ProcessId[-1])
sym <- rep(" -> ", len)
sym[new_process_id] <- ","
str <- paste(sym, df$Event, sep="", collapse="")
data.frame(ProcessId=rle(df$ProcessId)$val,EventFlow=unlist(strsplit(str,","))[-1])
Output:
ProcessId EventFlow
1 111 A -> B -> C -> D -> E
2 112 A -> B -> D -> E
3 113 A -> B -> C -> D -> E
For 1.25 mln lines, the execution time was about 2 seconds on my laptop
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.