Here is some example data:
Begin = c("10-10-2010 12:15:35", "10-10-2010 12:20:52", "10-10-2010 12:23:45", "10-10-2010 12:25:01", "10-10-2010 12:30:29")
End = c("10-10-2010 12:24:23", "10-10-2010 12:23:30", "10-10-2010 12:45:15", "10-10-2010 12:32:11", "10-10-2010 12:45:05")
df = data.frame(Begin, End)
For each event, I want to count the number of earlier events that have not yet finished when it begins, and record that count in a new column. For this example, the desired result is a column with the values 0, 1, 1, 1, 2.
I already have a data.table solution that works fine. What I would like is a solution that works with the RevoScaleR/mrsdeploy packages, so that the program can take advantage of parallel computing and data chunking.
Here is the data.table solution:
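library(lubridate)
library(data.table)
# Parse both columns as date-times (day-month-year hour:minute:second)
df <- as.data.frame(lapply(df, dmy_hms))
dt <- as.data.table(df)
# Key on the interval columns and number the rows in Begin order
setkey(dt, Begin, End)[, id := .I]
# Self-join on overlapping intervals; for each event, count the overlapping events that began earlier
merge(dt, foverlaps(dt, dt)[id > i.id, .N, by = "Begin,End"], all.x = TRUE)[, id := NULL][is.na(N), N := 0][]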
Again, I am looking for a solution that can be executed remotely on SQL Server 2016 with the packages mentioned.
Process begin and end in ascending order, and keep a count of how many begins and ends you have seen. If you don't have duplicate/spurious end events, this will work just fine.
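A minimal sketch of that idea in base R, using the example data from the question. The names events and n_open are illustrative, and the tie-breaking rule (an end at the same timestamp is processed before a begin) is an assumption, not something stated in the question:

```r
# Example timestamps from the question, parsed as day-month-year date-times
begin <- as.POSIXct(c("10-10-2010 12:15:35", "10-10-2010 12:20:52",
                      "10-10-2010 12:23:45", "10-10-2010 12:25:01",
                      "10-10-2010 12:30:29"),
                    format = "%d-%m-%Y %H:%M:%S", tz = "UTC")
end <- as.POSIXct(c("10-10-2010 12:24:23", "10-10-2010 12:23:30",
                    "10-10-2010 12:45:15", "10-10-2010 12:32:11",
                    "10-10-2010 12:45:05"),
                  format = "%d-%m-%Y %H:%M:%S", tz = "UTC")

# One row per begin (+1) and per end (-1), tagged with the source row
events <- data.frame(
  time = c(begin, end),
  type = rep(c(1L, -1L), each = length(begin)),
  id   = rep(seq_along(begin), 2)
)

# Sort by time; on ties, ends (-1) come before begins (+1) (assumption)
events <- events[order(events$time, events$type), ]

# Running count of open events after each begin/end has been processed
events$open <- cumsum(events$type)

# For a begin row, the events already open are the running count minus itself
begins <- events[events$type == 1L, ]
n_open <- integer(length(begin))
n_open[begins$id] <- begins$open - 1L

print(n_open)
```

Because this needs one sort and one cumulative sum, it avoids comparing every event against every other event.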
A simple sapply seems to do it, once Begin and End have been parsed as date-times (as in the question's data.table code):
sapply(df$Begin, function(x) sum((x < df$End) & (x > df$Begin)))
To parallelize it, just use rxExec, mclapply, parLapply, foreach, etc.
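As a rough illustration, here is how the same count might be distributed with parLapply from the base parallel package. The two-worker local cluster and the helper name count_open are assumptions made for this sketch. rxExec is not shown because it requires RevoScaleR.

```r
library(parallel)

# Example timestamps from the question, parsed as day-month-year date-times
begin <- as.POSIXct(c("10-10-2010 12:15:35", "10-10-2010 12:20:52",
                      "10-10-2010 12:23:45", "10-10-2010 12:25:01",
                      "10-10-2010 12:30:29"),
                    format = "%d-%m-%Y %H:%M:%S", tz = "UTC")
end <- as.POSIXct(c("10-10-2010 12:24:23", "10-10-2010 12:23:30",
                    "10-10-2010 12:45:15", "10-10-2010 12:32:11",
                    "10-10-2010 12:45:05"),
                  format = "%d-%m-%Y %H:%M:%S", tz = "UTC")

# Hypothetical helper: count events that began before event i and are still open
count_open <- function(i, begin, end) {
  sum(begin[i] < end & begin[i] > begin)
}

# Assumption: a small local cluster; adjust the worker count to the machine
cl <- makeCluster(2)
# Iterate over row indices and pass the full vectors to each worker explicitly
n_open <- unlist(parLapply(cl, seq_along(begin), count_open,
                           begin = begin, end = end))
stopCluster(cl)

print(n_open)
```

Each worker still scans all events for every index it handles, so this spreads the quadratic work across processes rather than reducing it.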
I found that the quickest way to do this was in T-SQL. The approach is described here: http://sqlmag.com/t-sql/intervals-and-counts-part-1
It could also be translated to R easily by anyone doing this in the future, but I chose to complete the operation in T-SQL.