
How to subset and find time difference between two data points using bike station data

I am experimenting with bike station data. I have a for loop that extracts the cases where a bike started at a different station from the one where it last stopped, then rearranges stoptime and starttime to show how the operator moved the bike (from where it stopped to where it next started), along with diff.time, the time in seconds between when it last stopped and when it next started.

Sample data

            starttime            stoptime start.station.id end.station.id bikeid
1 2017-01-16 13:08:18 2017-01-16 13:28:13             3156            466      1
2 2017-01-10 19:10:31 2017-01-10 19:16:02              422           3090      1
3 2017-01-04 08:47:42 2017-01-04 08:57:10              507            442      1
4 2017-01-12 18:08:33 2017-01-12 18:36:09              546           3151      2
5 2017-01-21 09:52:13 2017-01-21 10:21:07             3243            212      2
6 2017-01-26 05:46:18 2017-01-26 05:49:13              470            168      2

My code

raw_data = test

unique_id = unique(raw_data$bikeid)
output1 <- data.frame("bikeid"= integer(0), "end.station.id"= integer(0), "start.station.id" = integer(0), "diff.time" = numeric(0),  "stoptime" = character(),"starttime" = character(), stringsAsFactors=FALSE)

for (bikeid in unique_id)
{
  onebike <- raw_data[ which(raw_data$bikeid== bikeid), ]
  onebike$starttime <- strptime(onebike$starttime, "%Y-%m-%d %H:%M:%S", tz = "EST")
  onebike <- onebike[order(onebike$starttime, decreasing = FALSE),]

  if(nrow(onebike) >=2 ){
    for(i in 2:nrow(onebike )) {
      print(onebike)
      if(is.integer(onebike[i-1,"end.station.id"]) & is.integer(onebike[i,"start.station.id"]) &
         onebike[i-1,"end.station.id"] != onebike[i,"start.station.id"]){
        diff_time <- as.double(difftime(strptime(onebike[i,"starttime"], "%Y-%m-%d %H:%M:%S", tz = "EST"),
                                        strptime(onebike[i-1,"stoptime"], "%Y-%m-%d %H:%M:%S", tz = "EST")
                                        ,units = "secs"))
        new_row <- c(bikeid, onebike[i-1,"end.station.id"], onebike[i,"start.station.id"], diff_time, as.character(onebike[i-1,"stoptime"]), as.character(onebike[i,"starttime"]))
        output1[nrow(output1) + 1,] = new_row
      }
    }
  }
}

Output

  bikeid end.station.id start.station.id diff.time            stoptime           starttime
1      1            442              422    555201 2017-01-04 08:57:10 2017-01-10 19:10:31
2      1           3090             3156    496336 2017-01-10 19:16:02 2017-01-16 13:08:18
3      2           3151             3243    746164 2017-01-12 18:36:09 2017-01-21 09:52:13
4      2            212              470    415511 2017-01-21 10:21:07 2017-01-26 05:46:18
5      3           3112              351   1587161 2017-01-12 08:58:42 2017-01-30 17:51:23

However, on a large dataset this for loop takes a very long time. Is there a way to use dplyr or data.table to speed up this loop, or to rearrange the data in a way that avoids looping? I would appreciate any explanation or suggestions.

Sample data (in dput)

structure(list(starttime = structure(c(1484572098, 1484075431, 
1483519662, 1484244513, 1484992333, 1485409578, 1484210616, 1483727948, 
1485798683), class = c("POSIXct", "POSIXt"), tzone = "UTC"), 
    stoptime = structure(c(1484573293, 1484075762, 1483520230, 
    1484246169, 1484994067, 1485409753, 1484211522, 1483729024, 
    1485799997), class = c("POSIXct", "POSIXt"), tzone = "UTC"), 
    start.station.id = c(3156L, 422L, 507L, 546L, 3243L, 470L, 
    439L, 309L, 351L), end.station.id = c(466L, 3090L, 442L, 
    3151L, 212L, 168L, 3112L, 439L, 433L), bikeid = c(1, 1, 1, 
    2, 2, 2, 3, 3, 3)), .Names = c("starttime", "stoptime", "start.station.id", 
"end.station.id", "bikeid"), row.names = c(NA, -9L), class = "data.frame")

One approach is the following. I called your data foo. First, sort the data by bikeid and starttime. Then, for each bikeid, create new columns (i.e., next.start.station.id and next.start.time) using lead(), and compute the time difference using difftime(). After that, remove the rows where end.station.id and next.start.station.id are the same. This step also drops the last trip of each bike, because lead() returns NA there and filter() discards rows whose condition evaluates to NA. Finally, arrange the columns as you wish.

library(dplyr)

foo %>%
  arrange(bikeid, starttime) %>%  # if necessary, arrange(bikeid, starttime, stoptime)
  group_by(bikeid) %>%
  mutate(next.start.station.id = lead(start.station.id),
         next.start.time = lead(starttime),
         diff.time = difftime(next.start.time, stoptime, units = "secs")) %>%
  filter(end.station.id != next.start.station.id) %>%
  select(bikeid, end.station.id, next.start.station.id, diff.time, stoptime, next.start.time)


   bikeid end.station.id next.start.station.id diff.time stoptime            next.start.time    
    <dbl>          <int>                 <int> <time>    <dttm>              <dttm>             
 1   1.00            442                   422 555201    2017-01-04 08:57:10 2017-01-10 19:10:31
 2   1.00           3090                  3156 496336    2017-01-10 19:16:02 2017-01-16 13:08:18
 3   2.00           3151                  3243 746164    2017-01-12 18:36:09 2017-01-21 09:52:13
 4   2.00            212                   470 415511    2017-01-21 10:21:07 2017-01-26 05:46:18
 5   3.00           3112                   351 1587161   2017-01-12 08:58:42 2017-01-30 17:51:23
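
Since the question also mentions data.table, here is a rough sketch of the same idea with that package. It is not part of the approach above, only a possible translation of it: shift(type = "lead") stands in for lead(), and the names dt and moved are placeholders chosen for illustration. The data frame foo is rebuilt from the dput output in the question so that the snippet runs on its own.

```r
library(data.table)

# Sample data from the question's dput output (times are seconds since the epoch, UTC).
foo <- data.frame(
  starttime = as.POSIXct(c(1484572098, 1484075431, 1483519662, 1484244513, 1484992333,
                           1485409578, 1484210616, 1483727948, 1485798683),
                         origin = "1970-01-01", tz = "UTC"),
  stoptime = as.POSIXct(c(1484573293, 1484075762, 1483520230, 1484246169, 1484994067,
                          1485409753, 1484211522, 1483729024, 1485799997),
                        origin = "1970-01-01", tz = "UTC"),
  start.station.id = c(3156L, 422L, 507L, 546L, 3243L, 470L, 439L, 309L, 351L),
  end.station.id = c(466L, 3090L, 442L, 3151L, 212L, 168L, 3112L, 439L, 433L),
  bikeid = c(1, 1, 1, 2, 2, 2, 3, 3, 3)
)

# dt is a hypothetical name; as.data.table() copies, so foo itself is left untouched.
dt <- as.data.table(foo)
setorder(dt, bikeid, starttime)

# Within each bike, look one trip ahead (the counterpart of dplyr::lead()).
dt[, `:=`(next.start.station.id = shift(start.station.id, type = "lead"),
          next.start.time = shift(starttime, type = "lead")),
   by = bikeid]

# Seconds between the end of one trip and the start of the next one.
dt[, diff.time := as.numeric(difftime(next.start.time, stoptime, units = "secs"))]

# Keep only the rows where the bike reappeared at a different station.
# The last trip of each bike has NA in the lead columns; an NA condition
# in i is treated as FALSE, so those rows are dropped here.
moved <- dt[end.station.id != next.start.station.id,
            .(bikeid, end.station.id, next.start.station.id,
              diff.time, stoptime, next.start.time)]

print(moved)
```

On the sample data this should give the same five rows as the dplyr version. Whether it is faster on the full dataset is something to measure rather than assume, although both versions avoid growing a data frame row by row, which is likely the main cost in the original loop.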
