
How to subset and find time difference between two data points using bike station data

I am experimenting with bike station data and have a for loop that extracts bikes that started at a different station from the one where they last stopped, then rearranges stoptime and starttime to show the movement of the bike by the operator (from where it stopped to where it next started), along with the difftime, i.e. the difference in time between when it last ended and when it next started.

Sample data

            starttime            stoptime start.station.id end.station.id bikeid
1 2017-01-16 13:08:18 2017-01-16 13:28:13             3156            466      1
2 2017-01-10 19:10:31 2017-01-10 19:16:02              422           3090      1
3 2017-01-04 08:47:42 2017-01-04 08:57:10              507            442      1
4 2017-01-12 18:08:33 2017-01-12 18:36:09              546           3151      2
5 2017-01-21 09:52:13 2017-01-21 10:21:07             3243            212      2
6 2017-01-26 05:46:18 2017-01-26 05:49:13              470            168      2

My code

raw_data = test

unique_id = unique(raw_data$bikeid)
output1 <- data.frame("bikeid"= integer(0), "end.station.id"= integer(0), "start.station.id" = integer(0), "diff.time" = numeric(0),  "stoptime" = character(),"starttime" = character(), stringsAsFactors=FALSE)

for (bikeid in unique_id)
{
  onebike <- raw_data[ which(raw_data$bikeid== bikeid), ]
  onebike$starttime <- strptime(onebike$starttime, "%Y-%m-%d %H:%M:%S", tz = "EST")
  onebike <- onebike[order(onebike$starttime, decreasing = FALSE),]

  if(nrow(onebike) >=2 ){
    for(i in 2:nrow(onebike )) {
      print(onebike)
      if(is.integer(onebike[i-1,"end.station.id"]) & is.integer(onebike[i,"start.station.id"]) &
         onebike[i-1,"end.station.id"] != onebike[i,"start.station.id"]){
        diff_time <- as.double(difftime(strptime(onebike[i,"starttime"], "%Y-%m-%d %H:%M:%S", tz = "EST"),
                                        strptime(onebike[i-1,"stoptime"], "%Y-%m-%d %H:%M:%S", tz = "EST")
                                        ,units = "secs"))
        new_row <- c(bikeid, onebike[i-1,"end.station.id"], onebike[i,"start.station.id"], diff_time, as.character(onebike[i-1,"stoptime"]), as.character(onebike[i,"starttime"]))
        output1[nrow(output1) + 1,] = new_row
      }
    }
  }
}

Output

  bikeid end.station.id start.station.id diff.time            stoptime           starttime
1      1            442              422    555201 2017-01-04 08:57:10 2017-01-10 19:10:31
2      1           3090             3156    496336 2017-01-10 19:16:02 2017-01-16 13:08:18
3      2           3151             3243    746164 2017-01-12 18:36:09 2017-01-21 09:52:13
4      2            212              470    415511 2017-01-21 10:21:07 2017-01-26 05:46:18
5      3           3112              351   1587161 2017-01-12 08:58:42 2017-01-30 17:51:23

However, on a large dataset this for loop takes a very long time. Is there a way to use dplyr or data.table to speed up this loop, or to rearrange the data in a way that avoids looping? I would appreciate any kind of explanation or suggestions.

Sample data (in dput)

structure(list(starttime = structure(c(1484572098, 1484075431, 
1483519662, 1484244513, 1484992333, 1485409578, 1484210616, 1483727948, 
1485798683), class = c("POSIXct", "POSIXt"), tzone = "UTC"), 
    stoptime = structure(c(1484573293, 1484075762, 1483520230, 
    1484246169, 1484994067, 1485409753, 1484211522, 1483729024, 
    1485799997), class = c("POSIXct", "POSIXt"), tzone = "UTC"), 
    start.station.id = c(3156L, 422L, 507L, 546L, 3243L, 470L, 
    439L, 309L, 351L), end.station.id = c(466L, 3090L, 442L, 
    3151L, 212L, 168L, 3112L, 439L, 433L), bikeid = c(1, 1, 1, 
    2, 2, 2, 3, 3, 3)), .Names = c("starttime", "stoptime", "start.station.id", 
"end.station.id", "bikeid"), row.names = c(NA, -9L), class = "data.frame")

One approach would be the following. I called your data foo. You probably want to start by sorting your data by bikeid and starttime. Then, for each bikeid, create new columns (i.e., next.start.station.id and next.start.time) using lead(). You also want to find the time difference using difftime(). After that, remove rows that have the same ID for end.station.id and next.start.station.id. Finally, arrange the columns as you wish.

library(dplyr)

foo %>%
  arrange(bikeid, starttime) %>%  # if necessary, arrange(bikeid, starttime, stoptime)
  group_by(bikeid) %>%
  # For each bike, pull the following trip's start station and start time onto this row
  mutate(next.start.station.id = lead(start.station.id),
         next.start.time = lead(starttime),
         diff.time = difftime(next.start.time, stoptime, units = "secs")) %>%
  # Keep only moved bikes; each bike's last trip has NA from lead() and is dropped too
  filter(end.station.id != next.start.station.id) %>%
  select(bikeid, end.station.id, next.start.station.id, diff.time, stoptime, next.start.time)


   bikeid end.station.id next.start.station.id diff.time stoptime            next.start.time    
    <dbl>          <int>                 <int> <time>    <dttm>              <dttm>             
 1   1.00            442                   422 555201    2017-01-04 08:57:10 2017-01-10 19:10:31
 2   1.00           3090                  3156 496336    2017-01-10 19:16:02 2017-01-16 13:08:18
 3   2.00           3151                  3243 746164    2017-01-12 18:36:09 2017-01-21 09:52:13
 4   2.00            212                   470 415511    2017-01-21 10:21:07 2017-01-26 05:46:18
 5   3.00           3112                   351 1587161   2017-01-12 08:58:42 2017-01-30 17:51:23
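Since the question also mentions data.table, the same steps can be sketched there as well. This is only an illustrative translation of the dplyr pipeline above, not something taken from the original answer: it assumes the data.table package is installed, rebuilds the sample data from the dput output as foo, and uses dt and moves as made-up names. shift(type = "lead") plays the role of lead().

```r
library(data.table)

# Sample data rebuilt from the dput output in the question (times are UTC).
foo <- data.frame(
  starttime = as.POSIXct(c(1484572098, 1484075431, 1483519662, 1484244513,
                           1484992333, 1485409578, 1484210616, 1483727948,
                           1485798683), origin = "1970-01-01", tz = "UTC"),
  stoptime = as.POSIXct(c(1484573293, 1484075762, 1483520230, 1484246169,
                          1484994067, 1485409753, 1484211522, 1483729024,
                          1485799997), origin = "1970-01-01", tz = "UTC"),
  start.station.id = c(3156L, 422L, 507L, 546L, 3243L, 470L, 439L, 309L, 351L),
  end.station.id = c(466L, 3090L, 442L, 3151L, 212L, 168L, 3112L, 439L, 433L),
  bikeid = c(1, 1, 1, 2, 2, 2, 3, 3, 3)
)

# dt and moves are illustrative names, not from the original post.
dt <- as.data.table(foo)

# Sort so that each bike's trips are in chronological order.
setorder(dt, bikeid, starttime)

# Within each bike, copy the next trip's start station and start time
# onto the current row (the last trip of each bike gets NA).
dt[, `:=`(next.start.station.id = shift(start.station.id, type = "lead"),
          next.start.time = shift(starttime, type = "lead")),
   by = bikeid]

# Gap in seconds between the end of this trip and the start of the next one.
dt[, diff.time := as.numeric(difftime(next.start.time, stoptime, units = "secs"))]

# Keep only rows where the bike reappeared at a different station.
moves <- dt[!is.na(next.start.station.id) &
              end.station.id != next.start.station.id,
            .(bikeid, end.station.id, next.start.station.id, diff.time,
              stoptime, next.start.time)]

print(moves)
```

On the sample data this should give the same five rows as the dplyr version. Whether it is faster on a large dataset depends on the data, so it is worth timing both on a realistic subset.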
