How to subset and find time difference between two data points using bike station data

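I am experimenting with bike station data and have a for loop that finds bikes whose next trip starts at a different station from the one where the previous trip ended. It then rearranges stoptime and starttime to show the operator's movement of the bike (from where it stopped to where it next started), together with the difftime, i.e. the time difference between that stop and the next start.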

Sample data

            starttime            stoptime start.station.id end.station.id bikeid
1 2017-01-16 13:08:18 2017-01-16 13:28:13             3156            466      1
2 2017-01-10 19:10:31 2017-01-10 19:16:02              422           3090      1
3 2017-01-04 08:47:42 2017-01-04 08:57:10              507            442      1
4 2017-01-12 18:08:33 2017-01-12 18:36:09              546           3151      2
5 2017-01-21 09:52:13 2017-01-21 10:21:07             3243            212      2
6 2017-01-26 05:46:18 2017-01-26 05:49:13              470            168      2

My code

raw_data = test

unique_id = unique(raw_data$bikeid)
output1 <- data.frame("bikeid"= integer(0), "end.station.id"= integer(0), "start.station.id" = integer(0), "diff.time" = numeric(0),  "stoptime" = character(),"starttime" = character(), stringsAsFactors=FALSE)

for (bikeid in unique_id)
{
  onebike <- raw_data[ which(raw_data$bikeid== bikeid), ]
  onebike$starttime <- strptime(onebike$starttime, "%Y-%m-%d %H:%M:%S", tz = "EST")
  onebike <- onebike[order(onebike$starttime, decreasing = FALSE),]

  if(nrow(onebike) >=2 ){
    for(i in 2:nrow(onebike )) {
      print(onebike)
      if(is.integer(onebike[i-1,"end.station.id"]) & is.integer(onebike[i,"start.station.id"]) &
         onebike[i-1,"end.station.id"] != onebike[i,"start.station.id"]){
        diff_time <- as.double(difftime(strptime(onebike[i,"starttime"], "%Y-%m-%d %H:%M:%S", tz = "EST"),
                                        strptime(onebike[i-1,"stoptime"], "%Y-%m-%d %H:%M:%S", tz = "EST")
                                        ,units = "secs"))
        new_row <- c(bikeid, onebike[i-1,"end.station.id"], onebike[i,"start.station.id"], diff_time, as.character(onebike[i-1,"stoptime"]), as.character(onebike[i,"starttime"]))
        output1[nrow(output1) + 1,] = new_row
      }
    }
  }
}

Output

  bikeid end.station.id start.station.id diff.time            stoptime           starttime
1      1            442              422    555201 2017-01-04 08:57:10 2017-01-10 19:10:31
2      1           3090             3156    496336 2017-01-10 19:16:02 2017-01-16 13:08:18
3      2           3151             3243    746164 2017-01-12 18:36:09 2017-01-21 09:52:13
4      2            212              470    415511 2017-01-21 10:21:07 2017-01-26 05:46:18
5      3           3112              351   1587161 2017-01-12 08:58:42 2017-01-30 17:51:23

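However, on a large dataset this for loop takes a long time. Is there a way to speed it up using dplyr or data.table, or to rearrange the data in a way that avoids the loop? Any explanation or advice would be much appreciated.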

Sample data (dput output)

structure(list(starttime = structure(c(1484572098, 1484075431, 
1483519662, 1484244513, 1484992333, 1485409578, 1484210616, 1483727948, 
1485798683), class = c("POSIXct", "POSIXt"), tzone = "UTC"), 
    stoptime = structure(c(1484573293, 1484075762, 1483520230, 
    1484246169, 1484994067, 1485409753, 1484211522, 1483729024, 
    1485799997), class = c("POSIXct", "POSIXt"), tzone = "UTC"), 
    start.station.id = c(3156L, 422L, 507L, 546L, 3243L, 470L, 
    439L, 309L, 351L), end.station.id = c(466L, 3090L, 442L, 
    3151L, 212L, 168L, 3112L, 439L, 433L), bikeid = c(1, 1, 1, 
    2, 2, 2, 3, 3, 3)), .Names = c("starttime", "stoptime", "start.station.id", 
"end.station.id", "bikeid"), row.names = c(NA, -9L), class = "data.frame")

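One way would be the following. I called your data foo. You probably want to sort the data by bikeid and starttime first. Then, for each bikeid, create new columns (next.start.station.id and next.start.time) using lead(), and compute the time difference with difftime(). After that, remove the rows where end.station.id and next.start.station.id have the same ID. Finally, arrange the columns as you need.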
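
As a small illustration of the key step, here is what lead() does to a single column. This is only a sketch: the vector holds the start stations of bike 1 from the sample data, sorted by start time, and the names start_station and next_start_station are made up for the example.

```r
library(dplyr)

# Start stations of one bike's trips, already sorted by start time
# (values taken from bike 1 in the sample data).
start_station <- c(507L, 422L, 3156L)

# lead() shifts the vector up by one position, so element i holds the
# value from row i + 1; the last element has no successor and becomes NA.
next_start_station <- lead(start_station)
print(next_start_station)
```

Each trip's end station can then be compared with the start station of the following trip in the same row, which is what removes the need for the inner loop.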

library(dplyr)

foo %>%
  arrange(bikeid, starttime) %>%  # if necessary, arrange(bikeid, starttime, stoptime)
  group_by(bikeid) %>%
  mutate(next.start.station.id = lead(start.station.id),
         next.start.time = lead(starttime),
         diff.time = difftime(next.start.time, stoptime, units = "secs")) %>%
  filter(end.station.id != next.start.station.id) %>%
  select(bikeid, end.station.id, next.start.station.id, diff.time, stoptime, next.start.time)


   bikeid end.station.id next.start.station.id diff.time stoptime            next.start.time    
    <dbl>          <int>                 <int> <time>    <dttm>              <dttm>             
 1   1.00            442                   422 555201    2017-01-04 08:57:10 2017-01-10 19:10:31
 2   1.00           3090                  3156 496336    2017-01-10 19:16:02 2017-01-16 13:08:18
 3   2.00           3151                  3243 746164    2017-01-12 18:36:09 2017-01-21 09:52:13
 4   2.00            212                   470 415511    2017-01-21 10:21:07 2017-01-26 05:46:18
 5   3.00           3112                   351 1587161   2017-01-12 08:58:42 2017-01-30 17:51:23
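
Since the question also mentions data.table, the same idea could be sketched there with shift(type = "lead") in place of lead(). This is an assumed translation of the dplyr pipeline above, not something benchmarked here; the names dt and moves are made up for the example, and foo is rebuilt from the values in the dput output so the block runs on its own.

```r
library(data.table)

# Rebuild the question's sample data (values copied from the dput output).
foo <- data.frame(
  starttime = as.POSIXct(c(1484572098, 1484075431, 1483519662, 1484244513,
                           1484992333, 1485409578, 1484210616, 1483727948,
                           1485798683), origin = "1970-01-01", tz = "UTC"),
  stoptime = as.POSIXct(c(1484573293, 1484075762, 1483520230, 1484246169,
                          1484994067, 1485409753, 1484211522, 1483729024,
                          1485799997), origin = "1970-01-01", tz = "UTC"),
  start.station.id = c(3156L, 422L, 507L, 546L, 3243L, 470L, 439L, 309L, 351L),
  end.station.id = c(466L, 3090L, 442L, 3151L, 212L, 168L, 3112L, 439L, 433L),
  bikeid = c(1, 1, 1, 2, 2, 2, 3, 3, 3)
)

# Work on a data.table copy, sorted by bike and start time.
dt <- as.data.table(foo)
setorder(dt, bikeid, starttime)

# Within each bike, pull the next trip's start station and start time
# onto the current row.
dt[, `:=`(next.start.station.id = shift(start.station.id, type = "lead"),
          next.start.time = shift(starttime, type = "lead")),
   by = bikeid]

# Seconds between this trip's stop and the next trip's start.
dt[, diff.time := as.numeric(difftime(next.start.time, stoptime, units = "secs"))]

# Keep only rows where the bike reappears at a different station; the last
# trip of each bike has no successor (NA) and is dropped explicitly.
moves <- dt[!is.na(next.start.station.id) &
              end.station.id != next.start.station.id,
            .(bikeid, end.station.id, next.start.station.id,
              diff.time, stoptime, next.start.time)]
print(moves)
```

On the sample data this yields the same five rows as the dplyr version. Here diff.time is stored as a plain numeric number of seconds rather than a difftime object, which is a choice made for this sketch.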
