How to subset and find time difference between two data points using bike station data
I am experimenting with bike station data and have a for loop that extracts bikes that started at a different station from the one where they last stopped, then rearranges stoptime and starttime to show the movement of the bike by the operator (from where it stopped to where it next started), along with the difftime, i.e. the difference in time between when it last ended and when it next started.
Sample data
starttime stoptime start.station.id end.station.id bikeid
1 2017-01-16 13:08:18 2017-01-16 13:28:13 3156 466 1
2 2017-01-10 19:10:31 2017-01-10 19:16:02 422 3090 1
3 2017-01-04 08:47:42 2017-01-04 08:57:10 507 442 1
4 2017-01-12 18:08:33 2017-01-12 18:36:09 546 3151 2
5 2017-01-21 09:52:13 2017-01-21 10:21:07 3243 212 2
6 2017-01-26 05:46:18 2017-01-26 05:49:13 470 168 2
My code
raw_data = test
unique_id = unique(raw_data$bikeid)
output1 <- data.frame("bikeid"= integer(0), "end.station.id"= integer(0), "start.station.id" = integer(0), "diff.time" = numeric(0), "stoptime" = character(),"starttime" = character(), stringsAsFactors=FALSE)
for (bikeid in unique_id)
{
  onebike <- raw_data[ which(raw_data$bikeid== bikeid), ]
  onebike$starttime <- strptime(onebike$starttime, "%Y-%m-%d %H:%M:%S", tz = "EST")
  onebike <- onebike[order(onebike$starttime, decreasing = FALSE),]
  if(nrow(onebike) >=2 ){
    for(i in 2:nrow(onebike )) {
      print(onebike)
      if(is.integer(onebike[i-1,"end.station.id"]) & is.integer(onebike[i,"start.station.id"]) &
         onebike[i-1,"end.station.id"] != onebike[i,"start.station.id"]){
        diff_time <- as.double(difftime(strptime(onebike[i,"starttime"], "%Y-%m-%d %H:%M:%S", tz = "EST"),
                                        strptime(onebike[i-1,"stoptime"], "%Y-%m-%d %H:%M:%S", tz = "EST"),
                                        units = "secs"))
        new_row <- c(bikeid, onebike[i-1,"end.station.id"], onebike[i,"start.station.id"], diff_time, as.character(onebike[i-1,"stoptime"]), as.character(onebike[i,"starttime"]))
        output1[nrow(output1) + 1,] = new_row
      }
    }
  }
}
Output
bikeid end.station.id start.station.id diff.time stoptime starttime
1 1 442 422 555201 2017-01-04 08:57:10 2017-01-10 19:10:31
2 1 3090 3156 496336 2017-01-10 19:16:02 2017-01-16 13:08:18
3 2 3151 3243 746164 2017-01-12 18:36:09 2017-01-21 09:52:13
4 2 212 470 415511 2017-01-21 10:21:07 2017-01-26 05:46:18
5 3 3112 351 1587161 2017-01-12 08:58:42 2017-01-30 17:51:23
However, on a large dataset this for loop takes a very long time. Is there a way to use dplyr or data.table to speed up this loop, or to rearrange the data in a way that avoids looping? I would appreciate any kind of explanation or suggestions.
Sample data (in dput)
structure(list(starttime = structure(c(1484572098, 1484075431,
1483519662, 1484244513, 1484992333, 1485409578, 1484210616, 1483727948,
1485798683), class = c("POSIXct", "POSIXt"), tzone = "UTC"),
stoptime = structure(c(1484573293, 1484075762, 1483520230,
1484246169, 1484994067, 1485409753, 1484211522, 1483729024,
1485799997), class = c("POSIXct", "POSIXt"), tzone = "UTC"),
start.station.id = c(3156L, 422L, 507L, 546L, 3243L, 470L,
439L, 309L, 351L), end.station.id = c(466L, 3090L, 442L,
3151L, 212L, 168L, 3112L, 439L, 433L), bikeid = c(1, 1, 1,
2, 2, 2, 3, 3, 3)), .Names = c("starttime", "stoptime", "start.station.id",
"end.station.id", "bikeid"), row.names = c(NA, -9L), class = "data.frame")
One approach would be the following. I called your data foo. You probably want to start by sorting your data by bikeid and starttime. Then, for each bikeid, create new columns (i.e., next.start.station.id and next.start.time) using lead(). You also want to find the time difference using difftime(). After that, remove the rows that have the same id for end.station.id and next.start.station.id. The last trip of each bike has no next trip, so lead() returns NA there and filter() drops that row as well. Finally, arrange the columns as you wish.
library(dplyr)
foo %>%
  arrange(bikeid, starttime) %>% # if necessary, arrange(bikeid, starttime, stoptime)
  group_by(bikeid) %>%
  mutate(next.start.station.id = lead(start.station.id),
         next.start.time = lead(starttime),
         diff.time = difftime(next.start.time, stoptime, units = "secs")) %>%
  filter(end.station.id != next.start.station.id) %>%
  select(bikeid, end.station.id, next.start.station.id, diff.time, stoptime, next.start.time)
bikeid end.station.id next.start.station.id diff.time stoptime next.start.time
<dbl> <int> <int> <time> <dttm> <dttm>
1 1.00 442 422 555201 2017-01-04 08:57:10 2017-01-10 19:10:31
2 1.00 3090 3156 496336 2017-01-10 19:16:02 2017-01-16 13:08:18
3 2.00 3151 3243 746164 2017-01-12 18:36:09 2017-01-21 09:52:13
4 2.00 212 470 415511 2017-01-21 10:21:07 2017-01-26 05:46:18
5 3.00 3112 351 1587161 2017-01-12 08:58:42 2017-01-30 17:51:23
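Since the question also mentions data.table, the same steps can be sketched there. This is only a minimal sketch, not a benchmarked solution: it assumes shift(type = "lead") as the counterpart of lead(), rebuilds foo from the dput output above so that it runs on its own, and uses the names dt and res purely for illustration.

```r
library(data.table)

# Sample data rebuilt from the question's dput() output (called foo, as above)
foo <- data.frame(
  starttime = as.POSIXct(c(1484572098, 1484075431, 1483519662, 1484244513,
                           1484992333, 1485409578, 1484210616, 1483727948,
                           1485798683), origin = "1970-01-01", tz = "UTC"),
  stoptime = as.POSIXct(c(1484573293, 1484075762, 1483520230, 1484246169,
                          1484994067, 1485409753, 1484211522, 1483729024,
                          1485799997), origin = "1970-01-01", tz = "UTC"),
  start.station.id = c(3156L, 422L, 507L, 546L, 3243L, 470L, 439L, 309L, 351L),
  end.station.id = c(466L, 3090L, 442L, 3151L, 212L, 168L, 3112L, 439L, 433L),
  bikeid = c(1, 1, 1, 2, 2, 2, 3, 3, 3)
)

# dt and res are illustrative names, not part of the original answer
dt <- as.data.table(foo)
setorder(dt, bikeid, starttime)  # sort by bike, then by start time

# For each bike, look ahead to the next trip's start station and start time
dt[, `:=`(next.start.station.id = shift(start.station.id, type = "lead"),
          next.start.time = shift(starttime, type = "lead")),
   by = bikeid]

# Seconds between the end of one trip and the start of the next
dt[, diff.time := as.numeric(difftime(next.start.time, stoptime, units = "secs"))]

# Keep only moves between different stations; the last trip of each bike
# has no next trip (NA) and is dropped explicitly
res <- dt[!is.na(next.start.station.id) &
            end.station.id != next.start.station.id,
          .(bikeid, end.station.id, next.start.station.id, diff.time,
            stoptime, next.start.time)]
print(res)
```

On the sample data this should give the same five rows as the dplyr result above. Whether it is faster than dplyr on a large dataset is not something this sketch measures.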