Time difference between data frame rows

Question

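I've been trawling through the R section of StackOverflow for quite a while looking for a proper answer, but nothing I've seen seems to apply to my problem. I have a dataset in this format (I've reduced it to the simplest form to work with, but the stop_sequence values are normally just an incrementing number for each stop):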

route_short_name    trip_id                     direction_id    departure_time  stop_sequence 
 33A                1.1598.0-33A-b12-1.451.I            1       16:15:00         start
 33A                1.1598.0-33A-b12-1.451.I            1       16:57:00           end
 41C                10.3265.0-41C-b12-1.277.I           1       08:35:00         start
 41C                10.3265.0-41C-b12-1.277.I           1       09:26:00           end
 41C                100.3260.0-41C-b12-1.276.I          1       09:40:00         start
 41C                100.3260.0-41C-b12-1.276.I          1       10:53:00           end
 114                1000.987.0-114-b12-1.86.O           0       21:35:00         start
 114                1000.987.0-114-b12-1.86.O           0       22:02:00           end
 39                 10000.2877.0-39-b12-1.242.I         1       11:15:00         start
 39                 10000.2877.0-39-b12-1.242.I         1       12:30:00           end

It's basically a bus trip dataset. What I want is the duration of each trip, in minutes, so something like this:

route_short_name    trip_id                    direction_id    duration
33A                1.1598.0-33A-b12-1.451.I            1        42
41C                10.3265.0-41C-b12-1.277.I           1        51
41C                100.3260.0-41C-b12-1.276.I          1        73
114                1000.987.0-114-b12-1.86.O           0        27
39                 10000.2877.0-39-b12-1.242.I         1        75

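I've tried a lot of things, but in no case have I managed to group the data by trip_id and then work on the two values of each group. I must be misunderstanding something, but I don't know what.

Does anyone have a clue?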

Answer 1

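Try this. Your data frame is currently in "long" format, but the time difference is easier to compute in "wide" format. The spread function (from tidyr, loaded with the tidyverse) reshapes the data from long to wide, giving one start and one end column per trip. From there, you can use the mutate function to add the new column. difftime needs date-time values rather than text, so start and end are parsed first, and units = "mins" makes sure the difference is reported in minutes; as.numeric then drops the difftime class and leaves a plain number.

library(tidyverse)

wide_df <- 
  spread(your_df, key = stop_sequence, value = departure_time) %>% 
  # assumes departure_time is "HH:MM:SS" text; parse it so difftime() can subtract
  mutate(start = as.POSIXct(start, format = "%H:%M:%S"),
         end   = as.POSIXct(end, format = "%H:%M:%S"),
         timediff = as.numeric(difftime(end, start, units = "mins")))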

If you want to learn more about "tidy" data (and about spread and gather), see Hadley's book.
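
In recent tidyr releases spread has been superseded by pivot_wider, so the same reshaping could also be sketched as below. This is only an illustration, not the original answer's code: the data frame trips is a hypothetical two-trip subset of the data in the question, and departure_time is assumed to be stored as "HH:MM:SS" text.

```r
library(dplyr)
library(tidyr)

# Hypothetical subset of the question's data (two trips, long format)
trips <- tibble(
  route_short_name = c("33A", "33A", "41C", "41C"),
  trip_id = c("1.1598.0-33A-b12-1.451.I", "1.1598.0-33A-b12-1.451.I",
              "100.3260.0-41C-b12-1.276.I", "100.3260.0-41C-b12-1.276.I"),
  direction_id = c(1L, 1L, 1L, 1L),
  departure_time = c("16:15:00", "16:57:00", "09:40:00", "10:53:00"),
  stop_sequence = c("start", "end", "start", "end")
)

wide_trips <- trips %>%
  # one row per trip, with "start" and "end" as columns
  pivot_wider(names_from = stop_sequence, values_from = departure_time) %>%
  mutate(
    # a fixed time zone avoids daylight-saving surprises when parsing
    start = as.POSIXct(start, format = "%H:%M:%S", tz = "UTC"),
    end   = as.POSIXct(end,   format = "%H:%M:%S", tz = "UTC"),
    # units = "mins" pins the unit instead of letting difftime() choose
    duration = as.numeric(difftime(end, start, units = "mins"))
  ) %>%
  select(route_short_name, trip_id, direction_id, duration)
```

For these two trips, wide_trips should hold durations of 42 and 73 minutes, matching the desired output in the question.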

Answer 2

We can also do this without converting to "wide" format (assuming that, for each 'route_short_name', 'trip_id' and 'direction_id', the 'stop_sequence' is 'start' followed by 'end').

Convert 'departure_time' to a datetime column, group by 'route_short_name', 'trip_id' and 'direction_id', and take the difftime between the last 'departure_time' and the first 'departure_time' of each group.

library(dplyr)

df1 %>%
    # parse the "HH:MM:SS" text into date-times
    mutate(departure_time = as.POSIXct(departure_time, format = '%H:%M:%S')) %>%
    group_by(route_short_name, trip_id, direction_id) %>%
    # last minus first departure per trip, in minutes
    summarise(duration = as.numeric(difftime(last(departure_time), first(departure_time), units = 'mins')))
# A tibble: 5 x 4
# Groups:   route_short_name, trip_id [?]
#  route_short_name                     trip_id direction_id duration
#             <chr>                       <chr>        <int>    <dbl>
#1              114   1000.987.0-114-b12-1.86.O            0       27
#2              33A    1.1598.0-33A-b12-1.451.I            1       42
#3               39 10000.2877.0-39-b12-1.242.I            1       75
#4              41C   10.3265.0-41C-b12-1.277.I            1       51
#5              41C  100.3260.0-41C-b12-1.276.I            1       73
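
The question mentions that stop_sequence is normally an incrementing number per stop rather than start/end labels. Under that assumption, the same grouped approach could pick the departure times at the lowest and highest stop_sequence, which also removes the dependence on row order. A minimal sketch; the data frame stops and all of its values are hypothetical, made up for illustration:

```r
library(dplyr)

# Hypothetical data: one row per stop, stop_sequence numbered 1, 2, 3, ...
stops <- tibble(
  trip_id = c("trip_a", "trip_a", "trip_a", "trip_b", "trip_b"),
  departure_time = c("16:15:00", "16:40:00", "16:57:00", "09:26:00", "08:35:00"),
  stop_sequence = c(1L, 2L, 3L, 2L, 1L)  # trip_b rows are deliberately out of order
)

durations <- stops %>%
  mutate(departure_time = as.POSIXct(departure_time, format = "%H:%M:%S", tz = "UTC")) %>%
  group_by(trip_id) %>%
  summarise(
    # departure at the highest stop number minus departure at the lowest one
    duration = as.numeric(difftime(
      departure_time[which.max(stop_sequence)],
      departure_time[which.min(stop_sequence)],
      units = "mins"
    ))
  )
```

With this toy data, trip_a comes out at 42 minutes and trip_b at 51 minutes, even though the rows of trip_b are not sorted.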

Answer 1
1 2017-10-28 00:48:52

Answer 2
1 accepted 2017-10-28 03:42:49
