I have a list of voyages that I need to group using a certain criterion.
Ship| From | To | Departure_From | Departure_To
1| HAMBURG | SETUBAL | 16-09-2018 22:12| 08-10-2018 13:42
1| SETUBAL | NAPOLI | 08-10-2018 13:42| 16-10-2018 00:18
2| HAMBURG | SETUBAL | 14-10-2018 18:30| 07-11-2018 13:55
2| SETUBAL | HAMBURG | 07-11-2018 13:55| 20-11-2018 13:16
3| JEDDAH | ALGECIRAS| 10-05-2018 21:46| 30-05-2018 17:20
3| ALGECIRAS| TANGIER | 30-05-2018 17:20| 31-05-2018 08:41
3| TANGIER | ALGECIRAS| 05-09-2018 21:34| 13-09-2018 22:22
3| ALGECIRAS| TANGIER | 13-09-2018 22:22| 15-09-2018 08:40
4| FOS | ALGECIRAS| 05-09-2018 11:02| 07-09-2018 20:18
4| ALGECIRAS| Baltiysk | 07-09-2018 20:18| 15-09-2018 05:28
4| Baltiysk | GDANSK | 15-09-2018 05:28| 16-09-2018 14:34
The Ship column holds the ship's number, From and To are port names, Departure_From is the departure time from the "From" port, and Departure_To is the departure time from the "To" port.
I need to group this data set in the following way. If a voyage is continuous, the Departure_To date of one entry is the same as the Departure_From date of the next entry, and the To port is the same as the next entry's From port. If they differ, it is a different voyage.
I want the final result to look like this.
Ship| From | To | Departure_From | Departure_To
1| HAMBURG | NAPOLI | 16-09-2018 22:12| 16-10-2018 00:18
2| HAMBURG | HAMBURG | 14-10-2018 18:30| 20-11-2018 13:16
3| JEDDAH | TANGIER | 10-05-2018 21:46| 31-05-2018 08:41
3| TANGIER | TANGIER | 05-09-2018 21:34| 15-09-2018 08:40
4| FOS | GDANSK | 05-09-2018 11:02| 16-09-2018 14:34
Code to create the above data set.
data.frame(Ship= c(1,1,2,2,3,3,3,3,4,4,4),
From=c("HAMBURG","SETUBAL","HAMBURG","SETUBAL","JEDDAH","ALGECIRAS","TANGIER","ALGECIRAS","FOS SUR MER","ALGECIRAS","Baltiysk"),
To= c("SETUBAL","NAPOLI","SETUBAL","HAMBURG","ALGECIRAS","TANGIER","ALGECIRAS","TANGIER","ALGECIRAS","Baltiysk","GDANSK"),
Departure_From= c("16-09-2018 22:12:00",
"08-10-2018 13:42:00",
"14-10-2018 18:30:00",
"07-11-2018 13:55:00",
"10-05-2018 21:46:00",
"30-05-2018 17:20:00",
"05-09-2018 21:34:00",
"13-09-2018 22:22:00",
"05-09-2018 11:02:00",
"07-09-2018 20:18:00",
"15-09-2018 05:28:00"),
Departure_To= c("08-10-2018 13:42:00",
"16-10-2018 00:18:00",
"07-11-2018 13:55:00",
"20-11-2018 13:16:00",
"30-05-2018 17:20:00",
"31-05-2018 08:41:00",
"13-09-2018 22:22:00",
"15-09-2018 08:40:00",
"07-09-2018 20:18:00",
"15-09-2018 05:28:00",
"16-09-2018 14:34:00"
))
Any help would be highly appreciated. (Would prefer to do it in Tidyverse, since I'm comfortable with that)
The trick with creating grouping ids is to use cumsum with dplyr::lag (or lead) and figure out how to make only the rows where you want a new group to start evaluate to TRUE. Here we want to mark a new trip if a row has a different Departure_From than the previous row's Departure_To. If it is the first row for that ship, it will automatically be different because we set default = "".
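To see what the flag and the running count look like before any summarising, here is a minimal sketch on a hypothetical three-row toy table. The data, the placeholder timestamps "t1", "t2", ... and the helper columns prev_to and new_trip are illustrative assumptions, not part of the original data.

```r
library(dplyr)

# Hypothetical toy data: one ship, three legs. The third leg does not
# pick up where the second ended, so it should start a new trip.
legs <- tibble(
  Ship           = c(1, 1, 1),
  Departure_From = c("t1", "t2", "t9"),
  Departure_To   = c("t2", "t3", "t10")
)

flagged <- legs %>%
  group_by(Ship) %>%
  mutate(
    prev_to  = lag(Departure_To, default = ""),  # previous row's Departure_To
    new_trip = Departure_From != prev_to,        # TRUE where a trip starts
    trip_num = cumsum(new_trip)                  # running count of trip starts
  ) %>%
  ungroup()

flagged
```

Because cumsum treats TRUE as 1 and FALSE as 0, trip_num only increases on rows flagged as the start of a trip, so every row of a continuous voyage shares one id.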
After we have the trip number for each ship, it's easy to summarise to get the first and last values for each trip respectively. Note that your provided data calls the city FOS SUR MER rather than FOS as in the printed table, so that is the name that appears in the output.
library(tidyverse)
tbl <- tibble(
  Ship = c(1, 1, 2, 2, 3, 3, 3, 3, 4, 4, 4),
  From = c("HAMBURG", "SETUBAL", "HAMBURG", "SETUBAL", "JEDDAH", "ALGECIRAS",
           "TANGIER", "ALGECIRAS", "FOS SUR MER", "ALGECIRAS", "Baltiysk"),
  To = c("SETUBAL", "NAPOLI", "SETUBAL", "HAMBURG", "ALGECIRAS", "TANGIER",
         "ALGECIRAS", "TANGIER", "ALGECIRAS", "Baltiysk", "GDANSK"),
  Departure_From = c("16-09-2018 22:12:00", "08-10-2018 13:42:00",
                     "14-10-2018 18:30:00", "07-11-2018 13:55:00",
                     "10-05-2018 21:46:00", "30-05-2018 17:20:00",
                     "05-09-2018 21:34:00", "13-09-2018 22:22:00",
                     "05-09-2018 11:02:00", "07-09-2018 20:18:00",
                     "15-09-2018 05:28:00"),
  Departure_To = c("08-10-2018 13:42:00", "16-10-2018 00:18:00",
                   "07-11-2018 13:55:00", "20-11-2018 13:16:00",
                   "30-05-2018 17:20:00", "31-05-2018 08:41:00",
                   "13-09-2018 22:22:00", "15-09-2018 08:40:00",
                   "07-09-2018 20:18:00", "15-09-2018 05:28:00",
                   "16-09-2018 14:34:00")
)
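tbl %>%
  group_by(Ship) %>%
  # A new trip starts whenever this leg does not begin where the previous one ended
  mutate(trip_num = cumsum(Departure_From != lag(Departure_To, default = ""))) %>%
  group_by(Ship, trip_num) %>%
  # Collapse each trip to its first origin/departure and last destination/departure
  summarise(
    From = first(From),
    To = last(To),
    Departure_From = first(Departure_From),
    Departure_To = last(Departure_To)
  )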
#> # A tibble: 5 x 6
#> # Groups:   Ship [4]
#>    Ship trip_num From        To      Departure_From     Departure_To
#>   <dbl>    <int> <chr>       <chr>   <chr>              <chr>
#> 1     1        1 HAMBURG     NAPOLI  16-09-2018 22:12:… 16-10-2018 00:18…
#> 2     2        1 HAMBURG     HAMBURG 14-10-2018 18:30:… 20-11-2018 13:16…
#> 3     3        1 JEDDAH      TANGIER 10-05-2018 21:46:… 31-05-2018 08:41…
#> 4     3        2 TANGIER     TANGIER 05-09-2018 21:34:… 15-09-2018 08:40…
#> 5     4        1 FOS SUR MER GDANSK  05-09-2018 11:02:… 16-09-2018 14:34…
Created on 2019-04-25 by the reprex package (v0.2.1)
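The question also says that the port should match between consecutive legs, and the answer above compares only the timestamps as strings. A possible variant, sketched here under a few assumptions, parses the timestamps with lubridate::dmy_hms, sorts the legs, and requires both the port and the time to line up. The names ship3, continues and trips are illustrative, only Ship 3's rows are used to keep the example short, and across() and the .groups argument assume dplyr 1.0 or later.

```r
library(dplyr)
library(lubridate)

# Ship 3's four legs from the question, as a small self-contained sample
ship3 <- tibble(
  Ship = 3,
  From = c("JEDDAH", "ALGECIRAS", "TANGIER", "ALGECIRAS"),
  To   = c("ALGECIRAS", "TANGIER", "ALGECIRAS", "TANGIER"),
  Departure_From = c("10-05-2018 21:46:00", "30-05-2018 17:20:00",
                     "05-09-2018 21:34:00", "13-09-2018 22:22:00"),
  Departure_To   = c("30-05-2018 17:20:00", "31-05-2018 08:41:00",
                     "13-09-2018 22:22:00", "15-09-2018 08:40:00")
)

trips <- ship3 %>%
  # Parse day-month-year strings into date-times so they sort chronologically
  mutate(across(starts_with("Departure"), dmy_hms)) %>%
  arrange(Ship, Departure_From) %>%
  group_by(Ship) %>%
  mutate(
    # A leg continues the previous one only if both port and time match;
    # the first leg of a ship compares against NA and so yields NA
    continues = From == lag(To) & Departure_From == lag(Departure_To),
    # Treat NA as "does not continue", then count trip starts
    trip_num  = cumsum(!coalesce(continues, FALSE))
  ) %>%
  group_by(Ship, trip_num) %>%
  summarise(
    From = first(From),
    To = last(To),
    Departure_From = first(Departure_From),
    Departure_To = last(Departure_To),
    .groups = "drop"
  )

trips
```

On this sample the stricter check produces the same two trips as the string comparison. It would only differ on data where a leg starts at the same instant the previous one ended but from a different port, or where the rows are not already in chronological order.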