
How can I create a grouping variable marking continuous journeys?

I have a list of voyages that I need to group using certain criteria.

Ship| From     |  To      | Departure_From  | Departure_To
   1| HAMBURG  | SETUBAL  | 16-09-2018 22:12| 08-10-2018 13:42
   1| SETUBAL  | NAPOLI   | 08-10-2018 13:42| 16-10-2018 00:18
   2| HAMBURG  | SETUBAL  | 14-10-2018 18:30| 07-11-2018 13:55
   2| SETUBAL  | HAMBURG  | 07-11-2018 13:55| 20-11-2018 13:16
   3| JEDDAH   | ALGECIRAS| 10-05-2018 21:46| 30-05-2018 17:20
   3| ALGECIRAS| TANGIER  | 30-05-2018 17:20| 31-05-2018 08:41
   3| TANGIER  | ALGECIRAS| 05-09-2018 21:34| 13-09-2018 22:22
   3| ALGECIRAS| TANGIER  | 13-09-2018 22:22| 15-09-2018 08:40
   4| FOS      | ALGECIRAS| 05-09-2018 11:02| 07-09-2018 20:18
   4| ALGECIRAS| Baltiysk | 07-09-2018 20:18| 15-09-2018 05:28
   4| Baltiysk | GDANSK   | 15-09-2018 05:28| 16-09-2018 14:34

The Ship column holds the ship's number, From and To are port names, Departure_From is the departure time from the "From" port, and Departure_To is the departure time from the "To" port. I need to group this data set as follows. If a voyage is continuous, the Departure_To date of one entry is the same as the Departure_From date of the next entry, and the To port of one entry is the From port of the next. If they differ, it is a different voyage.

  1. Ship no. 1 departs from Hamburg for Setubal, and on the next leg departs from Setubal for Napoli. The Departure_To date of the first entry is the same as the Departure_From date of the next entry, and so is the port, so this is one continuous voyage. I would like to combine it into a single voyage from Hamburg (the first port) to Napoli (the last port), where Departure_From is the departure date from Hamburg and Departure_To is the departure date from Napoli.
  2. For ship no. 3, there are two voyages. The first is from Jeddah to Algeciras and then Algeciras to Tangier (one continuous voyage). The second is from Tangier to Algeciras and then Algeciras back to Tangier. So in this case there should be two groups, one from Jeddah to Tangier and one from Tangier to Tangier.
  3. The case for ship no. 4 is a bit more complicated: the ship starts from Fos and goes to Algeciras, then from Algeciras to Baltiysk, and finally from Baltiysk to GDANSK. Here the 3 legs should be combined into one voyage from Fos to GDANSK, because each Departure_To date is the same as the Departure_From date of the next entry.

I want the final result to look like this.

Ship| From     |  To      | Departure_From  | Departure_To
   1| HAMBURG  | NAPOLI   | 16-09-2018 22:12| 16-10-2018 00:18
   2| HAMBURG  | HAMBURG  | 14-10-2018 18:30| 20-11-2018 13:16
   3| JEDDAH   | TANGIER  | 10-05-2018 21:46| 31-05-2018 08:41
   3| TANGIER  | TANGIER  | 05-09-2018 21:34| 15-09-2018 08:40
   4| FOS      | GDANSK   | 05-09-2018 11:02| 16-09-2018 14:34

Code to create the above data set.

data.frame(Ship= c(1,1,2,2,3,3,3,3,4,4,4), 
           From=c("HAMBURG","SETUBAL","HAMBURG","SETUBAL","JEDDAH","ALGECIRAS","TANGIER","ALGECIRAS","FOS SUR MER","ALGECIRAS","Baltiysk"), 
           To= c("SETUBAL","NAPOLI","SETUBAL","HAMBURG","ALGECIRAS","TANGIER","ALGECIRAS","TANGIER","ALGECIRAS","Baltiysk","GDANSK"), 
           Departure_From= c("16-09-2018  22:12:00",
                "08-10-2018  13:42:00",
                "14-10-2018  18:30:00",
                "07-11-2018  13:55:00",
                "10-05-2018  21:46:00",
                "30-05-2018  17:20:00",
                "05-09-2018  21:34:00",
                "13-09-2018  22:22:00",
                "05-09-2018  11:02:00",
                "07-09-2018  20:18:00",
                "15-09-2018  05:28:00"), 
           Departure_To= c("08-10-2018  13:42:00",
               "16-10-2018  00:18:00",
               "07-11-2018  13:55:00",
               "20-11-2018  13:16:00",
               "30-05-2018  17:20:00",
               "31-05-2018  08:41:00",
               "13-09-2018  22:22:00",
               "15-09-2018  08:40:00",
               "07-09-2018  20:18:00",
               "15-09-2018  05:28:00",
               "16-09-2018  14:34:00"
))

Any help would be highly appreciated. (I would prefer a tidyverse solution, since I'm comfortable with it.)

The trick to creating grouping ids is to combine cumsum with dplyr::lag (or lead) so that only the rows where a new group should start evaluate to TRUE. Here we mark a new trip whenever a row's Departure_From differs from the previous row's Departure_To. The first row for each ship always counts as different, because lag has no previous value there and we set default = "".
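As a minimal sketch of that idea, separate from the real data, the snippet below runs the comparison on two toy vectors for a single ship. The values "t1" to "t7" and the names starts, ends, new_trip are hypothetical placeholders, not part of the question's data.

```r
library(dplyr)

# Toy timestamps for one ship's legs (hypothetical placeholder values).
starts <- c("t1", "t2", "t5", "t6")
ends   <- c("t2", "t3", "t6", "t7")

# TRUE where a leg does not start at the moment the previous leg ended.
# The first leg is compared against the default "", so it is always TRUE.
new_trip <- starts != lag(ends, default = "")

# Each TRUE raises the running total by one, so all legs of a trip share an id.
trip_num <- cumsum(new_trip)

print(new_trip)  # TRUE FALSE TRUE FALSE
print(trip_num)  # 1 1 2 2
```

The third leg starts at "t5" while the second ended at "t3", so it opens trip 2.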

Once each ship's rows have a trip number, it is easy to summarise each trip by taking the first From and Departure_From and the last To and Departure_To. Note that your provided data calls the city FOS SUR MER, not FOS as in the tables above.

library(tidyverse)
tbl <- tibble(
  Ship = c(1, 1, 2, 2, 3, 3, 3, 3, 4, 4, 4),
  From = c("HAMBURG", "SETUBAL", "HAMBURG", "SETUBAL", "JEDDAH", "ALGECIRAS",
           "TANGIER", "ALGECIRAS", "FOS SUR MER", "ALGECIRAS", "Baltiysk"),
  To = c("SETUBAL", "NAPOLI", "SETUBAL", "HAMBURG", "ALGECIRAS", "TANGIER",
         "ALGECIRAS", "TANGIER", "ALGECIRAS", "Baltiysk", "GDANSK"),
  Departure_From = c("16-09-2018  22:12:00", "08-10-2018  13:42:00",
                     "14-10-2018  18:30:00", "07-11-2018  13:55:00",
                     "10-05-2018  21:46:00", "30-05-2018  17:20:00",
                     "05-09-2018  21:34:00", "13-09-2018  22:22:00",
                     "05-09-2018  11:02:00", "07-09-2018  20:18:00",
                     "15-09-2018  05:28:00"),
  Departure_To = c("08-10-2018  13:42:00", "16-10-2018  00:18:00",
                   "07-11-2018  13:55:00", "20-11-2018  13:16:00",
                   "30-05-2018  17:20:00", "31-05-2018  08:41:00",
                   "13-09-2018  22:22:00", "15-09-2018  08:40:00",
                   "07-09-2018  20:18:00", "15-09-2018  05:28:00",
                   "16-09-2018  14:34:00")
)
tbl %>%
  group_by(Ship) %>%
  mutate(trip_num = cumsum(Departure_From != lag(Departure_To, default = ""))) %>%
  group_by(Ship, trip_num) %>%
  summarise(
    From = first(From),
    To = last(To),
    Departure_From = first(Departure_From),
    Departure_To = last(Departure_To)
  )
#> # A tibble: 5 x 6
#> # Groups:   Ship [4]
#>    Ship trip_num From        To      Departure_From      Departure_To      
#>   <dbl>    <int> <chr>       <chr>   <chr>               <chr>             
#> 1     1        1 HAMBURG     NAPOLI  16-09-2018  22:12:… 16-10-2018  00:18…
#> 2     2        1 HAMBURG     HAMBURG 14-10-2018  18:30:… 20-11-2018  13:16…
#> 3     3        1 JEDDAH      TANGIER 10-05-2018  21:46:… 31-05-2018  08:41…
#> 4     3        2 TANGIER     TANGIER 05-09-2018  21:34:… 15-09-2018  08:40…
#> 5     4        1 FOS SUR MER GDANSK  05-09-2018  11:02:… 16-09-2018  14:34…

Created on 2019-04-25 by the reprex package (v0.2.1)
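
The code above compares only the timestamps, and it compares them as strings. The question also requires the port to match, so a stricter variant might look like the sketch below. It is an illustration rather than part of the original answer: it assumes dplyr 1.0 or later (for across and .groups), assumes the timestamps are in UTC, uses only ships 1 and 3 from the question's data, and introduces the hypothetical names legs, parse_dt, new_trip and voyages.

```r
library(dplyr)

# Ships 1 and 3 from the question's data, kept short for illustration.
legs <- tibble(
  Ship = c(1, 1, 3, 3, 3, 3),
  From = c("HAMBURG", "SETUBAL", "JEDDAH", "ALGECIRAS", "TANGIER", "ALGECIRAS"),
  To   = c("SETUBAL", "NAPOLI", "ALGECIRAS", "TANGIER", "ALGECIRAS", "TANGIER"),
  Departure_From = c("16-09-2018  22:12:00", "08-10-2018  13:42:00",
                     "10-05-2018  21:46:00", "30-05-2018  17:20:00",
                     "05-09-2018  21:34:00", "13-09-2018  22:22:00"),
  Departure_To   = c("08-10-2018  13:42:00", "16-10-2018  00:18:00",
                     "30-05-2018  17:20:00", "31-05-2018  08:41:00",
                     "13-09-2018  22:22:00", "15-09-2018  08:40:00")
)

# Hypothetical helper: parse day-month-year strings (time zone is an assumption).
parse_dt <- function(x) as.POSIXct(x, format = "%d-%m-%Y %H:%M:%S", tz = "UTC")

voyages <- legs %>%
  mutate(across(starts_with("Departure_"), parse_dt)) %>%
  # Sort chronologically within each ship so lag() sees the previous leg.
  arrange(Ship, Departure_From) %>%
  group_by(Ship) %>%
  mutate(
    # A new trip starts on the ship's first leg, or when either the port
    # or the timestamp fails to line up with the previous leg.
    new_trip = is.na(lag(To)) |
      From != lag(To) |
      Departure_From != lag(Departure_To),
    trip_num = cumsum(new_trip)
  ) %>%
  group_by(Ship, trip_num) %>%
  summarise(
    From = first(From),
    To = last(To),
    Departure_From = first(Departure_From),
    Departure_To = last(Departure_To),
    .groups = "drop"
  )

print(voyages)
```

Parsing the timestamps also lets the rows be sorted chronologically, which the day-first strings would not allow. On this subset the result should have three rows: one voyage for ship 1 and two for ship 3.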
