
Sum values based on values of two columns in R

I am currently working on an air traffic dataset that contains origins, destinations and some other air-traffic-related information. For my analysis, I would like to combine rows whenever the flights connect the same two cities, regardless of direction.

For example, the data of flights from Seattle to Portland need to be combined with the data of flights from Portland to Seattle.

Here is a sample of the dataset:

airtravel <- structure(list(CARRIER = structure(c(6L, 13L, 6L, 1L, 1L, 13L, 
17L, 17L, 13L, 13L, 13L, 13L, 2L, 1L, 13L), .Label = c("9E", 
"AA", "AS", "B6", "DL", "EV", "F9", "G4", "HA", "MQ", "NK", "OH", 
"OO", "UA", "WN", "YV", "YX"), class = "factor"), OD = c("DCA - ORD", 
"PDX - SEA", "ORD - DCA", "CHA - ATL", "ATL - CHA", "ELM - DTW", 
"LGA - RIC", "RIC - LGA", "DTW - ELM", "BZN - SEA", "SEA - BZN", 
"SEA - PDX", "DCA - LGA", "AVL - ATL", "SFO - SNA"), diff = c(164, 158, 146, 
    142, 141, 138, 138, 138, 136, 130, 130, 130, 127, 124, 124
    )), row.names = c(2983L, 7423L, 3217L, 115L, 17L, 6737L, 
11042L, 11315L, 6669L, 6370L, 7624L, 7636L, 685L, 66L, 7693L), class = "data.frame")

I would like to sum the diff values of all rows that involve the same two cities. Could someone shed some light on how to solve this?

Thanks in advance!

You can split the OD column into source and destination on the '-' separator, order each pair row-wise with pmin and pmax so that both directions get the same key, and then sum diff within each pair.

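library(dplyr)   # tidyr must also be installed for tidyr::separate()

airtravel %>%
  # split "DCA - ORD" into two columns, ignoring spaces around the hyphen
  tidyr::separate(OD, c('source', 'destination'), sep = '\\s*-\\s*') %>%
  # pmin/pmax put the alphabetically smaller code first, so A - B and B - A match
  group_by(grp = pmin(source, destination), grp2 = pmax(source, destination)) %>%
  summarise(diff = sum(diff), .groups = 'drop')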


#  grp   grp2   diff
#  <chr> <chr> <dbl>
#1 ATL   AVL     124
#2 ATL   CHA     283
#3 BZN   SEA     260
#4 DCA   LGA     127
#5 DCA   ORD     310
#6 DTW   ELM     274
#7 LGA   RIC     276
#8 PDX   SEA     288
#9 SFO   SNA     124

If you want to keep more columns, you can add them to group_by.
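As a minimal sketch of that idea, the snippet below also groups by CARRIER. To stay self-contained it uses only five rows copied from the question's sample rather than the full airtravel data frame; the names sample_rows and by_carrier are made up for this illustration. It assumes dplyr and tidyr are installed.

```r
library(dplyr)

# Five rows copied from the question's sample data (not the full dataset)
sample_rows <- data.frame(
  CARRIER = c("EV", "EV", "OO", "OO", "AA"),
  OD      = c("DCA - ORD", "ORD - DCA", "PDX - SEA", "SEA - PDX", "DCA - LGA"),
  diff    = c(164, 146, 158, 130, 127),
  stringsAsFactors = FALSE
)

by_carrier <- sample_rows %>%
  tidyr::separate(OD, c("source", "destination"), sep = "\\s*-\\s*") %>%
  # CARRIER is now part of the grouping key, so it survives summarise()
  group_by(CARRIER,
           grp  = pmin(source, destination),
           grp2 = pmax(source, destination)) %>%
  summarise(diff = sum(diff), .groups = "drop")

by_carrier
```

Note that any column added to group_by also splits the totals: if two carriers fly the same city pair, each carrier gets its own row.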

We can also do this in base R: split the 'OD' column, sort the two codes within each pair, paste them back together, and use the result as the grouping variable in aggregate.

# build a direction-independent key such as "DCA - ORD" for every row
key <- sapply(strsplit(airtravel$OD, "\\s*-\\s*"),
      function(x) paste(sort(x), collapse=" - "))
aggregate(airtravel$diff, list(OD = key), FUN = sum)
#         OD   x
#1 ATL - AVL 124
#2 ATL - CHA 283
#3 BZN - SEA 260
#4 DCA - LGA 127
#5 DCA - ORD 310
#6 DTW - ELM 274
#7 LGA - RIC 276
#8 PDX - SEA 288
#9 SFO - SNA 124
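To see what the grouping key looks like before it is passed to aggregate, here is a small sketch that builds it step by step for four OD strings taken from the question's sample. The variable names (od, parts, key, totals) are chosen for this illustration only, and tapply is used in place of aggregate simply to keep the output short.

```r
# Four OD strings and their diff values, copied from the question's sample
od   <- c("DCA - ORD", "ORD - DCA", "PDX - SEA", "SEA - PDX")
diff <- c(164, 146, 158, 130)

# Split each string on the hyphen, allowing optional spaces around it
parts <- strsplit(od, "\\s*-\\s*")

# Sort the two airport codes within each pair and paste them back together,
# so both directions of a route end up with the same key
key <- vapply(parts, function(x) paste(sort(x), collapse = " - "), character(1))
key
# "DCA - ORD" "DCA - ORD" "PDX - SEA" "PDX - SEA"

# Sum diff within each key
totals <- tapply(diff, key, sum)
totals
```

Here both directions of each route collapse to one key, giving 310 for DCA - ORD and 288 for PDX - SEA, which matches the corresponding rows in the outputs above.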

