
Add column from another data.frame based on multiple criteria

I have 2 data frames:

cars = data.frame(car_id=c(1,2,2,3,4,5,5), 
                  max_speed=c(150,180,185, 200, 210, 230,235),
                  since=c('2000-01-01', '2000-01-01', '2007-10-01', '2000-01-01', '2000-01-01', '2000-01-01', '2009-11-18'))

voyages = data.frame(voy_id=c(1234,1235,1236,1237,1238),
                     car_id=c(1,2,3,4,5), 
                     date=c('2000-01-01', '2002-02-02', '2003-03-03', '2004-04-04', '2010-05-05'))

If you look closely, you can see that cars occasionally has multiple entries for a car_id, because the manufacturer decided to increase the max speed of that make. Each entry has a date, since, that indicates the date from which that max speed applies.

My goal: I want to add the max_speed variable to the voyages data frame based on the values found in cars. I can't just join the 2 data frames by car_id, because I also have to compare date in voyages with since in cars to determine the proper max_speed.
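To make the problem concrete, here is a small sketch of what a plain join on car_id gives with the data above (the name naive is only for illustration). Each voyage gets paired with every speed record of its car, so the voyages of cars 2 and 5 show up twice.

```r
cars <- data.frame(car_id = c(1, 2, 2, 3, 4, 5, 5),
                   max_speed = c(150, 180, 185, 200, 210, 230, 235),
                   since = c('2000-01-01', '2000-01-01', '2007-10-01', '2000-01-01',
                             '2000-01-01', '2000-01-01', '2009-11-18'))

voyages <- data.frame(voy_id = c(1234, 1235, 1236, 1237, 1238),
                      car_id = c(1, 2, 3, 4, 5),
                      date = c('2000-01-01', '2002-02-02', '2003-03-03',
                               '2004-04-04', '2010-05-05'))

# Plain join on car_id: one row per (voyage, speed record) pair
naive <- merge(voyages, cars, by = "car_id")

nrow(naive)          # 7 rows, although there are only 5 voyages
table(naive$voy_id)  # voy_id 1235 and 1238 each appear twice
```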

Question: What is an elegant way to do this without loops?

One approach:

Merge the two datasets, including duplicated observations in "cars". Drop any observations where the date for "since" is later than the date for "date". Order the dataset so most recent dates are first, then drop duplicated observations for "voy_id"--this ensures that where there are two dates in "since", you'll only keep the most recent one that occurs before the voyage date.

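# Pair every voyage with every speed record of its car
z <- merge(cars, voyages, by="car_id")
# Keep only speed records already in effect on the voyage date
z <- z[as.Date(z$since)<=as.Date(z$date),]
# Sort so the most recent "since" comes first
z <- z[order(as.Date(z$since), decreasing=TRUE),]
# Keep the first (most recent) row for each voyage
z <- z[!duplicated(z$voy_id),]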
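The result z has one row per voyage, but it is a separate data frame and is ordered by "since". If the aim is a max_speed column on voyages itself, one possible follow-up (a sketch, not the only way) is to look the values up by voy_id with match(). Any voyage with no qualifying row in "cars" would get NA, though that does not happen with this data. The data frames are repeated so the snippet runs on its own.

```r
cars <- data.frame(car_id = c(1, 2, 2, 3, 4, 5, 5),
                   max_speed = c(150, 180, 185, 200, 210, 230, 235),
                   since = c('2000-01-01', '2000-01-01', '2007-10-01', '2000-01-01',
                             '2000-01-01', '2000-01-01', '2009-11-18'))

voyages <- data.frame(voy_id = c(1234, 1235, 1236, 1237, 1238),
                      car_id = c(1, 2, 3, 4, 5),
                      date = c('2000-01-01', '2002-02-02', '2003-03-03',
                               '2004-04-04', '2010-05-05'))

# Same steps as above: merge, filter, sort, de-duplicate
z <- merge(cars, voyages, by = "car_id")
z <- z[as.Date(z$since) <= as.Date(z$date), ]
z <- z[order(as.Date(z$since), decreasing = TRUE), ]
z <- z[!duplicated(z$voy_id), ]

# Copy max_speed back onto voyages, aligned by voy_id
# (voyages without a match in z would get NA)
voyages$max_speed <- z$max_speed[match(voyages$voy_id, z$voy_id)]

voyages$max_speed  # 150 180 200 210 235
```

Using match() rather than a second merge keeps voyages in its original row order and does not drop rows.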

Also curious to see if someone comes up with a more elegant, parsimonious approach.
