I have 2 data frames:
cars <- data.frame(car_id = c(1, 2, 2, 3, 4, 5, 5),
                   max_speed = c(150, 180, 185, 200, 210, 230, 235),
                   since = c('2000-01-01', '2000-01-01', '2007-10-01', '2000-01-01', '2000-01-01', '2000-01-01', '2009-11-18'))
voyages <- data.frame(voy_id = c(1234, 1235, 1236, 1237, 1238),
                      car_id = c(1, 2, 3, 4, 5),
                      date = c('2000-01-01', '2002-02-02', '2003-03-03', '2004-04-04', '2010-05-05'))
If you look closely, you can see that cars occasionally has multiple entries for a car_id, because the manufacturer decided to increase the max speed of that make. Each entry has a date, since, that marks the date from which that max speed applies.
My goal: I want to add the max_speed variable to the voyages data frame based on the values found in cars. I can't just join the 2 data frames by car_id, because I also have to check the date in voyages and compare it to since in cars to determine the proper max_speed.
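To illustrate why a plain join is not enough, here is a minimal sketch using the sample data from above (the variable name naive is only for illustration). Joining on car_id alone returns one row per matching cars entry, so voyages of cars 2 and 5 appear twice:

```r
# Sample data from the question
cars <- data.frame(
  car_id    = c(1, 2, 2, 3, 4, 5, 5),
  max_speed = c(150, 180, 185, 200, 210, 230, 235),
  since     = c("2000-01-01", "2000-01-01", "2007-10-01", "2000-01-01",
                "2000-01-01", "2000-01-01", "2009-11-18")
)
voyages <- data.frame(
  voy_id = c(1234, 1235, 1236, 1237, 1238),
  car_id = c(1, 2, 3, 4, 5),
  date   = c("2000-01-01", "2002-02-02", "2003-03-03", "2004-04-04", "2010-05-05")
)

# Naive join on car_id only: every cars entry for a car_id is matched
naive <- merge(voyages, cars, by = "car_id")
nrow(naive)     # 7 rows, although there are only 5 voyages
nrow(voyages)   # 5
```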
Question: What is the elegant way to do this without loops?
One approach:
Merge the two datasets, keeping the duplicated observations in "cars". Drop any observations where "since" is later than "date". Order the dataset so the most recent "since" dates come first, then drop duplicated observations for "voy_id". This ensures that where a car has two "since" dates, you keep only the most recent one that falls on or before the voyage date.
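# join every voyage to every speed record of its car
z <- merge(cars, voyages, by="car_id")
# keep only speed records already in effect on the voyage date
z <- z[as.Date(z$since)<=as.Date(z$date),]
# most recent "since" first
z <- z[order(as.Date(z$since), decreasing=TRUE),]
# the first row per voy_id is now the latest applicable record
z <- z[!duplicated(z$voy_id),]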
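The same steps can be wrapped into a small function that attaches max_speed back onto voyages. This is only a sketch under some assumptions: add_max_speed is a hypothetical helper name, the dates are converted with as.Date() up front, and the final left join (all.x = TRUE) is an addition so that a voyage with no applicable speed record would be kept with NA rather than dropped.

```r
# Sample data from the question, with dates stored as Date objects
cars <- data.frame(
  car_id    = c(1, 2, 2, 3, 4, 5, 5),
  max_speed = c(150, 180, 185, 200, 210, 230, 235),
  since     = as.Date(c("2000-01-01", "2000-01-01", "2007-10-01", "2000-01-01",
                        "2000-01-01", "2000-01-01", "2009-11-18"))
)
voyages <- data.frame(
  voy_id = c(1234, 1235, 1236, 1237, 1238),
  car_id = c(1, 2, 3, 4, 5),
  date   = as.Date(c("2000-01-01", "2002-02-02", "2003-03-03",
                     "2004-04-04", "2010-05-05"))
)

# add_max_speed() is a hypothetical helper, not part of any package
add_max_speed <- function(voyages, cars) {
  z <- merge(voyages, cars, by = "car_id")          # all voyage/record pairs
  z <- z[z$since <= z$date, ]                       # records already in effect
  z <- z[order(z$since, decreasing = TRUE), ]       # most recent record first
  z <- z[!duplicated(z$voy_id), c("voy_id", "max_speed")]
  # left join back so every voyage is kept, sorted by voy_id
  merge(voyages, z, by = "voy_id", all.x = TRUE)
}

result <- add_max_speed(voyages, cars)
result$max_speed   # 150 180 200 210 235
```

Voyage 1235 (car 2, 2002-02-02) gets 180 because the 185 record only starts in 2007, while voyage 1238 (car 5, 2010-05-05) gets 235 because it takes place after the 2009-11-18 change.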
Also curious to see if someone comes up with a more elegant, parsimonious approach.