
Merge dataframes by rounding the dates

I would like to merge two dataframes by group and date, but their dates may differ. When a group-date pair has no exact match, I would like to "round" the dates so that each value in the second dataframe is matched to the row of the first dataframe with the nearest date in the same group.

To be clearer, here's an example:

library(dplyr)

data1 <- tibble(
  group = rep(c("A", "B"), each = 3),
  date = c(2002, 2005, 2010, 2001, 2004, 2009),
  variable_1 = c("Thing_1", "Thing_1", "Thing_2", "Thing_1", "Thing_2", "Thing_1")
)

# A tibble: 6 x 3
  group  date variable_1
  <chr> <dbl> <chr>     
1 A      2002 Thing_1   
2 A      2005 Thing_1   
3 A      2010 Thing_2   
4 B      2001 Thing_1   
5 B      2004 Thing_2   
6 B      2009 Thing_1   

data2 <- tibble(
  group = rep(c("A", "B"), each = 2),
  date = c(2007, 2008, 2001, 2010),
  variable_2 = c("Else_1", "Else_2", "Else_2", "Else_1")
)

  group  date variable_2
  <chr> <dbl> <chr>     
1 A      2007 Else_1    
2 A      2008 Else_2    
3 B      2001 Else_2    
4 B      2010 Else_1    

In group A, for example, the dates differ: 2002, 2005 and 2010 in data1; 2007 and 2008 in data2. Since no exact match is possible, I would like to "round" the dates. The value where data2$date is 2007 should be matched with the row where data1$date is 2005, since 2005 is the closest value to 2007. Similarly, the value where data2$date is 2008 should be matched with the row where data1$date is 2010.

Same thing for group B.

Here's the expected output:

# A tibble: 6 x 4
  group  date variable_1 variable_2
  <chr> <dbl> <chr>      <chr>     
1 A      2002 Thing_1    NA        
2 A      2005 Thing_1    Else_1    
3 A      2010 Thing_2    Else_2    
4 B      2001 Thing_1    Else_2    
5 B      2004 Thing_2    NA        
6 B      2009 Thing_1    Else_1    

How can I do this?

One option is some arithmetic inside a Map approach. Since the dates are numeric, rounding them to the nearest multiple of five is straightforward: round(date / 5) / .2. We do this in both data frames, split by group, and then use match on the rounded values.

res <- do.call(rbind, Map(function(x, y) {
  transform(x, variable_2=y$variable_2[
    match(round(x$date / 5)/.2, round(y$date / 5)/.2)
    ])},
  split(data1, data1$group), split(data2, data2$group)))
res
#     group date variable_1 variable_2
# A.1     A 2002    Thing_1       <NA>
# A.2     A 2005    Thing_1     Else_1
# A.3     A 2010    Thing_2     Else_2
# B.4     B 2001    Thing_1     Else_2
# B.5     B 2004    Thing_2       <NA>
# B.6     B 2009    Thing_1     Else_1
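
To make the arithmetic above easier to follow, here is a minimal sketch on the group A dates only. The helper name round_to_5 is made up for illustration; it uses round(x / 5) * 5, which is the same idea as round(x / 5) / .2. Note that this matches dates falling in the same five-year bucket. That coincides with nearest-date matching on the example data, but it is not guaranteed in general: 2002 and 2003 are one year apart, yet they round to 2000 and 2005.

```r
# Hypothetical helper: snap a numeric year to the nearest multiple of five.
round_to_5 <- function(x) round(x / 5) * 5

dates_1 <- c(2002, 2005, 2010)  # group A dates in data1
dates_2 <- c(2007, 2008)        # group A dates in data2

round_to_5(dates_1)  # 2000 2005 2010
round_to_5(dates_2)  # 2005 2010

# For each date in dates_1, the position of the dates_2 entry that lands in
# the same bucket (NA when no entry of dates_2 shares the bucket).
idx <- match(round_to_5(dates_1), round_to_5(dates_2))
idx  # NA 1 2
```

If a truly nearest match is needed regardless of bucket boundaries, a rolling join (as in the data.table answer) is probably the safer choice.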

You can use the data.table package and its rolling joins; roll = "nearest" might help.

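library(data.table)

data1 <- as.data.table(data1)
data2 <- as.data.table(data2)

# Keep a single group so the rolling join only compares dates within it
data_a <- data1[group == "A"]
data_b <- data2[group == "A"]

# For each row of data_b, pick the row of data_a with the nearest date
data <- data_a[data_b, on = "date", roll = "nearest"]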
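
A rolling join can also handle all groups at once. The following is only a sketch, assuming the data.table package is installed; the names dt1, dt2 and nearest_row are illustrative and not from the question. It joins on group exactly and rolls on date, then writes variable_2 back onto the matched rows of the first table, so the result keeps the shape of data1.

```r
library(data.table)

# Same example data as in the question, rebuilt here so the sketch is self-contained.
dt1 <- data.table(
  group = rep(c("A", "B"), each = 3),
  date = c(2002, 2005, 2010, 2001, 2004, 2009),
  variable_1 = c("Thing_1", "Thing_1", "Thing_2", "Thing_1", "Thing_2", "Thing_1")
)
dt2 <- data.table(
  group = rep(c("A", "B"), each = 2),
  date = c(2007, 2008, 2001, 2010),
  variable_2 = c("Else_1", "Else_2", "Else_2", "Else_1")
)

# For each row of dt2, find the dt1 row in the same group with the nearest date.
# which = TRUE returns the matching row numbers of dt1 instead of the joined data.
nearest_row <- dt1[dt2, on = .(group, date), roll = "nearest", which = TRUE]

# Copy variable_2 onto the matched dt1 rows; rows without a match stay NA.
dt1[nearest_row, variable_2 := dt2$variable_2]
dt1
```

One design caveat: if two rows of dt2 are nearest to the same row of dt1, the later one overwrites the earlier one in this sketch, so data with such collisions would need an explicit rule for which value to keep.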
