
Merge 2 dataframes by matching dates

I have two dataframes:

id      dates
MUM-1  2015-07-10
MUM-1  2015-07-11
MUM-1  2015-07-12
MUM-2  2014-01-14
MUM-2  2014-01-15
MUM-2  2014-01-16
MUM-2  2014-01-17

and:

id      dates      field1  field2
MUM-1  2015-07-10     1       0
MUM-1  2015-07-12     2       1
MUM-2  2014-01-14     4       3
MUM-2  2014-01-17     0       1

merged data:

id      dates        field1   field2
MUM-1  2015-07-10      1         0
MUM-1  2015-07-11      na        na
MUM-1  2015-07-12      2         1
MUM-2  2014-01-14      4         3
MUM-2  2014-01-15      na        na
MUM-2  2014-01-16      na        na
MUM-2  2014-01-17      0         1   

code: merge(x = df1, y = df2, by = 'id', all.x = TRUE)

I am using merge, but since both dataframes are very large, it is taking too long to process. Is there an alternative to the merge function, maybe in dplyr, that would be faster? Both dataframes have more than 900K rows.
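
For reference, the example data above can be reconstructed like this (a minimal sketch; storing the dates as Date class is an assumption):

df1 <- data.frame(id    = c(rep("MUM-1", 3), rep("MUM-2", 4)),
                  dates = as.Date(c("2015-07-10", "2015-07-11", "2015-07-12",
                                    "2014-01-14", "2014-01-15", "2014-01-16",
                                    "2014-01-17")))
df2 <- data.frame(id     = c("MUM-1", "MUM-1", "MUM-2", "MUM-2"),
                  dates  = as.Date(c("2015-07-10", "2015-07-12",
                                     "2014-01-14", "2014-01-17")),
                  field1 = c(1, 2, 4, 0),
                  field2 = c(0, 1, 3, 1))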

Instead of using merge with data.table, you can also simply join as follows:

library(data.table)
setDT(df1)
setDT(df2)

df2[df1, on = c('id','dates')]

this gives:

> df2[df1, on = c('id','dates')]
      id      dates field1 field2
1: MUM-1 2015-07-10      1      0
2: MUM-1 2015-07-11     NA     NA
3: MUM-1 2015-07-12      2      1
4: MUM-2 2014-01-14      4      3
5: MUM-2 2014-01-15     NA     NA
6: MUM-2 2014-01-16     NA     NA
7: MUM-2 2014-01-17      0      1
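
Alternatively, you can set the keys once up front and omit the on = part in subsequent joins (note that setkey also physically sorts the tables by the key columns):

setkey(df1, id, dates)
setkey(df2, id, dates)

df2[df1]   # same result as above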

Doing this with dplyr:

library(dplyr)
dplr <- left_join(df1, df2, by=c("id","dates"))

As mentioned by @Arun in the comments, a benchmark is not very meaningful on a small dataset with seven rows. So let's create some bigger datasets:

dt1 <- data.table(id=gl(2, 730, labels = c("MUM-1", "MUM-2")),
                  dates=c(seq(as.Date("2010-01-01"), as.Date("2011-12-31"), by="days"),
                          seq(as.Date("2013-01-01"), as.Date("2014-12-31"), by="days")))
dt2 <- data.table(id=gl(2, 730, labels = c("MUM-1", "MUM-2")),
                  dates=c(seq(as.Date("2010-01-01"), as.Date("2011-12-31"), by="days"),
                          seq(as.Date("2013-01-01"), as.Date("2014-12-31"), by="days")),
                  field1=sample(c(0,1,2,3,4), size=1460, replace = TRUE),
                  field2=sample(c(0,1,2,3,4), size=1460, replace = TRUE))
dt2 <- dt2[sample(nrow(dt2), 800)] # keep a random 800-row subset so dt2 has gaps

As can be seen, @Arun's approach is slightly faster:

library(rbenchmark)
benchmark(replications = 10, order = "elapsed", columns = c("test", "elapsed", "relative"),
          jaap = dt2[dt1, on = c('id','dates')],
          pavo = merge(dt1,dt2,by="id",allow.cartesian=T),
          dplr = left_join(dt1, dt2, by=c("id","dates")),
          arun = dt1[dt2, c("field1", "field2") := .(field1, field2), on=c("id", "dates")])

  test elapsed relative
4 arun   0.015    1.000
1 jaap   0.016    1.067
3 dplr   0.037    2.467
2 pavo   1.033   68.867

For a comparison on a large dataset, see @Arun's answer below.

I'd update df1 directly by reference as follows:

require(data.table) # v1.9.5+
setDT(df1)[df2, c("field1", "field2") := 
                .(field1, field2), on=c("id", "dates")]

> df1
#       id      dates field1 field2
# 1: MUM-1 2015-07-10      1      0
# 2: MUM-1 2015-07-11     NA     NA
# 3: MUM-1 2015-07-12      2      1
# 4: MUM-2 2014-01-14      4      3
# 5: MUM-2 2014-01-15     NA     NA
# 6: MUM-2 2014-01-16     NA     NA
# 7: MUM-2 2014-01-17      0      1

This'd be very memory efficient (and faster), as it doesn't copy the entire object just to add two columns; instead, it updates df1 in place.
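
If you want to verify that no copy is made, you can compare the object's address before and after the update (a small sketch using data.table's address()):

library(data.table)
setDT(df1)
before <- address(df1)
df1[df2, c("field1", "field2") := .(field1, field2), on = c("id", "dates")]
identical(before, address(df1))   # TRUE: df1 was modified in place, not copied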


Updated with a slightly bigger dataset than @Jaap's updated benchmark:

set.seed(1L)
# CJ() builds the full cross join: 1e4 ids x 1e3 values = 10 million rows
dt1 = CJ(id1 = paste("MUM", 1:1e4, sep = "-"), id2 = sample(1e3L))
# dt2 is a random 100,000-row subset of dt1 with two fields added
dt2 = dt1[sample(nrow(dt1), 1e5L)][, c("field1", "field2") := lapply(c(1e3L, 1e4L), sample, 1e5L, TRUE)][]

# @Jaap's answers
system.time(ans1 <- setDT(dt2)[dt1, on = c('id1','id2')])
#    user  system elapsed 
#   0.209   0.067   0.277 
system.time(ans2 <- left_join(setDF(dt1), setDF(dt2), by = c("id1", "id2")))
#    user  system elapsed 
# 119.911   0.530 120.749 

# this answer    
system.time(ans3 <- setDT(dt1)[dt2, c("field1", "field2") := list(field1, field2), on = c("id1", "id2")])
#    user  system elapsed 
#   0.087   0.013   0.100 

sessionInfo()
# R version 3.2.1 (2015-06-18)
# Platform: x86_64-apple-darwin13.4.0 (64-bit)
# Running under: OS X 10.10.4 (Yosemite)

# locale:
# [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

# attached base packages:
# [1] stats     graphics  grDevices utils     datasets  methods   base     

# other attached packages:
# [1] data.table_1.9.5 dplyr_0.4.2     

# loaded via a namespace (and not attached):
# [1] magrittr_1.5   R6_2.1.0       assertthat_0.1 parallel_3.2.1 DBI_0.3.1     
# [6] tools_3.2.1    Rcpp_0.12.0    chron_2.3-45  

Didn't expect dplyr to be ~1200x slower though.

You can convert both data frames to data tables, and then perform a merge:

library(data.table)
setDT(df1); setDT(df2)

merge(df1, df2, by = "id", allow.cartesian = TRUE)

The allow.cartesian part allows the merge when there are repeated values in the key of either of the merged tables, letting the resulting table be longer than the larger of the original ones (see ?data.table).
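
As a small illustration of why the flag exists (toy data for demonstration): with duplicated keys on both sides, data.table aborts the merge once the result would exceed nrow(x) + nrow(i) rows, unless you explicitly allow it:

a <- data.table(id = rep("x", 3), v1 = 1:3)
b <- data.table(id = rep("x", 3), v2 = 4:6)

# merge(a, b, by = "id") errors here: the 3 x 3 = 9-row result exceeds nrow(a) + nrow(b)
merge(a, b, by = "id", allow.cartesian = TRUE)   # returns all 9 pairings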

I think the quickest solution at the moment for those cases (huge datasets) is a data.table merge after setting the keys first.

You can also use dplyr's left_join with data.frames, as mentioned before, but it is worth comparing the same command after transforming your data.frames to data.tables. In other words, use dplyr with a data.table structure in the background.

As an example, I'll create two datasets, then save each as a data.frame, as a data.table with a key, and as a data.table without a key. Then I'll perform the various merges and time them:

library(data.table)
library(dplyr)

# create and save this dataset as a data.frame and as a data.table
# (note: the column is named "list", which shadows base::list but still works)
list = seq(1, 500000)
random_number = rnorm(500000, 10, 5)

dataT11 = data.table(list, random_number, key="list") # data.table with a key
dataT12 = data.table(list, random_number) # data.table without key
dataF1 = data.frame(list, random_number)

# create and save this dataset as a data.frame and as a data.table
list = seq(1,500000)
random_number = rnorm(500000,10,5)

dataT21 = data.table(list, random_number, key="list")
dataT22 = data.table(list, random_number)
dataF2 = data.frame(list, random_number)


# check your current data tables (note some have keys)
tables()


# merge the datasets as data.frames and count time
ptm <- proc.time()
dataF3 = merge(dataF1, dataF2, all.x=T)
proc.time() - ptm


# merge the datasets as data.tables by setting the key now and count time
ptm <- proc.time()
dataT3 = merge(dataT12, dataT22, all.x=T, by = "list")
proc.time() - ptm


# merge the datasets as data.tables on the key they have already and count time
ptm <- proc.time()
dataT3 = merge(dataT11, dataT21, all.x=T)
proc.time() - ptm

# merge the datasets as data.tables on the key they have already and count time (alternative)
ptm <- proc.time()
dataT3 = dataT11[dataT21]
proc.time() - ptm



# merge the datasets as data.frames using dplyr and count time
ptm <- proc.time()
dataT3 = dataF1 %>% left_join(dataF2, by="list")
proc.time() - ptm


# merge the datasets as data.tables using dplyr and count time
ptm <- proc.time()
dataT3 = dataT11 %>% left_join(dataT21, by="list")
proc.time() - ptm
