简体   繁体   English

通过匹配日期合并 2 个数据框

[英]Merge 2 dataframes by matching dates

I have two dataframes:我有两个数据框:

id      dates
MUM-1  2015-07-10
MUM-1  2015-07-11
MUM-1  2015-07-12
MUM-2  2014-01-14
MUM-2  2014-01-15
MUM-2  2014-01-16
MUM-2  2014-01-17

and:和:

id      dates      field1  field2
MUM-1  2015-07-10     1       0
MUM-1  2015-07-12     2       1
MUM-2  2014-01-14     4       3
MUM-2  2014-01-17     0       1

merged data:合并数据:

id      dates        field1   field2
MUM-1  2015-07-10      1         0
MUM-1  2015-07-11      na        na
MUM-1  2015-07-12      2         1
MUM-2  2014-01-14      4         3
MUM-2  2014-01-15      na        na
MUM-2  2014-01-16      na        na
MUM-2  2014-01-17      0         1   

code: merge(x= df1, y= df2, by= 'id', all.x= T)代码: merge(x= df1, y= df2, by= 'id', all.x= T)

I am using merge but since the size of both dataframes are too huge, it is taking too long to process.我正在使用合并,但由于两个数据帧的大小都太大,处理时间太长。 Is there any alternative to the merge function?合并功能有什么替代方法吗? Maybe in dplyr?也许在 dplyr 中? So that it processes fast in comparision.因此它的处理速度比较快。 Both dataframes have more than 900K rows.两个数据帧都有超过 90 万行。

Instead of using merge with data.table , you can also simply join as follows:除了使用mergedata.table ,您还可以简单地加入如下:

setDT(df1)
setDT(df2)

df2[df1, on = c('id','dates')]

this gives:这给出:

> df2[df1]
      id      dates field1 field2
1: MUM-1 2015-07-10      1      0
2: MUM-1 2015-07-11     NA     NA
3: MUM-1 2015-07-12      2      1
4: MUM-2 2014-01-14      4      3
5: MUM-2 2014-01-15     NA     NA
6: MUM-2 2014-01-16     NA     NA
7: MUM-2 2014-01-17      0      1

Doing this with dplyr :使用dplyr执行此dplyr

library(dplyr)
dplr <- left_join(df1, df2, by=c("id","dates"))

As mentioned by @Arun in the comments, a benchmark is not very meaningfull on a small dataset with seven rows.正如@Arun 在评论中所提到的,基准测试在具有 7 行的小型数据集上意义不大。 So lets create some bigger datasets:所以让我们创建一些更大的数据集:

dt1 <- data.table(id=gl(2, 730, labels = c("MUM-1", "MUM-2")),
                  dates=c(seq(as.Date("2010-01-01"), as.Date("2011-12-31"), by="days"),
                          seq(as.Date("2013-01-01"), as.Date("2014-12-31"), by="days")))
dt2 <- data.table(id=gl(2, 730, labels = c("MUM-1", "MUM-2")),
                  dates=c(seq(as.Date("2010-01-01"), as.Date("2011-12-31"), by="days"),
                          seq(as.Date("2013-01-01"), as.Date("2014-12-31"), by="days")),
                  field1=sample(c(0,1,2,3,4), size=730, replace = TRUE),
                  field2=sample(c(0,1,2,3,4), size=730, replace = TRUE))
dt2 <- dt2[sample(nrow(dt2), 800)]

As can be seen, @Arun's approach is slightly faster:可以看出,@Arun 的方法稍微快了一点:

library(rbenchmark)
benchmark(replications = 10, order = "elapsed", columns = c("test", "elapsed", "relative"),
          jaap = dt2[dt1, on = c('id','dates')],
          pavo = merge(dt1,dt2,by="id",allow.cartesian=T),
          dplr = left_join(dt1, dt2, by=c("id","dates")),
          arun = dt1[dt2, c("fiedl1", "field2") := .(field1, field2), on=c("id", "dates")])

  test elapsed relative
4 arun   0.015    1.000
1 jaap   0.016    1.067
3 dplr   0.037    2.467
2 pavo   1.033   68.867

For a comparison on a large dataset, see the answer of @Arun .有关大型数据集的比较,请参阅@Arun 的答案

I'd update df1 directly by reference as follows:我将通过参考直接更新df1如下:

require(data.table) # v1.9.5+
setDT(df1)[df2, c("fiedl1", "field2") := 
                .(field1, field2), on=c("id", "dates")]

> df1
#       id      dates fiedl1 field2
# 1: MUM-1 2015-07-10      1      0
# 2: MUM-1 2015-07-11     NA     NA
# 3: MUM-1 2015-07-12      2      1
# 4: MUM-2 2014-01-14      4      3
# 5: MUM-2 2014-01-15     NA     NA
# 6: MUM-2 2014-01-16     NA     NA
# 7: MUM-2 2014-01-17      0      1

This'd be very memory efficient (and faster), as it doesn't copy the entire object just to add two columns, rather updates in place.这将非常节省内存(并且速度更快),因为它不会复制整个对象只是为了添加两列,而是就地更新。


Updated with a slightly bigger dataset than @Jaap's updated benchmark:更新了比@Jaap 更新的基准稍大的数据集:

set.seed(1L)
dt1 = CJ(id1 = paste("MUM", 1:1e4, sep = "-"), id2 = sample(1e3L))
dt2 = dt1[sample(nrow(dt1), 1e5L)][, c("field1", "field2") := lapply(c(1e3L, 1e4L), sample, 1e5L, TRUE)][]

# @Jaap's answers
system.time(ans1 <- setDT(dt2)[dt1, on = c('id1','id2')])
#    user  system elapsed 
#   0.209   0.067   0.277 
system.time(ans2 <- left_join(setDF(dt1), setDF(dt2), by = c("id1", "id2")))
#    user  system elapsed 
# 119.911   0.530 120.749 

# this answer    
system.time(ans3 <- setDT(dt1)[dt2, c("field1", "field2") := list(field1, field2), on = c("id1", "id2")])
#    user  system elapsed 
#   0.087   0.013   0.100 

sessionInfo()
# R version 3.2.1 (2015-06-18)
# Platform: x86_64-apple-darwin13.4.0 (64-bit)
# Running under: OS X 10.10.4 (Yosemite)

# locale:
# [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

# attached base packages:
# [1] stats     graphics  grDevices utils     datasets  methods   base     

# other attached packages:
# [1] data.table_1.9.5 dplyr_0.4.2     

# loaded via a namespace (and not attached):
# [1] magrittr_1.5   R6_2.1.0       assertthat_0.1 parallel_3.2.1 DBI_0.3.1     
# [6] tools_3.2.1    Rcpp_0.12.0    chron_2.3-45  

Dint expect dplyr to be ~1200x slower though. dplyr ,Dint 预计dplyr会慢 1200 倍。

You can convert both data frames to data tables, and then perform a merge:您可以将两个数据框都转换为数据表,然后执行合并:

library(data.table)
setDT(df1); setDT(df2)

merge(df1, df2, by = "id", allow.cartesian = TRUE)

the allow.cartesian part allows a merge when there are repeated values in the key of any of the merged elements (allowing for a new table length greater than the max of the orignal elements, see ?data.table .当任何合并元素的键中存在重复值时, allow.cartesian部分允许合并(允许新表长度大于原始元素的最大值,请参阅?data.table

I think the quickest solution at the moment for those cases (huge datasets) is the data.table merging after setting the keys first.我认为目前对于这些情况(巨大的数据集)最快的解决方案是设置键后合并data.table

You can also use dplyr 's left_join with data.frames , as mentioned before, but it would be good to compare the same command after transforming your data.frames to data.tables .如前所述,您还可以将dplyrleft_joindata.frames left_join使用,但最好在将data.frames转换为data.tables后比较相同的命令。 In other words, use dplyr with a data.table structure in the background.换句话说,使用dplyr在后台一个data.table结构。

As an example I'll create two datasets, then save them as a data.frame, data.table with a key and data.table without a key.作为示例,我将创建两个数据集,然后将它们保存为 data.frame、带键的 data.table 和不带键的 data.table。 Then I'll perform various merges and count the time:然后我将执行各种合并并计算时间:

library(data.table)
library(dplyr)

# create and save this dataset as a data.frame and as a data.table
list = seq(1,500000)
random_number = rnorm(500000,10,5)

dataT11 = data.table(list, random_number, key="list") # data.table with a key
dataT12 = data.table(list, random_number) # data.table without key
dataF1 = data.frame(list, random_number)

# create and save this dataset as a data.frame and as a data.table
list = seq(1,500000)
random_number = rnorm(500000,10,5)

dataT21 = data.table(list, random_number, key="list")
dataT22 = data.table(list, random_number)
dataF2 = data.frame(list, random_number)


# check your current data tables (note some have keys)
tables()


# merge the datasets as data.frames and count time
ptm <- proc.time()
dataF3 = merge(dataF1, dataF2, all.x=T)
proc.time() - ptm


# merge the datasets as data.tables by setting the key now and count time
ptm <- proc.time()
dataT3 = merge(dataT12, dataT22, all.x=T, by = "list")
proc.time() - ptm


# merge the datasets as data.tables on the key they have already and count time
ptm <- proc.time()
dataT3 = merge(dataT11, dataT21, all.x=T)
proc.time() - ptm

# merge the datasets as data.tables on the key they have already and count time (alternative)
ptm <- proc.time()
dataT3 = dataT11[dataT21]
proc.time() - ptm



# merge the datasets as data.frames using dplyr and count time
ptm <- proc.time()
dataT3 = dataF1 %>% left_join(dataF2, by="list")
proc.time() - ptm


# merge the datasets as data.tables using dplyr and count time
ptm <- proc.time()
dataT3 = dataT11 %>% left_join(dataT21, by="list")
proc.time() - ptm

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM