Doing a "fuzzy" and non-fuzzy, many-to-one merge with data.table

Let's assume I have two datasets, dfA and dfB. One has individual observations and the other has country-level data (which applies to multiple observations from the same year and country). For each of them I have created a key called matchcode, which is a combination of a country code and a year.
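For illustration only, a key like this could be built as in the minimal sketch below. The table `toy` and its values are made up here and are not the data from the question.

```r
library(data.table)

# Made-up toy table (assumption for illustration, not the question's data)
toy <- data.table(iso = c("NLD", "AUS", "USA"),
                  year = c(2010L, 2006L, 2008L))

# The key is simply the country code pasted onto the year
toy[, matchcode := paste0(iso, year)]
toy$matchcode
```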

   dfA <- read.table(
  text = "A   B   C   D   E   F   G   iso   year   matchcode
  1   0   1   1   1   0   1   0   NLD   2010   NLD2010
  2   1   0   0   0   1   0   1   NLD   2014   NLD2014
  3   0   0   0   1   1   0   0   AUS   2010   AUS2010
  4   1   0   1   0   0   1   0   AUS   2006   AUS2006
  5   0   1   0   1   0   1   1   USA   2008   USA2008
  6   0   0   1   0   0   0   1   USA   2010   USA2010
  7   0   1   0   1   0   0   0   USA   2012   USA2012
  8   1   0   1   0   0   1   0   BLG   2008   BLG2008
  9   0   1   0   1   1   0   1   BEL   2008   BEL2008
  10  1   0   1   0   0   1   0   BEL   2010   BEL2010
  11  0   1   1   1   0   1   0   NLD   2010   NLD2010
  12  1   0   0   0   1   0   1   NLD   2014   NLD2014
  13  0   0   0   1   1   0   0   AUS   2010   AUS2010
  14  1   0   1   0   0   1   0   AUS   2006   AUS2006
  15  0   1   0   1   0   1   1   USA   2008   USA2008
  16  0   0   1   0   0   0   1   USA   2010   USA2010
  17  0   1   0   1   0   0   0   USA   2012   USA2012
  18  1   0   1   0   0   1   0   BLG   2008   BLG2008
  19  0   1   0   1   1   0   1   BEL   2008   BEL2008
  20  1   0   1   0   0   1   0   BEL   2010   BEL2010",
  header = TRUE
)

   dfB <- read.table(
  text = "A   B   C   D   H   I   J   iso   year   matchcode
  1   0   1   1   1   0   1   0   NLD   2009   NLD2009
  2   1   0   0   0   1   0   1   NLD   2014   NLD2014
  3   0   0   0   1   1   0   0   AUS   2011   AUS2011
  4   1   0   1   0   0   1   0   AUS   2007   AUS2007
  5   0   1   0   1   0   1   1   USA   2007   USA2007
  6   0   0   1   0   0   0   1   USA   2011   USA2010
  7   0   1   0   1   0   0   0   USA   2013   USA2013
  8   1   0   1   0   0   1   0   BLG   2007   BLG2007
  9   0   1   0   1   1   0   1   BEL   2009   BEL2009
  10   1   0   1   0   0   1   0  BEL   2012   BEL2012",
  header = TRUE
)

library(data.table)
setDT(dfA)
setDT(dfB)

Usually, when I merge these datasets, I simply do:

dfA<- merge(dfA, dfB, by= "matchcode", all.x = TRUE, allow.cartesian=FALSE)
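As a rough sketch of how such an exact merge behaves, the toy tables `obs` and `ctry` below (made-up names and values, not the data above) show that every observation is kept, and that a row whose matchcode has no exact counterpart gets NA in the country columns.

```r
library(data.table)

# Made-up individual-level and country-level tables (assumptions for illustration)
obs  <- data.table(id = 1:3, matchcode = c("NLD2010", "NLD2014", "USA2008"))
ctry <- data.table(matchcode = c("NLD2014", "USA2007"), H = c(1L, 0L))

# Exact left join on the key: every row of obs is kept
exact <- merge(obs, ctry, by = "matchcode", all.x = TRUE)

nrow(exact)          # same as nrow(obs)
sum(is.na(exact$H))  # rows without an exact country-year counterpart
```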

The problem is that sometimes the years do not match exactly. So I tried:

dfA <- dfA[dfB, on = .(iso, year), roll = "nearest", nomatch = 0]

But this reduces the number of observations to 11.

# A tibble: 11 x 18
       A     B     C     D     E     F     G iso    year matchcode     K     L     M     N     O     P     Q i.matchcode
   <int> <int> <int> <int> <int> <int> <int> <fct> <int> <fct>     <int> <int> <int> <int> <int> <int> <int> <fct>      
 1     0     1     1     1     0     1     0 NLD    2009 NLD2010       0     1     1     1     0     1     0 NLD2009    
 2     1     0     0     0     1     0     1 NLD    2014 NLD2014       1     0     0     0     1     0     1 NLD2014    
 3     1     0     0     0     1     0     1 NLD    2014 NLD2014       1     0     0     0     1     0     1 NLD2014    
 4     0     0     0     1     1     0     0 AUS    2011 AUS2010       0     0     0     1     1     0     0 AUS2011    
 5     1     0     1     0     0     1     0 AUS    2007 AUS2006       1     0     1     0     0     1     0 AUS2007    
 6     0     1     0     1     0     1     1 USA    2007 USA2008       0     1     0     1     0     1     1 USA2007    
 7     0     0     1     0     0     0     1 USA    2011 USA2010       0     0     1     0     0     0     1 USA2010    
 8     0     1     0     1     0     0     0 USA    2013 USA2012       0     1     0     1     0     0     0 USA2013    
 9     1     0     1     0     0     1     0 BLG    2007 BLG2008       1     0     1     0     0     1     0 BLG2007    
10     0     1     0     1     1     0     1 BEL    2009 BEL2008       0     1     0     1     1     0     1 BEL2009    
11     1     0     1     0     0     1     0 BEL    2012 BEL2010       1     0     1     0     0     1     0 BEL2012   
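A likely reason for the drop in rows is that in data.table's `x[i]` syntax the result is driven by the rows of `i`, so putting the country table in the `i` position returns roughly one row per country-year rather than one per observation. The sketch below uses made-up tables `obs` and `ctry` (assumptions, not the data above) to compare both directions.

```r
library(data.table)

# Made-up tables: several observations per country-year, one country row each
obs  <- data.table(id = 1:4,
                   iso = c("NLD", "NLD", "USA", "USA"),
                   year = c(2010L, 2010L, 2012L, 2012L))
ctry <- data.table(iso = c("NLD", "USA"),
                   year = c(2009L, 2012L),
                   H = c(1L, 0L))

# obs is i here, so every observation gets a row, matched to the nearest year
by_obs <- ctry[obs, on = .(iso, year), roll = "nearest"]

# ctry is i here, so the result is driven by the country rows instead
by_ctry <- obs[ctry, on = .(iso, year), roll = "nearest", nomatch = 0]

nrow(by_obs)
nrow(by_ctry)
```

With the observation table in the `i` position, the row count equals the number of observations, which is what the preferred output asks for.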

The preferred output would be as follows:

#    A B C D E F G iso year matchcodeA H I J matchcodeB
# 1: 1 0 0 0 1 0 1 NLD  2014  NLD2014  1 0 1    NLD2014
# 2: 0 0 0 1 1 0 0 AUS  2011  AUS2010  1 0 0    AUS2011
# 3: 1 0 1 0 0 1 0 AUS  2007  AUS2006  0 1 0    AUS2007
# 4: 0 0 1 0 0 0 1 USA  2011  USA2010  0 0 1    USA2010
# 5: 0 1 0 1 0 0 0 USA  2013  USA2012  0 0 0    USA2013
# 6: 0 1 0 1 1 0 1 BEL  2009  BEL2008  1 0 1    BEL2009
# 7: 0 1 1 1 0 1 0 NLD  2009  NLD2010  0 1 0    NLD2009
# 8: 0 1 0 1 0 1 1 USA  2007  USA2008  0 1 1    USA2007
# 9: 0 1 0 1 0 0 0 USA  2011  USA2012  0 0 1    USA2010
#10: 1 0 1 0 0 1 0 BEL  2009  BEL2010  1 0 1    BEL2009
#11: 1 0 0 0 1 0 1 NLD  2014  NLD2014  1 0 1    NLD2014
#12: 0 0 0 1 1 0 0 AUS  2011  AUS2010  1 0 0    AUS2011
#13: 1 0 1 0 0 1 0 AUS  2007  AUS2006  0 1 0    AUS2007
#14: 0 0 1 0 0 0 1 USA  2011  USA2010  0 0 1    USA2010
#15: 0 1 0 1 0 0 0 USA  2013  USA2012  0 0 0    USA2013
#16: 0 1 0 1 1 0 1 BEL  2009  BEL2008  1 0 1    BEL2009
#17: 0 1 1 1 0 1 0 NLD  2009  NLD2010  0 1 0    NLD2009
#18: 0 1 0 1 0 1 1 USA  2007  USA2008  0 1 1    USA2007
#19: 0 1 0 1 0 0 0 USA  2011  USA2012  0 0 1    USA2010
#20: 1 0 1 0 0 1 0 BEL  2009  BEL2010  1 0 1    BEL2009

Additional sources:

1. The previous try

2. The try before that

Here is my (default) approach for a join like this, using data.table.

code

library(data.table)

# rename the key columns so that both versions survive the join
setnames(dfA, c("matchcode", "iso", "year"), c("matchcodeA", "isoA", "yearA"))
setnames(dfB, c("matchcode", "iso", "year"), c("matchcodeB", "isoB", "yearB"))

# store the desired column order for the end
namesA <- names(dfA)
namesB <- setdiff(names(dfB), names(dfA))
colorder <- c(namesA, namesB)

# create dedicated columns to join on
dfA[, `:=`(iso.join = isoA, year.join = yearA)]
dfB[, `:=`(iso.join = isoB, year.join = yearB)]

# left join: keep every row of dfA and, within each iso, roll to the nearest year in dfB
result <- dfB[dfA, on = c("iso.join", "year.join"), roll = "nearest"]

# drop the i.-prefixed duplicates (dfA's copies of A-D) and the helper join columns
result[, grep("^i\\.", names(result)) := NULL]
result[, grep("join$", names(result)) := NULL]

# set column order
setcolorder(result, colorder)

result

#     A B C D E F G isoA yearA matchcodeA H I J isoB yearB matchcodeB
#  1: 0 1 1 1 0 1 0  NLD  2010    NLD2010 0 1 0  NLD  2009    NLD2009
#  2: 1 0 0 0 1 0 1  NLD  2014    NLD2014 1 0 1  NLD  2014    NLD2014
#  3: 0 0 0 1 1 0 0  AUS  2010    AUS2010 1 0 0  AUS  2011    AUS2011
#  4: 1 0 1 0 0 1 0  AUS  2006    AUS2006 0 1 0  AUS  2007    AUS2007
#  5: 0 1 0 1 0 1 1  USA  2008    USA2008 0 1 1  USA  2007    USA2007
#  6: 0 0 1 0 0 0 1  USA  2010    USA2010 0 0 1  USA  2011    USA2010
#  7: 0 0 1 0 0 0 0  USA  2012    USA2012 0 0 1  USA  2011    USA2010
#  8: 1 0 1 0 0 1 0  BLG  2008    BLG2008 0 1 0  BLG  2007    BLG2007
#  9: 0 1 0 1 1 0 1  BEL  2008    BEL2008 1 0 1  BEL  2009    BEL2009
# 10: 0 1 0 1 0 1 0  BEL  2010    BEL2010 1 0 1  BEL  2009    BEL2009
# 11: 0 1 1 1 0 1 0  NLD  2010    NLD2010 0 1 0  NLD  2009    NLD2009
# 12: 1 0 0 0 1 0 1  NLD  2014    NLD2014 1 0 1  NLD  2014    NLD2014
# 13: 0 0 0 1 1 0 0  AUS  2010    AUS2010 1 0 0  AUS  2011    AUS2011
# 14: 1 0 1 0 0 1 0  AUS  2006    AUS2006 0 1 0  AUS  2007    AUS2007
# 15: 0 1 0 1 0 1 1  USA  2008    USA2008 0 1 1  USA  2007    USA2007
# 16: 0 0 1 0 0 0 1  USA  2010    USA2010 0 0 1  USA  2011    USA2010
# 17: 0 0 1 0 0 0 0  USA  2012    USA2012 0 0 1  USA  2011    USA2010
# 18: 1 0 1 0 0 1 0  BLG  2008    BLG2008 0 1 0  BLG  2007    BLG2007
# 19: 0 1 0 1 1 0 1  BEL  2008    BEL2008 1 0 1  BEL  2009    BEL2009
# 20: 0 1 0 1 0 1 0  BEL  2010    BEL2010 1 0 1  BEL  2009    BEL2009
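Because `roll = "nearest"` matches the closest year however far away it is, it may be worth checking how large the gaps are after a join like this. The sketch below is one possible way to do that, on made-up tables that only borrow the naming convention used above. `dfA_toy`, `dfB_toy`, `year_gap` and `max_gap` are hypothetical names, and the tolerance of 2 years is an arbitrary assumption.

```r
library(data.table)

# Made-up tables using the isoA/yearA and isoB/yearB naming convention
dfA_toy <- data.table(E = c(1L, 0L, 1L),
                      isoA = c("NLD", "USA", "BEL"),
                      yearA = c(2010L, 2012L, 2010L))
dfB_toy <- data.table(H = c(0L, 1L, 1L),
                      isoB = c("NLD", "USA", "BEL"),
                      yearB = c(2009L, 2012L, 2015L))

dfA_toy[, `:=`(iso.join = isoA, year.join = yearA)]
dfB_toy[, `:=`(iso.join = isoB, year.join = yearB)]

res <- dfB_toy[dfA_toy, on = c("iso.join", "year.join"), roll = "nearest"]

# distance in years between each observation and the country row it was matched to
res[, year_gap := abs(yearA - yearB)]

# hypothetical tolerance: blank out country data matched across more than 2 years
max_gap <- 2L
res[year_gap > max_gap, H := NA_integer_]
res
```

Whether to blank out, drop, or keep distant matches depends on the analysis. The point is only that the gap is easy to compute once both year columns are kept.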

sample data

dfA <- fread(
  "A   B   C   D   E   F   G   iso   year   matchcode
  0   1   1   1   0   1   0   NLD   2010   NLD2010
     1   0   0   0   1   0   1   NLD   2014   NLD2014
     0   0   0   1   1   0   0   AUS   2010   AUS2010
     1   0   1   0   0   1   0   AUS   2006   AUS2006
     0   1   0   1   0   1   1   USA   2008   USA2008
     0   0   1   0   0   0   1   USA   2010   USA2010
     0   1   0   1   0   0   0   USA   2012   USA2012
     1   0   1   0   0   1   0   BLG   2008   BLG2008
     0   1   0   1   1   0   1   BEL   2008   BEL2008
    1   0   1   0   0   1   0   BEL   2010   BEL2010
    0   1   1   1   0   1   0   NLD   2010   NLD2010
    1   0   0   0   1   0   1   NLD   2014   NLD2014
    0   0   0   1   1   0   0   AUS   2010   AUS2010
    1   0   1   0   0   1   0   AUS   2006   AUS2006
    0   1   0   1   0   1   1   USA   2008   USA2008
    0   0   1   0   0   0   1   USA   2010   USA2010
    0   1   0   1   0   0   0   USA   2012   USA2012
    1   0   1   0   0   1   0   BLG   2008   BLG2008
    0   1   0   1   1   0   1   BEL   2008   BEL2008
    1   0   1   0   0   1   0   BEL   2010   BEL2010",
  header = TRUE
)


dfB <- fread(
  "A   B   C   D   H   I   J   iso   year   matchcode
     0   1   1   1   0   1   0   NLD   2009   NLD2009
     1   0   0   0   1   0   1   NLD   2014   NLD2014
     0   0   0   1   1   0   0   AUS   2011   AUS2011
     1   0   1   0   0   1   0   AUS   2007   AUS2007
     0   1   0   1   0   1   1   USA   2007   USA2007
     0   0   1   0   0   0   1   USA   2011   USA2010
     0   1   0   1   0   0   0   USA   2013   USA2013
     1   0   1   0   0   1   0   BLG   2007   BLG2007
     0   1   0   1   1   0   1   BEL   2009   BEL2009
     1   0   1   0   0   1   0  BEL   2012   BEL2012",
  header = TRUE
)
