
"Fuzzy key matching" for a data.table merge

I'm trying to match workers from year to year using name strings and a measure of experience. Experience can increase by at most one from one year to the next, so I'd like to use this to help with matching when other metrics fail.

For example:

dt1<-data.table(name=c("jane doe","jane doe",
                       "john doe","jane smith"),
                exp=c(0.,5,1,2),id=1:4,key="name")
dt2<-data.table(name=c("jane doe","jane doe",
                       "john doe","jane smith"),
                exp=c(0,30,1.5,2),key="name")

I want to match the first "jane doe" in dt1 to the first "jane doe" in dt2. The second "jane doe" in each table should not match, because they are clearly different people (based on vastly different experience levels).

I'd also like to add some flags so that I know later on that these people were matched in this way. Here's my first pass:

dt2[dt1,`:=`(id=ifelse(exp<=i.exp+1,i.id,NA),
             flag=ifelse(exp<=i.exp+1,i.id,NA))]

But this is not working. Here's what it gives me:

> dt2
         name  exp id flag
1:   jane doe  0.0  2    2
2:   jane doe 30.0 NA   NA
3: jane smith  2.0  4    4
4:   john doe  1.5  3    3

It correctly leaves the latter "jane doe" unmatched, but it appears to have matched the first "jane doe" to the wrong prior "jane doe". I'm not quite sure why this is. In any case, it seems preferable to incorporate the matching on exp before joining rather than after, which would also clean up the ifelse mess in defining the new variables. Any suggestions?

For clarity, here's the desired output:

> dt2
         name  exp id flag
1:   jane doe  0.0  1    1
2:   jane doe 30.0 NA   NA
3: jane smith  2.0  4    1
4:   john doe  1.5  3    1

In your case the join isn't really "fuzzy". All you are trying to do is join on name and exp while allowing a distance of at most one year per match. This is a good use case for a rolling join with roll = -1L.

First, we key both data sets on name and exp:

setkey(dt1, name, exp) 
setkey(dt2, name, exp) 

Then we perform the rolling join, passing -1L as the roll value:

dt2[dt1, `:=`(id = i.id, flag = 1L), roll = -1L]
dt2
#          name  exp id flag
# 1:   jane doe  0.0  1    1
# 2:   jane doe 30.0 NA   NA
# 3: jane smith  2.0  4    1
# 4:   john doe  1.5  3    1
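
To see which rows the rolling join pairs up before modifying anything, the same join can be run with which = TRUE, which returns the matched row numbers of dt2 instead of updating it. This is only a sketch on the example data from the question; the variable name matched_rows is made up for illustration.

```r
library(data.table)

# Example data from the question, keyed on name and exp as in the answer
dt1 <- data.table(name = c("jane doe", "jane doe", "john doe", "jane smith"),
                  exp  = c(0, 5, 1, 2), id = 1:4)
dt2 <- data.table(name = c("jane doe", "jane doe", "john doe", "jane smith"),
                  exp  = c(0, 30, 1.5, 2))
setkey(dt1, name, exp)
setkey(dt2, name, exp)

# For each row of dt1 (in key order), the row of dt2 it joins to, or NA.
# roll = -1L takes an exact match on exp, or else the next larger exp in
# dt2 for the same name, provided it is at most 1 away.
matched_rows <- dt2[dt1, roll = -1L, which = TRUE]
matched_rows
# [1]  1 NA  3  4
```

The NA belongs to the dt1 "jane doe" with exp 5: the next dt2 value for that name is 30, which is more than one year away, so no match is made.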

In the future, if you need an interval join, such as a window of one year in either direction (c(1L, -1L)), take a look at the foverlaps function.
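
As a rough sketch of what such an interval join might look like with foverlaps, the code below matches each dt1 row to any dt2 row of the same name whose exp lies within one year in either direction. The helper columns lo, hi, start and end are assumptions introduced for illustration, and the symmetric window is a different rule from the one-directional roll = -1L used above.

```r
library(data.table)

dt1 <- data.table(name = c("jane doe", "jane doe", "john doe", "jane smith"),
                  exp  = c(0, 5, 1, 2), id = 1:4)
dt2 <- data.table(name = c("jane doe", "jane doe", "john doe", "jane smith"),
                  exp  = c(0, 30, 1.5, 2))

# Hypothetical helper columns: a +/- 1 year window around each dt1 value
dt1[, `:=`(lo = exp - 1, hi = exp + 1)]
# foverlaps works on intervals, so each dt2 point becomes [exp, exp]
dt2[, `:=`(start = exp, end = exp)]
# The second table must be keyed, with the interval columns last
setkey(dt2, name, start, end)

# Keep only dt1 rows whose window contains a dt2 point for the same name
window_matches <- foverlaps(dt1, dt2,
                            by.x = c("name", "lo", "hi"),
                            by.y = c("name", "start", "end"),
                            type = "any", nomatch = 0L)
window_matches[, .(name, id)]
```

Because both tables have an exp column, the result carries both copies, so it is safest to select columns by name, as in the last line.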
