R: Fuzzy merge using agrep and data.table

Question

I am trying to merge two data.tables, but because the stock names are spelled differently I lose a substantial number of data points. Hence, instead of an exact match, I am looking into a fuzzy merge.

library("data.table")
dt1 = data.table(Name = c("ASML HOLDING","ABN AMRO GROUP"), A = c(1,2))
dt2 = data.table(Name = c("ASML HOLDING NV", "ABN AMRO GROUP"), B = c("p", "q"))

When merging dt1 and dt2 on "Name", ASML HOLDING is excluded because dt2 appends "NV" to the name, even though both rows refer to the same stock.

The preferred final output would look something like:

              Name A B
1:  ABN AMRO GROUP 2 q
2: ASML HOLDING NV 1 p

What I tried next was the following:

dt1 = dt1[, dt2_NAME := agrep(dt1$Name, dt2$Name, ignore.case = TRUE, value = TRUE, max.distance = 0.05, useBytes = TRUE)]

However, I get the following error:

argument 'pattern' has length > 1 and only the first element will be used

The error makes sense, as dt1$Name has more than one element, but I believe this could work if dt1$Name were evaluated row by row.

It might be a stupid mistake, but for some reason I just can't get my head around it. I also prefer to use data.table, as my dataset is fairly large and so far it has worked splendidly. Additionally, I am new to Stack Overflow, so apologies if my question is somewhat off.

Lastly, I found a piece of code that does the job but is too slow for practical use (from "Fuzzy merge in R"):

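dt1$Name_dt2 <- "" # create an empty column to hold the matched name
for (i in seq_len(nrow(dt1))) {
  # approximate-match the i-th name against all names in dt2
  x <- agrep(dt1$Name[i], dt2$Name,
             ignore.case = TRUE, value = TRUE,
             max.distance = 0.05, useBytes = TRUE)
  x <- paste0(x, "") # turns character(0) (no match) into ""
  dt1$Name_dt2[i] <- x
}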

Answer 1

A possible solution using fuzzyjoin:

library(fuzzyjoin)
library(dplyr)  # provides %>% and filter(), both used below
f <- Vectorize(function(x,y) agrepl(x, y,
                                   ignore.case=TRUE,
                                   max.distance = 0.05, useBytes = TRUE))

dt1 %>% fuzzy_inner_join(dt2, by="Name", match_fun=f)
#          Name.x A          Name.y B
#1   ASML HOLDING 1 ASML HOLDING NV p
#2 ABN AMRO GROUP 2  ABN AMRO GROUP q

NOTE: The main problem, which you encountered too, is that agrep and agrepl expect the first argument (the pattern) to be a single string and use only its first element when given a vector. That is why I wrapped the call with Vectorize.
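To illustrate what the Vectorize wrapper does, here is a minimal sketch in base R. The vectors x and y are made-up inputs for this illustration only, and useBytes is left out for brevity; the wrapper compares the names pair by pair (x[1] with y[1], x[2] with y[2], and so on), which is roughly the shape of input a match_fun receives.

```r
# agrepl() takes a single pattern; Vectorize() wraps it so that it is
# applied element-wise over pairs (x[i], y[i]).
f <- Vectorize(function(x, y) agrepl(x, y,
                                     ignore.case = TRUE,
                                     max.distance = 0.05))

# Illustrative inputs (not part of the original data).
x <- c("ASML HOLDING", "ABN AMRO GROUP", "ASML HOLDING")
y <- c("ASML HOLDING NV", "ABN AMRO GROUP", "ABN AMRO GROUP")

# unname() drops the names that Vectorize() copies over from x.
res <- unname(f(x, y))
print(res)
```

The first two pairs match because the pattern occurs inside the candidate string; the third pair does not match.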

This method can be used together with an equi-join (mind the order of the columns in by!):

dt1 = data.frame(Name = c("ASML HOLDING","ABN AMRO GROUP"), A = c(1,2),Date=c(1,2))
dt2 = data.frame(Name = c("ASML HOLDING NV", "ABN AMRO GROUP", "ABN AMRO GROUP"), B = c("p", "q","r"),Date=c(1,2,3))

dt1 %>% fuzzy_inner_join(dt2, by=c("Date","Name"), match_fun=f) %>% filter(Date.x==Date.y)
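If you would rather stay within data.table, the row-by-row idea from the question can be sketched as below. This is only an illustration under some assumptions: first_fuzzy_match is a hypothetical helper (not part of any package), it keeps only the first approximate match per name, and whether it is fast enough on a large dataset is untested. Looking up each distinct name once may reduce the work when names repeat.

```r
library(data.table)

dt1 <- data.table(Name = c("ASML HOLDING", "ABN AMRO GROUP"), A = c(1, 2))
dt2 <- data.table(Name = c("ASML HOLDING NV", "ABN AMRO GROUP"), B = c("p", "q"))

# Hypothetical helper: first approximate match for one pattern, or NA if none.
first_fuzzy_match <- function(pattern, candidates) {
  hits <- agrep(pattern, candidates, ignore.case = TRUE,
                value = TRUE, max.distance = 0.05)
  if (length(hits) > 0L) hits[1L] else NA_character_
}

# Look up each distinct name once; vapply() names the result by the input names.
lookup <- vapply(unique(dt1$Name), first_fuzzy_match, character(1),
                 candidates = dt2$Name)

# Add the matched name by reference, then do an ordinary merge on it.
dt1[, Name_dt2 := unname(lookup[Name])]
merged <- merge(dt1, dt2, by.x = "Name_dt2", by.y = "Name")
print(merged)
```

Names without any approximate match get NA in Name_dt2, so it is worth checking for those before merging.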

Answer 1 (accepted, 2018-09-19 09:59:42)
