简体   繁体   中英

R: Fuzzy merge using agrep and data.table

I try to merge two data.tables, but due to different spelling in stock names I lose a substantial number of data points. Hence, instead of an exact match I was looking into a fuzzy merge.

library("data.table")
dt1 = data.table(Name = c("ASML HOLDING","ABN AMRO GROUP"), A = c(1,2))
dt2 = data.table(Name = c("ASML HOLDING NV", "ABN AMRO GROUP"), B = c("p", "q"))

When merging dt1 and dt2 on "Name", ASML HOLDING will be excluded due to the addition of "NV", while the actual data would be accurate.

The prefered final data output would look somthing like:

              Name A B
1:  ABN AMRO GROUP 2 q
2: ASML HOLDING NV 1 p

What I tried next was the following:

dt1 = dt1[, dt2_NAME := agrep(dt1$Name, dt2$Name, ignore.case = TRUE, value = TRUE, max.distance = 0.05, useBytes = TRUE)]

However, I get the following error,

argument 'pattern' has length > 1 and only the first element will be used

The error makes sense as dt1$Name is longer than 1, but I believe it would be a possible solution if it would consider dt1$Name on a row to row basis.

It might be a stupid mistake, but for some reason I just can't get my head around it. Furthermore, I prefer to use data.table as my dataset is fairly large and up till now it has worked splendidly. Additionally, I am new to stack overflow, so sorry if my question is somewhat off.

Lastly, I found a piece of code which does the job, but is too slow for practical usage. Fuzzy merge in R

dt1$Name_dt2 <- "" # Creating an empty column
for(i in 1:dim(dt1)[1]) {
  x <- agrep(dt1$Name[i], dt2$Name,
             ignore.case=TRUE, value=TRUE,
             max.distance = 0.05, useBytes = TRUE)
  x <- paste0(x,"")
  dt1$Name_dt2[i] <- x
}

A possible solution using 'fuzzyjoin':

library(fuzzyjoin)
f <- Vectorize(function(x,y) agrepl(x, y,
                                   ignore.case=TRUE,
                                   max.distance = 0.05, useBytes = TRUE))

dt1 %>% fuzzy_inner_join(dt2, by="Name", match_fun=f)
#          Name.x A          Name.y B
#1   ASML HOLDING 1 ASML HOLDING NV p
#2 ABN AMRO GROUP 2  ABN AMRO GROUP q

NOTE : The main problem, that you encountered too, was that agrep and agrepl don't seem to expect the first argument to be a vector. That's the reason why I wrapped the call with Vectorize .

This method can be used together with an equi-join (mind the order of columns in the by !):

dt1 = data.frame(Name = c("ASML HOLDING","ABN AMRO GROUP"), A = c(1,2),Date=c(1,2))
dt2 = data.frame(Name = c("ASML HOLDING NV", "ABN AMRO GROUP", "ABN AMRO GROUP"), B = c("p", "q","r"),Date=c(1,2,3))

dt1 %>% fuzzy_inner_join(dt2, by=c("Date","Name"), match_fun=f) %>% filter(Date.x==Date.y)

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM