Fuzzy string matching with agrep()

I'm matching a list of company names against itself with R and agrep() because the data was stored badly in a legacy system: there was no 4th normal form, and companies were recorded on the same level as customers. That means a new company entry for every new customer, which leads to a lot of different spellings of the name of one company. The matching works fine in a lot of cases.

Sometimes, especially for short strings, I get matches that look weird to me, for example (ABC is the first company name):

ABC ABAXIS Europe GmbH

ABC ABB Europe

ABC ABB Group

ABC ABB Stotz Kontakt GmbH

ABC ABM Financial News

ABC ABN AMRO Bank NV

ABC AC Klöser GmbH

ABC ACCBank

ABC ACEA S.p.A.

I'm using agrep() with the following parameters:

agrep(vector1, vector2, value = TRUE, ignore.case = FALSE, max.distance = 0.01)

Is there any way other than max.distance to tweak agrep(), or is there a better way to do this?

Thanks in advance

For a similar problem, I used the second method described in this article: http://bigdata-doctor.com/fuzzy-string-matching-survival-skill-tackle-unstructured-information-r/#comment-942

It matches each record with the most similar one, which of course is not optimal if false positives are a problem for you, because every record gets a match even when nothing really similar exists.
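To make the "most similar one" idea concrete, here is a minimal sketch. It is not the article's code; it assumes adist() from the utils package as the distance measure, and the company names and the cutoff value are made up for illustration. As far as I understand the documentation, agrep() looks for an approximate match of the pattern anywhere inside each candidate, and a fractional max.distance is scaled by the pattern length and rounded up, so a three-letter pattern like ABC still tolerates one edit. adist() compares whole strings instead, and dividing by the longer string's length lets you apply a cutoff that is not biased against short names.

```r
# Hypothetical example data; not taken from the original question.
companies <- c("ABB Europe", "ABB Group", "ABB Grp", "ACCBank", "ACC Bank")

# Whole-string generalized Levenshtein distance between every pair of names.
d <- utils::adist(companies)

# Normalize by the length of the longer name in each pair (range 0..1).
len <- nchar(companies)
d_norm <- d / outer(len, len, pmax)

# A name must not be matched with itself.
diag(d_norm) <- Inf

# Index of the most similar other name for each row.
nearest <- apply(d_norm, 1, which.min)

# Assumed cutoff; tune it on your own data to trade recall for precision.
cutoff <- 0.25

result <- data.frame(
  name       = companies,
  best_match = companies[nearest],
  distance   = d_norm[cbind(seq_along(companies), nearest)],
  stringsAsFactors = FALSE
)

# Flag only the pairs that are close enough to be considered the same company.
result$accepted <- result$distance <= cutoff

print(result)
```

Rows where accepted is FALSE are the ones that nearest-match alone would have turned into false positives. The full distance matrix grows quadratically with the number of names, so for a very long list you may need to compare within blocks (for example, names sharing a first letter).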

Additionally, you may find this function useful for removing white space before and after the names:

  # Returns the string without leading or trailing whitespace
  trim <- function(x) gsub("^\\s+|\\s+$", "", x)

I also used the removeWords() function from the "tm" package. In your case, removing "ABC" may be useful.
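The two cleaning steps can be combined before matching. The following is only a sketch in base R: drop_words() is a hypothetical helper standing in for tm's removeWords() so that the example runs without extra packages, and the sample names are made up. It assumes the words to remove contain no regex metacharacters.

```r
# Returns the string without leading or trailing whitespace.
trim <- function(x) gsub("^\\s+|\\s+$", "", x)

# Hypothetical helper: removes whole-word occurrences of `words`.
# Assumes the words contain no regex metacharacters.
drop_words <- function(x, words) {
  pattern <- paste0("\\b(", paste(words, collapse = "|"), ")\\b")
  gsub(pattern, "", x, perl = TRUE)
}

# Made-up sample names with stray whitespace and a noise word.
raw <- c("  ABC ABB Europe ", "ABC ACCBank")

# Remove the noise word, collapse repeated whitespace, then trim the ends.
clean <- trim(gsub("\\s+", " ", drop_words(raw, "ABC")))

print(clean)
```

Cleaning first means the distance is computed only on the part of the name that actually distinguishes companies.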
