
Fuzzy string matching with agrep()

I'm matching a list of company names against itself with R and agrep() because the data was stored badly in a legacy system. It was not in fourth normal form: companies were recorded on the same level as customers, so every new customer got a new company entry, which led to many different company names for a single company. The matching works fine in a lot of cases.

Sometimes, especially for short strings, I get matches that look weird, at least to me. For example (ABC is the first company name):

ABC ABAXIS Europe GmbH

ABC ABB Europe

ABC ABB Group

ABC ABB Stotz Kontakt GmbH

ABC ABM Financial News

ABC ABN AMRO Bank NV

ABC AC Klöser GmbH

ABC ACCBank

ABC ACEA S.p.A.

I'm using agrep() with the following parameters:

agrep(vector1, vector2, value = TRUE, ignore.case = FALSE, max.distance = 0.01)
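For reference, the agrep() documentation describes a max.distance below 1 as a fraction of the pattern length that is rounded up to the next integer. For a three-character pattern such as ABC, 0.01 therefore still permits one insertion, deletion or substitution, and agrep() looks for that approximate match anywhere inside each name. A minimal sketch of the effect, using three names from the list above plus "Siemens AG", which is a made-up control value:

```r
# Three names copied from the list above; "Siemens AG" is a hypothetical control.
companies <- c("ABAXIS Europe GmbH", "ABB Europe", "ACCBank", "Siemens AG")

# 0.01 is read as a fraction of the pattern length: 0.01 * 3 = 0.03,
# which is rounded up to 1 allowed edit.
loose <- agrep("ABC", companies, max.distance = 0.01, value = TRUE)

# A distance of 0 only accepts names that contain "ABC" unchanged.
strict <- agrep("ABC", companies, max.distance = 0, value = TRUE)

print(loose)
print(strict)
```

With one edit allowed, any name containing a fragment such as "AB", "AC" or "ABB" qualifies, which would account for the matches listed above.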

Is there any way to tweak agrep() other than the max distance, or is there a better way to do this?

Thanks in advance.

For a similar problem, I used the second method described in this article: http://bigdata-doctor.com/fuzzy-string-matching-survival-skill-tackle-unstructured-information-r/#comment-942

It matches each record with the most similar one, which of course is not optimal if false positives are a problem for you.
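As an illustration of matching each record with its most similar one, here is a minimal sketch built on base R's adist(). It is an assumption about how such a matching could be done and may differ from what the linked article does; the sample vector is hypothetical.

```r
# Hypothetical sample data; replace with the real vector of company names.
companies <- c("ABB Europe", "ABB Group", "ACCBank", "ACC Bank", "ABN AMRO Bank NV")

# Pairwise generalized Levenshtein distances between all names.
d <- adist(companies)

# Every name has distance 0 to itself, so mask the diagonal
# before looking for the closest *other* name.
diag(d) <- NA

# Index of the nearest other name for each row.
nearest <- apply(d, 1, which.min)

matches <- data.frame(
  name     = companies,
  best     = companies[nearest],
  distance = d[cbind(seq_along(companies), nearest)],
  stringsAsFactors = FALSE
)
print(matches)
```

Because every name gets a partner, even when nothing is really similar, the distance column is worth inspecting: a cutoff on it (chosen for your data) is one way to drop the weakest pairs.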

Additionally, you may find this function useful for removing whitespace before and after the names:

  # Return x without leading or trailing whitespace
  trim <- function(x) gsub("^\\s+|\\s+$", "", x)
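A short usage sketch for the function above, with made-up input values. It also compares the result with trimws(), which ships with base R from version 3.2.0 on and does the same job for ordinary spaces, tabs and newlines.

```r
# Same helper as above: strip leading and trailing whitespace.
trim <- function(x) gsub("^\\s+|\\s+$", "", x)

# Hypothetical raw names with stray spaces, a tab and a newline.
raw_names <- c("  ABB Europe", "ABB Group  ", "\tACCBank\n")

clean_names <- trim(raw_names)
print(clean_names)

# trimws() from base R should give the same result for this input.
print(identical(clean_names, trimws(raw_names)))
```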

I also used the removeWords() function from the "tm" package. In your case, removing "ABC" may be useful.
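To show the idea without requiring the tm package, here is a base R sketch that strips whole words before matching. The word list is hypothetical and should be adapted to your data; tm::removeWords(x, words) is meant to do roughly the same thing, though its exact behaviour may differ in details.

```r
# Hypothetical list of tokens that carry little information for matching.
noise_words <- c("GmbH", "Group", "Europe")

companies <- c("ABAXIS Europe GmbH", "ABB Europe", "ABB Group", "ACCBank")

# One pattern that matches any noise word as a whole word.
pattern <- paste0("\\b(", paste(noise_words, collapse = "|"), ")\\b")

stripped <- gsub(pattern, "", companies, perl = TRUE)

# Collapse the gaps left behind and trim the ends.
stripped <- trimws(gsub("\\s+", " ", stripped))
print(stripped)
```

Note that the shortened names get even shorter, so a distance threshold that scales with pattern length becomes less forgiving in relative terms; it may be worth checking the matches on a sample after cleaning.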
