

Faster R code for fuzzy name matching using agrep() for multiple patterns…?

I'm a bit of an R novice and have been experimenting with the agrep function in R. I have a large database of customers (1.5 million rows) that I'm sure contains many duplicates. Many of the duplicates, though, are not revealed by using table() to get the frequency of repeated exact names. Just eyeballing some of the rows, I have noticed many duplicates that look "unique" because of a minor mis-key in the spelling of the name.
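As a tiny illustration of the problem (with made-up names): table() only counts exact repeats, while agrep() also catches a near-miss spelling.

```r
# Hypothetical customer names: "Jon Smith" is a mis-keyed "John Smith"
names <- c("John Smith", "John Smith", "Jon Smith", "Jane Doe")

# Exact-duplicate check: only the literal repeat is flagged
freq <- table(names)
freq[freq > 1]  # "John Smith" appears twice; "Jon Smith" is missed

# Fuzzy check: agrep() also picks up the one-character mis-key
agrep("John Smith", names, max.distance = 1, value = TRUE)
# returns "John Smith", "John Smith", "Jon Smith"
```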

So far, to find all of the duplicates in my data set, I have been using agrep() to do the fuzzy name matching. I have been playing around with the max.distance argument in agrep() to return different approximate matches, and I think I have found a happy medium between returning false positives and missing true matches. Since agrep() is limited to matching a single pattern at a time, I found an entry on Stack Overflow that helped me write a sapply call to match the data set against numerous patterns. Here is the code I am using to loop over the patterns as it combs through my data set for "duplicates":

dups4 <- data.frame(unlist(sapply(unique$name, agrep, value = TRUE, max.distance = 0.154, vf$name)))

unique$name is the unique index I developed that contains all of the "patterns" I wish to hunt for in my data set.

vf$name is the column in my data frame that contains all of my customer names.
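A minimal, self-contained version of the call above, with toy stand-ins for unique$name and vf$name (the names and the 0.15 max.distance are illustrative only):

```r
# Toy stand-ins for unique$name (patterns) and vf$name (customer names)
patterns  <- c("John Smith", "Jane Doe")
customers <- c("John Smith", "Jon Smith", "Jane Doe", "Bob Jones")

# sapply() feeds each pattern to agrep() in turn; the extra positional
# argument `customers` fills agrep()'s second argument (the vector to
# search), just as vf$name does in the call above
hits <- sapply(patterns, agrep, customers,
               value = TRUE, max.distance = 0.15)

# The result is a list (one element per pattern), so flatten it
dups <- data.frame(match = unlist(hits))
# "Jon Smith" is pulled in as a fuzzy match for the "John Smith" pattern
```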

This code works well on a small sample of 600 or so customers, and the agrep works fine. My problem is when I attempt to use a unique index of 250K+ names and agrep it against my 1.5 million customers. As I type out this question, the code is still running in R and has not yet stopped (we are going on 20 minutes at this point).

Does anyone have any suggestions to speed this up or improve the code I have used? I have not yet tried anything from the plyr package. Perhaps that might be faster, though I am a little unfamiliar with the ddply and llply functions.

Any suggestions would be greatly appreciated.

I'm so sorry, I missed the request to post a solution. Here is how I solved my multiple-pattern agrep problem, and then sped things up using parallel processing.

What I am essentially doing is taking a whole vector of character strings and fuzzy matching it against itself, to find out whether there are any fuzzy-matched duplicate records in the vector.

Here I create a cluster (of twenty workers) that I wish to use in the parallel process run by parSapply:

library(parallel)
cl <- makeCluster(20)

So let's start with the innermost nesting of the code, the parSapply() call. This is what allows me to run agrep() in a parallel process. The first argument is cl, the cluster object created above, which tells parSapply() which workers to run across.

The second argument is the vector of patterns I wish to match against. The third argument is the function I wish to use to do the matching (in this case agrep). The subsequent arguments are all passed on to agrep(): I have specified that I want the actual character strings returned (not their positions) using value = TRUE, and I have set the max.distance I am willing to accept in a fuzzy match, in this case a cost of 2. The last argument is the full list of strings to be matched against the patterns (argument 2); as it so happens, I am looking to identify duplicates, so I match the vector against itself. The final output is a list, so I use unlist() and wrap it in a data frame to basically get a table of matches. From there, I can easily run a frequency table on the result to find which fuzzy-matched character strings have a frequency greater than 1, which tells me that such a string matched itself and at least one other string in the vector.

truedupevf <- data.frame(unlist(parSapply(cl,
                                          s4dupe$fuzzydob, agrep, value = TRUE,
                                          max.distance = 2, s4dupe$fuzzydob)))
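The steps above can be put together as a runnable sketch. The cluster is cut to 2 workers so it runs anywhere, s4dupe$fuzzydob is replaced by a made-up toy vector, and a stopCluster() call is added to release the workers afterwards:

```r
library(parallel)

# Made-up stand-in for s4dupe$fuzzydob
fuzzydob <- c("19800101", "19800102", "19751231", "19800101")

cl <- makeCluster(2)  # 2 workers instead of 20 for this sketch
matches <- parSapply(cl, fuzzydob, agrep, value = TRUE,
                     max.distance = 2, fuzzydob)
stopCluster(cl)  # release the workers when finished

truedupe <- data.frame(match = unlist(matches))

# Strings matched by more than one pattern are candidate duplicates
freq <- table(truedupe$match)
freq[freq > 1]  # "19800101" and "19800102" fuzzy-match each other
```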

I hope this helps.
