Alternative approach to using agrep() for fuzzy matching in R

I have a large file of administrative data, about 1 million records. Individual people can be represented multiple times in this dataset. About half the records have an identifying code that maps records to individuals; for the half that don't, I need to fuzzy match names to flag records that potentially belong to the same person.

From looking at the records with the identifying code, I've created a list of differences that have occurred in the recording of names for the same individual:

  • Inclusion of a middle name, e.g. Jon Snow vs Jon Targaryen Snow
  • Inclusion of a second last name, e.g. Jon Snow vs Jon Targaryen-Snow
  • Nickname / shortening of first name, e.g. Jonathon Snow vs Jon Snow
  • Reversal of names, e.g. Jon Snow vs Snow Jon
  • Misspellings/typos/variants, e.g. Samual/Samuel, Monica/Monika, Rafael/Raphael

Given the types of matches I'm after, is there a better approach than using agrep() / Levenshtein distance that is easily implemented in R?

Edit: agrep() in R doesn't do a very good job with this problem. Because of the large number of insertions and substitutions I need to allow to account for the ways names are recorded differently, a lot of false matches are thrown up.
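To make the problem concrete, here is a small sketch using base R's adist(), which computes the generalized Levenshtein distance. The name pairs are the examples from the list above; the variable names are only illustrative.

```r
# Edit distances for the kinds of name variation listed in the question.
pairs <- rbind(
  c("Jon Snow",      "Jon Targaryen Snow"),  # middle name added
  c("Jon Snow",      "Jon Targaryen-Snow"),  # second last name added
  c("Jonathon Snow", "Jon Snow"),            # shortened first name
  c("Jon Snow",      "Snow Jon"),            # reversal
  c("Samual",        "Samuel")               # misspelling
)

# adist() returns a matrix; take the single value for each pair.
edit_dist <- mapply(
  function(a, b) as.integer(adist(a, b)),
  pairs[, 1], pairs[, 2],
  USE.NAMES = FALSE
)

data.frame(a = pairs[, 1], b = pairs[, 2], edit_dist = edit_dist)
```

The added middle name alone costs 10 edits against an 8-character name, while the misspelling costs 1. A tolerance loose enough to accept the former will accept a great many unrelated names, which is consistent with the false matches described above.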

I would make multiple passes.

"Jon .* Snow" - Middle name

"Jon .*Snow" - Second last name

Nicknames will require a dictionary of mappings from long form to short; there's no regular expression that will handle this.
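A minimal sketch of such a dictionary in base R follows. The entries in nickname_map and the helper canonical_first() are hypothetical; a real mapping would have to be built from the data or from an external nickname list.

```r
# Hypothetical dictionary: names are the short forms, values the long forms.
nickname_map <- c(jon = "jonathon", sam = "samuel")

# Map each first name to its long form when it is in the dictionary;
# otherwise return the lower-cased name unchanged.
canonical_first <- function(first) {
  key <- tolower(first)
  hit <- nickname_map[key]        # NA where the name is not in the dictionary
  unname(ifelse(is.na(hit), key, hit))
}

canonical_first(c("Jon", "Jonathon", "Arya"))
```

Comparing canonical forms rather than raw first names lets "Jon Snow" and "Jonathon Snow" agree exactly, so no edit-distance tolerance is spent on the nickname.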

"Snow Jon" - Reversal (duh)

agrep will handle minor misspellings.
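The passes above could be sketched in base R roughly like this. The records vector is made-up test data, the patterns are anchored versions of the ones listed, and the sketch assumes names contain no regex metacharacters.

```r
# Made-up records to test the passes against one target name.
records <- c("Jon Snow", "Jon Targaryen Snow", "Jon Targaryen-Snow",
             "Snow Jon", "Jon Snaw", "Arya Stark")
first <- "Jon"
last  <- "Snow"

# One regular expression per pass.
passes <- list(
  exact       = sprintf("^%s %s$", first, last),
  middle_name = sprintf("^%s .* %s$", first, last),
  second_last = sprintf("^%s .*%s$", first, last),
  reversal    = sprintf("^%s %s$", last, first)
)

# One logical column per pass, one row per record.
flags <- sapply(passes, function(p) grepl(p, records))
rownames(flags) <- records

# Final pass: approximate matching with a small tolerance for typos.
flags <- cbind(
  flags,
  misspelling = agrepl(paste(first, last), records, max.distance = 1)
)
flags
```

Note that the second-last-name pattern is looser than the middle-name one and also matches the exact and middle-name forms, so the order in which passes are applied matters if each record should get a single label.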

You probably also want to tokenise your names into first, middle and last names.
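One possible way to do the tokenising in base R is sketched below. The helper tokenise_name() is hypothetical and assumes names are recorded as "First [Middle ...] Last", which reversed names will violate.

```r
# Split a full name on whitespace into first, middle and last parts.
tokenise_name <- function(full_name) {
  parts <- strsplit(trimws(full_name), "\\s+")[[1]]
  n <- length(parts)
  list(
    first  = parts[1],
    # Everything between the first and last token counts as middle name(s).
    middle = if (n > 2) paste(parts[2:(n - 1)], collapse = " ") else NA_character_,
    last   = if (n > 1) parts[n] else NA_character_
  )
}

tokenise_name("Jon Targaryen Snow")
tokenise_name("Jon Snow")
```

With the parts separated, first and last names can be compared independently, and a missing middle name no longer counts against the match.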

The synthesisr package ( https://cran.r-project.org/web/packages/synthesisr/index.html ) might be helpful. It uses R code to mimic some of the fuzzy matching functionality in the fuzzywuzzy Python package and fuzzywuzzyR. It offers several metrics taken from fuzzywuzzy; a lower score means greater similarity. Each method can be called in two different ways, as shown below.

Specifically, in this case, the "token" functions might be useful, since strings are tokenized by whitespace and then alphabetized to deal with situations like reversals.

library(synthesisr)

# Each metric can be called directly or through fuzzdist() with method = "...".
# A lower score means greater similarity.
fuzz_m_ratio("this is a test", "this is a test!")
fuzzdist("this is a test", "this is a test!", method = "fuzz_m_ratio")

fuzz_partial_ratio("this is a test", "this is a test!")
fuzzdist("this is a test", "this is a test!", method = "fuzz_partial_ratio")

# The "token" metrics split on whitespace before comparing.
fuzz_token_sort_ratio("this is a test", "this is a test!")
fuzzdist("this is a test", "this is a test!", method = "fuzz_token_sort_ratio")

fuzz_token_set_ratio("this is a test", "this is a test!")
fuzzdist("this is a test", "this is a test!", method = "fuzz_token_set_ratio")
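To illustrate the idea behind the "token" functions, here is a base R sketch of token sorting. It is not the package's implementation, only a hypothetical helper (token_sort) showing why splitting on whitespace and alphabetizing neutralises reversals.

```r
# Lower-case each name, split on whitespace, sort the tokens, and rejoin.
token_sort <- function(x) {
  tokens <- strsplit(tolower(trimws(x)), "\\s+")
  vapply(tokens, function(tk) paste(sort(tk), collapse = " "), character(1))
}

token_sort(c("Jon Snow", "Snow Jon"))

# After sorting, the reversed name is identical to the original,
# so the edit distance between them drops to zero.
adist(token_sort("Jon Snow"), token_sort("Snow Jon"))
```

The same normalisation could be applied before agrep() or adist(), so that the distance tolerance only has to cover typos rather than word order.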


 