
Alternative approach to using agrep() for fuzzy matching in R

I have a large file of administrative data, about 1 million records. Individual people can be represented multiple times in this dataset. About half the records have an identifying code that maps records to individuals; for the half that don't, I need to fuzzy match names to flag records that potentially belong to the same person.

From looking at the records with the identifying code, I've created a list of differences that have occurred in the recording of names for the same individual:

  • Inclusion of middle name eg Jon Snow vs Jon Targaryen Snow
  • Inclusion of a second last name eg Jon Snow vs Jon Targaryen-Snow
  • Nickname / shortening of first name eg Jonathon Snow vs Jon Snow
  • Reversal of names eg Jon Snow vs Snow Jon
  • Misspellings/typos/variants: eg Samual/Samuel, Monica/Monika, Rafael/Raphael

Given the types of matches I'm after, is there a better approach than agrep()/Levenshtein distance that is easily implemented in R?

Edit: agrep() in R doesn't do a very good job with this problem - because of the large number of insertions and substitutions I need to allow to account for the ways names are recorded differently, a lot of false matches are thrown up.
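As a quick sketch of the false-positive problem (the names below are invented for illustration): an edit distance generous enough to absorb a missing middle name also pulls in entirely different people, while a strict one misses genuine variants.

```r
# Pool of invented names for illustration
pool <- c("Jon Snow", "Jon Targaryen Snow", "Joe Stone", "Ron Slow")

# A generous max.distance matches everything, including unrelated names
agrep("Jon Snow", pool, max.distance = 4, value = TRUE)

# A strict max.distance returns only the exact name, missing the
# genuine variant "Jon Targaryen Snow"
agrep("Jon Snow", pool, max.distance = 1, value = TRUE)
```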

I would make multiple passes.

"Jon .* Snow" - Middle name

"Jon .*Snow" - Second last name

Nicknames will require a dictionary of mappings from long form to short; there's no regular expression that'll handle this.
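A minimal version of such a dictionary might look like this (the mappings are invented for illustration; a production table would need far more entries):

```r
# Toy nickname dictionary: lowercase short form -> canonical long form.
# These three mappings are illustrative only.
nicknames <- c(jon = "jonathon", sam = "samuel", monika = "monica")

# Normalise a first name to its canonical form before matching
canonical <- function(first) {
  key <- tolower(first)
  if (key %in% names(nicknames)) nicknames[[key]] else key
}

canonical("Jon")   # "jonathon"
canonical("Arya")  # "arya" (unchanged: not in the dictionary)
```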

"Snow Jon" - Reversal (duh)

agrep will handle minor misspellings.

You probably also want to tokenise your names into first-, middle- and last-.
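One simple way to do that in base R, assuming names are separated by spaces or hyphens (the function name is mine, not from any package):

```r
# Split a full name into first / middle / last tokens.
# Assumes whitespace or hyphen separators; no package required.
tokenise_name <- function(full) {
  parts <- strsplit(trimws(full), "[ -]+")[[1]]
  list(
    first  = parts[1],
    middle = if (length(parts) > 2) parts[2:(length(parts) - 1)] else character(0),
    last   = parts[length(parts)]
  )
}

tokenise_name("Jon Targaryen Snow")
# first: "Jon", middle: "Targaryen", last: "Snow"
```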

The synthesisr package ( https://cran.r-project.org/web/packages/synthesisr/index.html ) might be helpful. It uses R code to mimic some of the fuzzy matching functionality of the fuzzywuzzy Python package and of fuzzywuzzyR. It provides several similarity metrics taken from fuzzywuzzy; a lower score means greater similarity. The methods are accessible in two different ways, as shown below.

Specifically, in this case, the "token" functions might be useful since strings are tokenized by whitespace then alphabetized to deal with situations like reversals.

library(synthesisr)

# Plain pairwise comparison of the two strings
fuzz_m_ratio("this is a test", "this is a test!")
fuzzdist("this is a test", "this is a test!", method = "fuzz_m_ratio")

# Best-matching substring comparison
fuzz_partial_ratio("this is a test", "this is a test!")
fuzzdist("this is a test", "this is a test!", method = "fuzz_partial_ratio")

# Tokens are sorted alphabetically before comparison (handles reversals)
fuzz_token_sort_ratio("this is a test", "this is a test!")
fuzzdist("this is a test", "this is a test!", method = "fuzz_token_sort_ratio")

# Comparison over sets of tokens, ignoring duplicates and order
fuzz_token_set_ratio("this is a test", "this is a test!")
fuzzdist("this is a test", "this is a test!", method = "fuzz_token_set_ratio")
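If pulling in a package is not desired, the token-sort idea can be reproduced in base R with adist() (generalised Levenshtein distance). This sketch sorts lowercased tokens before comparing, so reversals like "Jon Snow" / "Snow Jon" become identical:

```r
# Sort a name's lowercased tokens alphabetically so word order doesn't matter
token_sort <- function(s) {
  paste(sort(strsplit(tolower(s), "[ -]+")[[1]]), collapse = " ")
}

# Edit distance after token sorting: 0 means the same tokens in any order
adist(token_sort("Jon Snow"), token_sort("Snow Jon"))  # distance 0
```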
