Joining data sets in R where the unique ids have spelling mistakes

Hi, I am trying to join two large datasets (more than 10,000 entries each). To do this I have created a 'unique ID', a combination of full name and date of birth, which is present in both. However, the IDs contain spelling mistakes and different characters, so many rows won't match when I use a left join. I don't have access to fuzzyjoin/match, so I can't use these to partially match the IDs. Someone has suggested using adist(). How can I use this to match and merge the datasets, or to flag IDs which are close to matching? As simple as possible please, I have only been using R for a few weeks! Examples of code would be amazing.

You could just rename them to names that are spelled correctly:

df$correct_spelling <- df$incorrect_spelling

This may be a bit of a manual solution, but perhaps a base R solution would be to look through the unique values of the join fields, correct any that are misspelled using the grep() function, and create a crosswalk to merge into the dataframes with misspelled unique IDs. Here's a trivial example of what I mean:

Let's say we have a dataframe of scientists and their year of birth, and then we have a second dataframe with the scientists' names and their field of study, but the "names" column is riddled with spelling errors. Here is the code to make the example dataframes:

##Fake Data##
Names<-c("Jill", "Maria", "Carlos", "DeAndre") #Names
BirthYears<-c(1974, 1980, 1991, 1985) # Birthyears 
Field<-c("Astronomy", "Geology", "Medicine", "Ecology") # Fields of science
Mispelled<-c("Deandre", "Marai", "Jil", "Clarlos")# Names misspelled

##Creating Dataframes##
DF<-data.frame(Names=Names, Birth=BirthYears) # Dataframe with correct spellings
DF2<-data.frame(Names=Mispelled, Field=Field) # Dataframe with incorrect spellings we want to fix and merge

What we can do is collect the unique values of the correctly spelled and the misspelled versions of the scientists' names with unique(), and then use the regular expression search function grep() to look up the correct name for each misspelled one.

Mispelled2<-unique(DF2$Names) # Get unique values of names from misspelled dataframe
Correct<-unique(DF$Names) # Get unique values of names from correctly spelled dataframe

fix<-NULL #blank vector to save results from loop

for(i in seq_along(Mispelled2)){ #Looping through all unique misspelled values
  ptn<-paste("^",substring(Mispelled2[i],1,1), "+", sep="") #Regular expression matching names that start with the same first letter
  fix[i]<-grep(ptn, Correct, value=TRUE) #Finding and saving the replacement name (assumes exactly one correct name matches)
}#End loop
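The loop above only works when every pattern matches exactly one correct name. With real data a pattern can easily match zero or several names, so here is a minimal sketch of a guarded version that leaves those cases as NA for manual review. The extra names "Marcus" and "Xavier" are made up for illustration and are not part of the example data.

```r
# Hypothetical inputs: "Marcus" and "Xavier" are added only to show the problem cases
Correct <- c("Jill", "Maria", "Carlos", "DeAndre", "Marcus")
Mispelled2 <- c("Deandre", "Marai", "Jil", "Xavier")

fix <- rep(NA_character_, length(Mispelled2)) # NA means "no safe replacement found"

for (i in seq_along(Mispelled2)) {
  ptn <- paste0("^", substring(Mispelled2[i], 1, 1)) # same first letter
  hits <- grep(ptn, Correct, value = TRUE)
  if (length(hits) == 1) { # accept only an unambiguous match
    fix[i] <- hits
  }
}

# Misspelled names that still need a manual decision
needs_review <- Mispelled2[is.na(fix)]
```

Here "Marai" stays unmatched because both "Maria" and "Marcus" start with "M", and "Xavier" stays unmatched because nothing starts with "X".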

You'll have to come up with the regular expressions necessary for your situation, here is a link to a cheatsheet with how to build regular expressions

https://rstudio.com/wp-content/uploads/2016/09/RegExCheatsheet.pdf

Now we can make a dataframe crosswalking the misspelled names with the correct spellings, i.e., Row 1 would have "Deandre" and "DeAndre", Row 2 would have "Marai" and "Maria", and so on.

CWX<-data.frame(Name_wrong=Mispelled2, Name_correct=fix)

Finally, we merge the crosswalk into the dataframe with the incorrect spellings, and then merge the resulting dataframe into the dataframe with the correct spellings:

Mispelled3<-merge(DF2, CWX, by.x="Names", by.y="Name_wrong")
Joined_DF<-merge(DF, Mispelled3[,-1], by.x="Names", by.y="Name_correct")
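To check which IDs still fail to join, one option is a left join with merge(all.x=TRUE) followed by a check for missing values. This is only a sketch: the dataframes left_df and right_df below are made up for illustration and are separate from the example above.

```r
# Hypothetical dataframes; "Marai" is a misspelling that has not been corrected
left_df  <- data.frame(Names = c("Jill", "Maria", "Carlos"),
                       Birth = c(1974, 1980, 1991), stringsAsFactors = FALSE)
right_df <- data.frame(Names = c("Jill", "Marai", "Carlos"),
                       Field = c("Astronomy", "Geology", "Medicine"), stringsAsFactors = FALSE)

# all.x=TRUE keeps every row of left_df, filling Field with NA where nothing matched
joined <- merge(left_df, right_df, by = "Names", all.x = TRUE)

# Flag the rows that found no partner in right_df
joined$unmatched <- is.na(joined$Field)
unmatched_names <- joined$Names[joined$unmatched]
```

The flagged names are the ones that still need an entry in the crosswalk.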

Here is what I was able to come up with for your question about matching in multiple ways. It's a bit clunky, but it works with the example data below. The trick is making the call to agrep() strict enough that it does not match names that partially overlap but are truly different, yet flexible enough that it still allows partial matches and misspellings:

Example1<-"deborahoziajames04/14/2000"
Example2<-"Somepersonnotdeborah04/15/2002"
Example3<-"AnotherpersonnamedJames01/23/1995"
Misspelled1<-"oziajames04/14/2000"
Misspelled2<-"deborahozia04/14/2000"
Misspelled3<-"deborahoziajames10/14/1990"
Misspelled4<-"personnamedJames"

String<-c(Example1, Example2, Example3)
Misspelled<-c(Misspelled1, Misspelled2, Misspelled3, Misspelled4)

Spell_Correct<-function(String, Misspelled){
  out<-rep(NA_character_, length(Misspelled)) # One result per misspelled ID; NA if no unique match is found
  for(i in seq_along(Misspelled)){

    # Patterns that try the misspelled ID at the front, middle, and end of the full ID
    ptn<-c(paste('^', Misspelled[i], "\\B", sep=""),
           paste('\\B', Misspelled[i], "\\B", sep=""),
           paste('\\B', Misspelled[i], "$", sep=""))

    for(j in seq_along(ptn)){
      # fixed=FALSE makes agrep() treat the pattern as a regular expression; by default it is matched as a literal string
      hits<-agrep(pattern=ptn[j], x=String, value=TRUE, max.distance=0.3, ignore.case=TRUE, costs = c(0.6, 0.6, 0.6), fixed=FALSE)
      if(length(hits)==1){ # Keep a candidate only when exactly one full ID matches
        out[i]<-hits
        break
      }
    }
  }
  return(out)
}

Spell_Correct(String, Misspelled)

Basically this is just repeating the previous answer multiple times, using the loop and tweaking the regular expression to try a beginning, middle, and end call to agrep(). Depending on how bad the misspellings are, you may need to play around with the max.distance and costs arguments. Good Luck. Take Care, -Sean
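Since the question asks about adist() specifically, here is a minimal base R sketch of that route. adist() returns a matrix of edit distances (insertions, deletions, and substitutions) between every pair of strings, so for each ID in one dataset you can pick the closest ID in the other and flag it as a close match if the distance is small. The dataframes DF_A and DF_B, their columns, and the cutoff max_edits are all assumptions made up for illustration.

```r
# Hypothetical data: the id column combines full name and date of birth
DF_A <- data.frame(id = c("deborahoziajames04/14/2000", "johnsmith01/23/1995", "mariagarcia07/02/1988"),
                   score = c(10, 20, 30), stringsAsFactors = FALSE)
DF_B <- data.frame(id = c("deborahozia-james04/14/2000", "jonsmith01/23/1995", "peterjones11/30/1970"),
                   city = c("Leeds", "York", "Bath"), stringsAsFactors = FALSE)

# Edit-distance matrix: one row per DF_A id, one column per DF_B id
d <- adist(DF_A$id, DF_B$id, ignore.case = TRUE)

best_col  <- apply(d, 1, which.min) # column of the closest DF_B id for each DF_A id
best_dist <- apply(d, 1, min)       # number of edits to that closest id

max_edits <- 2 # assumed cutoff; tune it by inspecting the flagged pairs

DF_A$closest_id  <- DF_B$id[best_col]
DF_A$distance    <- best_dist
DF_A$close_match <- best_dist <= max_edits

# Only trust the closest id when it is within the cutoff, then join as usual
DF_A$join_id <- ifelse(DF_A$close_match, DF_A$closest_id, NA)
Joined <- merge(DF_A, DF_B, by.x = "join_id", by.y = "id", all.x = TRUE)
```

Rows where close_match is FALSE keep NA in the columns coming from DF_B, so they are easy to review by hand. Note that the distance matrix has one cell per pair of IDs, so with more than 10,000 IDs in each dataset it has over 10^8 cells; it may be necessary to process one dataset in chunks.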
