简体   繁体   中英

Partial or fuzzy match in R

I want to do fuzzy match on 2 dataframes (s1 is data and s2 is reference) based on the 'Answer' column inorder to get corresponding question count and category from s2 . Ex:

s1 <- data.frame(Category =c("Stationary","TransferRelocationClaim","IMS"),
Question =c( "Where do I get stationary items from?","Process for claiming Transfer relocation allowances.","What is IMS?"),Answer = c("Hey <firstname>, you will find it near helpdesk ","Hey <firstname>, moving to new places can be fun! To claim relocation expense please follow the steps given below- 1. request you to add the code in https://portal.mycompany.com ,enter relocation code ,add. 2. select expenses ,add expense ,other expense ,fill the form ,save ,print (select the print icon).","ims or interview management system is a tool that helps interviewers schedule all the interviews"),
stringsAsFactors = FALSE)

s2 <- data.frame(
Question = c("Where to get books?", "Procedure to order stationary?","I would like to know about my relocation and relocation expenses","tell me about relocation expense claiming","how to claim relocation expense","IMS?"),
Answer = c("Hey Anil, you will find it at the helpdesk.", "Hey, Shekhar, you will find it at the helpdesk.", "hey sonali moving to new places can be fun! to claim relocation expense please follow the steps given below- 1. request you to add the code in https://portal.mycompany.com ,enter relocation code ,add. 2. select expenses ,add expense ,other expense ,fill the form ,save ,print (select the print icon)","hey piyush moving to new places can be fun! to claim relocation expense please follow the steps given below- 1. request you to add the code in https://portal.mycompany.com ,assignments ,enter relocation code ,add. 2. select expenses ,add expense ,other expense ,fill the form ,save ,print (select the print icon). 3. attach the bills to the printout and secure approval sign-off / mail (from the pa support for new joinee relocation claims and the portal approver for existing employees). 4. drop the bills in the portal drop box (the duty manager amp, finance team can confirm the coordinates.", "hey vibha moving to new places can be fun! to claim relocation expense please follow the steps given below- 1. request you to add the code in https://portal.mycompany.com ,assignments ,enter relocation code ,add. 2. select expenses ,add expense ,other expense ,fill the form ,save ,print (select the print icon). 3. attach the bills to the printout and secure approval sign-off / mail from the pa support for new joinee relocation claims and the portal approver for existing employees). 4. drop the bills in the portal drop box (the duty manager amp, finance team can confirm the coordinates", "ims or interview management system is a tool that helps interviewers schedule all the interviews")
stringsAsFactors = FALSE)

s1$Response=gsub('[[:punct:] ]+',' ',s1$Response)
s2$Response=gsub('[[:punct:] ]+',' ',s2$Response)
s1$Response <- tolower(s1$Response)
s2$Response <- tolower(s2$Response)
s1$Response<-as.character(s1$Response)
s2$Response<-as.character(s2$Response)
# data =s1, lookup=s2
d.matrix <- stringdistmatrix(a = s2$Response, b = s1$Response, useNames="strings",method="cosine", nthread = getOption("sd_num_thread"))

#list of minimun cosines
cosines<-apply(d.matrix, 2, min)

#return list of the row number of the minimum value
minlist<-apply(d.matrix, 2, which.min) 

#return list of best matching values
matchwith<-s2$Response[minlist]

#below table contains best match and cosines
answer<-data.frame(s1$Response, matchwith, cosines)
t11=merge(x=answer,y=s2, by.x="matchwith", by.y="Response", all.x=TRUE)
View(t11)`

t11表如下图所示 Next, I have to get count of s1.Response = 3 for Question : Process for claiming Transfer relocation allowances? along with Category name. Kindly guide me as to how it can be done.

You could try matching using the agrepl function which lets you set a maximum "distance" which is the sum of "transformations needed to go from a pattern to a target. I would take out the material around the flanking angle brackets with sub :

agrepl(sub("<.+>, ", "", df1$Answer), df2$Answer, 8)
[1]  TRUE  TRUE FALSE

(Note: teh FALSE comes from my having modified the second dataframe so that it had a non-matching "answer" value.

If we slightly modify your first input we can use packages fuzzyjoin / dplyr / stringr the following way :

df1 <- data.frame(
  Category = "Stationary",
  Question = "Where do I get stationary items from?",
  Answer = "Hey <firstname>, you will find it <here>.", # <-notice the change!
  stringsAsFactors = FALSE
)

df2 <- data.frame(
    Category = c("Stat1", "Stat1"),
    Question = c("Where to get books?", "Procedure to order stationary?"),
    Answer = c("Hey Anil, you will find it at the helpdesk.", "Hey, Shekhar, you will find it at the helpdesk."),
    stringsAsFactors = FALSE
  )

We make a regex pattern from Answer :

df1 <- dplyr::mutate(
  df1,
  Answer_regex =gsub("([.|()\\^{}+$*?]|\\[|\\])", "\\\\\\1", Answer), # escape special
  Answer_regex = gsub(" *?<.*?> *?",".*?", Answer_regex), # replace place holders by .*?
  Answer_regex = paste0("^",Answer_regex,"$"))  # make sure the match is exact

We use stringr::str_detect with fuzzyjoin::fuzzy_left_join to find matches :

res <- fuzzyjoin::fuzzy_left_join(df2, df1, by= c(Answer="Answer_regex"), match_fun = stringr::str_detect )
res
#   Category.x                     Question.x                                        Answer.x Category.y
# 1      Stat1            Where to get books?     Hey Anil, you will find it at the helpdesk. Stationary
# 2      Stat1 Procedure to order stationary? Hey, Shekhar, you will find it at the helpdesk. Stationary
#                              Question.y                                  Answer.y                     Answer_regex
# 1 Where do I get stationary items from? Hey <firstname>, you will find it <here>. ^Hey.*?, you will find it.*?\\.$
# 2 Where do I get stationary items from? Hey <firstname>, you will find it <here>. ^Hey.*?, you will find it.*?\\.$

Then we can count:

dplyr::count(res,Answer.y)
# # A tibble: 1 x 2
#          Answer.y                               n
#          <chr>                              <int>
# 1 Hey <firstname>, you will find it <here>.     2

Note that I included spaces outside of < and > as part of the placeholders. If I didn't do this "Hey, Shekhar" wouldn't have been matched, because of the comma.


edit to address comment :

df1 <- dplyr::mutate(df1, Answer_trimmed = gsub("<.*?>", "", Answer))
res <- fuzzy_left_join(df2, df1, by= c(Answer="Answer_trimmed"), 
                       match_fun = function(x,y) stringdist::stringdist(x, y) / nchar(y) < 0.7)
#   Category.x                     Question.x                                        Answer.x Category.y
# 1      Stat1            Where to get books?     Hey Anil, you will find it at the helpdesk. Stationary
# 2      Stat1 Procedure to order stationary? Hey, Shekhar, you will find it at the helpdesk.       <NA>
#                              Question.y                                Answer.y               Answer_trimmed
# 1 Where do I get stationary items from? Hey <firstname>, you will find it here. Hey , you will find it here.
# 2                                  <NA>                                    <NA>                         <NA>


dplyr::count(res,Answer.y)
# # A tibble: 2 x 2
#   Answer.y                                    n
#   <chr>                                   <int>
# 1 <NA>                                        1
# 2 Hey <firstname>, you will find it here.     1

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM