簡體   English   中英

R中的部分或模糊匹配

[英]Partial or fuzzy match in R

我想基於'Answer'列對2個數據幀進行模糊匹配(s1是數據,s2是引用),以便從s2獲得相應的問題計數和類別。 例如:

s1 <- data.frame(Category =c("Stationary","TransferRelocationClaim","IMS"),
Question =c( "Where do I get stationary items from?","Process for claiming Transfer relocation allowances.","What is IMS?"),Answer = c("Hey <firstname>, you will find it near helpdesk ","Hey <firstname>, moving to new places can be fun! To claim relocation expense please follow the steps given below- 1. request you to add the code in https://portal.mycompany.com ,enter relocation code ,add. 2. select expenses ,add expense ,other expense ,fill the form ,save ,print (select the print icon).","ims or interview management system is a tool that helps interviewers schedule all the interviews"),
stringsAsFactors = FALSE)

s2 <- data.frame(
Question = c("Where to get books?", "Procedure to order stationary?","I would like to know about my relocation and relocation expenses","tell me about relocation expense claiming","how to claim relocation expense","IMS?"),
Answer = c("Hey Anil, you will find it at the helpdesk.", "Hey, Shekhar, you will find it at the helpdesk.", "hey sonali moving to new places can be fun! to claim relocation expense please follow the steps given below- 1. request you to add the code in https://portal.mycompany.com ,enter relocation code ,add. 2. select expenses ,add expense ,other expense ,fill the form ,save ,print (select the print icon)","hey piyush moving to new places can be fun! to claim relocation expense please follow the steps given below- 1. request you to add the code in https://portal.mycompany.com ,assignments ,enter relocation code ,add. 2. select expenses ,add expense ,other expense ,fill the form ,save ,print (select the print icon). 3. attach the bills to the printout and secure approval sign-off / mail (from the pa support for new joinee relocation claims and the portal approver for existing employees). 4. drop the bills in the portal drop box (the duty manager amp, finance team can confirm the coordinates.", "hey vibha moving to new places can be fun! to claim relocation expense please follow the steps given below- 1. request you to add the code in https://portal.mycompany.com ,assignments ,enter relocation code ,add. 2. select expenses ,add expense ,other expense ,fill the form ,save ,print (select the print icon). 3. attach the bills to the printout and secure approval sign-off / mail from the pa support for new joinee relocation claims and the portal approver for existing employees). 4. drop the bills in the portal drop box (the duty manager amp, finance team can confirm the coordinates", "ims or interview management system is a tool that helps interviewers schedule all the interviews")
stringsAsFactors = FALSE)

s1$Response=gsub('[[:punct:] ]+',' ',s1$Response)
s2$Response=gsub('[[:punct:] ]+',' ',s2$Response)
s1$Response <- tolower(s1$Response)
s2$Response <- tolower(s2$Response)
s1$Response<-as.character(s1$Response)
s2$Response<-as.character(s2$Response)
# data =s1, lookup=s2
d.matrix <- stringdistmatrix(a = s2$Response, b = s1$Response, useNames="strings",method="cosine", nthread = getOption("sd_num_thread"))

#list of minimun cosines
cosines<-apply(d.matrix, 2, min)

#return list of the row number of the minimum value
minlist<-apply(d.matrix, 2, which.min) 

#return list of best matching values
matchwith<-s2$Response[minlist]

#below table contains best match and cosines
answer<-data.frame(s1$Response, matchwith, cosines)
t11=merge(x=answer,y=s2, by.x="matchwith", by.y="Response", all.x=TRUE)
View(t11)`

t11表如下圖所示 接下來,我必須得到s1.Response = 3的問題:申請轉移搬遷津貼的流程? 以及類別名稱。 請指導我如何做到這一點。

您可以嘗試使用agrepl函數進行匹配,該函數允許您設置最大“距離”,這是“從模式到目標所需的變換的總和。我將使用sub取出側翼尖括號周圍的材料:

agrepl(sub("<.+>, ", "", df1$Answer), df2$Answer, 8)
[1]  TRUE  TRUE FALSE

(注意:因為我修改了第二個數據幀,因此它有一個不匹配的“回答”值。

如果我們稍微修改你的第一個輸入,我們可以通過以下方式使用包fuzzyjoin / dplyr / stringr

df1 <- data.frame(
  Category = "Stationary",
  Question = "Where do I get stationary items from?",
  Answer = "Hey <firstname>, you will find it <here>.", # <-notice the change!
  stringsAsFactors = FALSE
)

df2 <- data.frame(
    Category = c("Stat1", "Stat1"),
    Question = c("Where to get books?", "Procedure to order stationary?"),
    Answer = c("Hey Anil, you will find it at the helpdesk.", "Hey, Shekhar, you will find it at the helpdesk."),
    stringsAsFactors = FALSE
  )

我們從Answer制作正則表達式:

df1 <- dplyr::mutate(
  df1,
  Answer_regex =gsub("([.|()\\^{}+$*?]|\\[|\\])", "\\\\\\1", Answer), # escape special
  Answer_regex = gsub(" *?<.*?> *?",".*?", Answer_regex), # replace place holders by .*?
  Answer_regex = paste0("^",Answer_regex,"$"))  # make sure the match is exact

我們使用stringr::str_detectfuzzyjoin::fuzzy_left_join來查找匹配項:

res <- fuzzyjoin::fuzzy_left_join(df2, df1, by= c(Answer="Answer_regex"), match_fun = stringr::str_detect )
res
#   Category.x                     Question.x                                        Answer.x Category.y
# 1      Stat1            Where to get books?     Hey Anil, you will find it at the helpdesk. Stationary
# 2      Stat1 Procedure to order stationary? Hey, Shekhar, you will find it at the helpdesk. Stationary
#                              Question.y                                  Answer.y                     Answer_regex
# 1 Where do I get stationary items from? Hey <firstname>, you will find it <here>. ^Hey.*?, you will find it.*?\\.$
# 2 Where do I get stationary items from? Hey <firstname>, you will find it <here>. ^Hey.*?, you will find it.*?\\.$

然后我們可以算:

dplyr::count(res,Answer.y)
# # A tibble: 1 x 2
#          Answer.y                               n
#          <chr>                              <int>
# 1 Hey <firstname>, you will find it <here>.     2

請注意,我在<>之外包含空格作為占位符的一部分。 如果我不這樣做, "Hey, Shekhar"就不會匹配,因為逗號。


編輯以發表評論:

df1 <- dplyr::mutate(df1, Answer_trimmed = gsub("<.*?>", "", Answer))
res <- fuzzy_left_join(df2, df1, by= c(Answer="Answer_trimmed"), 
                       match_fun = function(x,y) stringdist::stringdist(x, y) / nchar(y) < 0.7)
#   Category.x                     Question.x                                        Answer.x Category.y
# 1      Stat1            Where to get books?     Hey Anil, you will find it at the helpdesk. Stationary
# 2      Stat1 Procedure to order stationary? Hey, Shekhar, you will find it at the helpdesk.       <NA>
#                              Question.y                                Answer.y               Answer_trimmed
# 1 Where do I get stationary items from? Hey <firstname>, you will find it here. Hey , you will find it here.
# 2                                  <NA>                                    <NA>                         <NA>


dplyr::count(res,Answer.y)
# # A tibble: 2 x 2
#   Answer.y                                    n
#   <chr>                                   <int>
# 1 <NA>                                        1
# 2 Hey <firstname>, you will find it here.     1

暫無
暫無

聲明:本站的技術帖子網頁,遵循CC BY-SA 4.0協議,如果您需要轉載,請注明本站網址或者原文地址。任何問題請咨詢:yoyou2525@163.com.

 
粵ICP備18138465號  © 2020-2024 STACKOOM.COM