
Partial or fuzzy match in R

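I want to fuzzy-match two data frames on the 'Answer' column (s1 is the data, s2 is the reference) so that I can get the corresponding question count and category from s2. For example: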

s1 <- data.frame(Category =c("Stationary","TransferRelocationClaim","IMS"),
Question =c( "Where do I get stationary items from?","Process for claiming Transfer relocation allowances.","What is IMS?"),Answer = c("Hey <firstname>, you will find it near helpdesk ","Hey <firstname>, moving to new places can be fun! To claim relocation expense please follow the steps given below- 1. request you to add the code in https://portal.mycompany.com ,enter relocation code ,add. 2. select expenses ,add expense ,other expense ,fill the form ,save ,print (select the print icon).","ims or interview management system is a tool that helps interviewers schedule all the interviews"),
stringsAsFactors = FALSE)

s2 <- data.frame(
Question = c("Where to get books?", "Procedure to order stationary?","I would like to know about my relocation and relocation expenses","tell me about relocation expense claiming","how to claim relocation expense","IMS?"),
Answer = c("Hey Anil, you will find it at the helpdesk.", "Hey, Shekhar, you will find it at the helpdesk.", "hey sonali moving to new places can be fun! to claim relocation expense please follow the steps given below- 1. request you to add the code in https://portal.mycompany.com ,enter relocation code ,add. 2. select expenses ,add expense ,other expense ,fill the form ,save ,print (select the print icon)","hey piyush moving to new places can be fun! to claim relocation expense please follow the steps given below- 1. request you to add the code in https://portal.mycompany.com ,assignments ,enter relocation code ,add. 2. select expenses ,add expense ,other expense ,fill the form ,save ,print (select the print icon). 3. attach the bills to the printout and secure approval sign-off / mail (from the pa support for new joinee relocation claims and the portal approver for existing employees). 4. drop the bills in the portal drop box (the duty manager amp, finance team can confirm the coordinates.", "hey vibha moving to new places can be fun! to claim relocation expense please follow the steps given below- 1. request you to add the code in https://portal.mycompany.com ,assignments ,enter relocation code ,add. 2. select expenses ,add expense ,other expense ,fill the form ,save ,print (select the print icon). 3. attach the bills to the printout and secure approval sign-off / mail from the pa support for new joinee relocation claims and the portal approver for existing employees). 4. drop the bills in the portal drop box (the duty manager amp, finance team can confirm the coordinates", "ims or interview management system is a tool that helps interviewers schedule all the interviews")
, stringsAsFactors = FALSE)  # leading comma: the Answer vector above was missing its separator

library(stringdist)

# normalise the answers: collapse punctuation and spaces, then lower-case
# (the data frames above only have an Answer column, so Response is derived from it)
s1$Response <- tolower(gsub('[[:punct:] ]+', ' ', s1$Answer))
s2$Response <- tolower(gsub('[[:punct:] ]+', ' ', s2$Answer))
# data = s1, lookup = s2
d.matrix <- stringdistmatrix(a = s2$Response, b = s1$Response, useNames="strings",method="cosine", nthread = getOption("sd_num_thread"))

# list of minimum cosine distances
cosines<-apply(d.matrix, 2, min)

# return list of the row number of the minimum value
minlist<-apply(d.matrix, 2, which.min)

# return list of best matching values
matchwith<-s2$Response[minlist]

# the table below contains the best match and its cosine distance
answer<-data.frame(s1$Response, matchwith, cosines)
t11=merge(x=answer,y=s2, by.x="matchwith", by.y="Response", all.x=TRUE)
View(t11)

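The code above produces the table t11. Next, I need to get the Question for s1.Response = 3, i.e. "Process for claiming Transfer relocation allowances.", along with the category name. Please guide me on how to do this.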

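You can try matching with the agrepl function, which lets you set a maximum "distance", defined as the "sum of the transformations required to get from the pattern to the target". I'm going to use sub to take out the material around the flanking angle brackets: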

agrepl(sub("<.+>, ", "", df1$Answer), df2$Answer, 8)
[1]  TRUE  TRUE FALSE

(Note: I modified the second data frame so that it has one non-matching "Answer" value.)
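Since agrepl takes a single pattern at a time, getting a match count per template row means looping over the templates. Here is a minimal base R sketch of that idea. The data frame `templates`, the vector `replies`, and the helper `count_matches` are hypothetical names with made-up miniature data, and the threshold of 8 is simply carried over from the call above, not a tuned value.

```r
# Hypothetical miniature data, not the asker's full s1/s2.
templates <- data.frame(
  Category = c("Stationary", "IMS"),
  Answer = c("Hey <firstname>, you will find it at the helpdesk.",
             "ims or interview management system is a tool"),
  stringsAsFactors = FALSE
)
replies <- c("Hey Anil, you will find it at the helpdesk.",
             "Hey, Shekhar, you will find it at the helpdesk.",
             "ims or interview management system is a tool")

# Strip the "<firstname>, " placeholder from each template.
patterns <- sub("<.+>, ", "", templates$Answer)

# Count how many replies contain an approximate match of one pattern.
count_matches <- function(p) sum(agrepl(p, replies, max.distance = 8))

# One count per template row, stored next to its Category.
templates$n_matches <- vapply(patterns, count_matches, integer(1),
                              USE.NAMES = FALSE)
templates[, c("Category", "n_matches")]
```

Because agrepl looks for an approximate match anywhere inside the target, a long reply can still match a short template; whether that is desirable depends on the data.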

If we modify your first input slightly, we can use the packages fuzzyjoin / dplyr / stringr in the following way:

df1 <- data.frame(
  Category = "Stationary",
  Question = "Where do I get stationary items from?",
  Answer = "Hey <firstname>, you will find it <here>.", # <-notice the change!
  stringsAsFactors = FALSE
)

df2 <- data.frame(
    Category = c("Stat1", "Stat1"),
    Question = c("Where to get books?", "Procedure to order stationary?"),
    Answer = c("Hey Anil, you will find it at the helpdesk.", "Hey, Shekhar, you will find it at the helpdesk."),
    stringsAsFactors = FALSE
  )

We build a regular expression from Answer:

df1 <- dplyr::mutate(
  df1,
  Answer_regex =gsub("([.|()\\^{}+$*?]|\\[|\\])", "\\\\\\1", Answer), # escape special
  Answer_regex = gsub(" *?<.*?> *?",".*?", Answer_regex), # replace place holders by .*?
  Answer_regex = paste0("^",Answer_regex,"$"))  # make sure the match is exact
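To see what those three steps produce, here is a small base R walk-through on a single string. It is only a sketch: it uses grepl instead of stringr::str_detect, and the variable names (`escaped`, `wildcarded`, `answer_regex`, `hits`) are illustrative.

```r
answer <- "Hey <firstname>, you will find it <here>."

# Step 1: backslash-escape regex metacharacters so they match literally.
escaped <- gsub("([.|()\\^{}+$*?]|\\[|\\])", "\\\\\\1", answer)

# Step 2: replace each <placeholder>, together with the spaces next to it,
# by a lazy wildcard.
wildcarded <- gsub(" *?<.*?> *?", ".*?", escaped)

# Step 3: anchor both ends so the whole string has to match.
answer_regex <- paste0("^", wildcarded, "$")
cat(answer_regex, "\n")

# Try the pattern on two candidate answers and one unrelated string.
hits <- grepl(answer_regex,
              c("Hey Anil, you will find it at the helpdesk.",
                "Hey, Shekhar, you will find it at the helpdesk.",
                "What is IMS?"))
hits
```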

We use stringr::str_detect with fuzzyjoin::fuzzy_left_join to find the matches:

res <- fuzzyjoin::fuzzy_left_join(df2, df1, by= c(Answer="Answer_regex"), match_fun = stringr::str_detect )
res
#   Category.x                     Question.x                                        Answer.x Category.y
# 1      Stat1            Where to get books?     Hey Anil, you will find it at the helpdesk. Stationary
# 2      Stat1 Procedure to order stationary? Hey, Shekhar, you will find it at the helpdesk. Stationary
#                              Question.y                                  Answer.y                     Answer_regex
# 1 Where do I get stationary items from? Hey <firstname>, you will find it <here>. ^Hey.*?, you will find it.*?\\.$
# 2 Where do I get stationary items from? Hey <firstname>, you will find it <here>. ^Hey.*?, you will find it.*?\\.$

Then we can count:

dplyr::count(res,Answer.y)
# # A tibble: 1 x 2
#          Answer.y                               n
#          <chr>                              <int>
# 1 Hey <firstname>, you will find it <here>.     2

Note that I included the spaces outside the <> as part of the placeholder. If I had not, "Hey, Shekhar" would not match, because of the comma.
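The effect of the surrounding spaces can be checked directly. This sketch builds the pattern both ways and tests it against the "Hey, Shekhar" answer; it uses base grepl as a stand-in for stringr::str_detect, and the names `with_spaces` and `without_spaces` are illustrative.

```r
template <- "Hey <firstname>, you will find it <here>."
escaped  <- gsub("([.|()\\^{}+$*?]|\\[|\\])", "\\\\\\1", template)

# Placeholder including neighbouring spaces (the approach used above).
with_spaces <- paste0("^", gsub(" *?<.*?> *?", ".*?", escaped), "$")

# Placeholder limited to the angle brackets: the literal space after "Hey" stays.
without_spaces <- paste0("^", gsub("<.*?>", ".*?", escaped), "$")

reply <- "Hey, Shekhar, you will find it at the helpdesk."
match_with    <- grepl(with_spaces, reply)
match_without <- grepl(without_spaces, reply)
c(with_spaces = match_with, without_spaces = match_without)
```

The second pattern requires "Hey" to be followed by a space, so an answer that puts a comma right after "Hey" is rejected.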


Edit to address the comments:

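# (judging by the output below, df1$Answer here is "Hey <firstname>, you will find it here.")
df1 <- dplyr::mutate(df1, Answer_trimmed = gsub("<.*?>", "", Answer))
res <- fuzzyjoin::fuzzy_left_join(df2, df1, by= c(Answer="Answer_trimmed"),
                       match_fun = function(x,y) stringdist::stringdist(x, y) / nchar(y) < 0.7)
res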
#   Category.x                     Question.x                                        Answer.x Category.y
# 1      Stat1            Where to get books?     Hey Anil, you will find it at the helpdesk. Stationary
# 2      Stat1 Procedure to order stationary? Hey, Shekhar, you will find it at the helpdesk.       <NA>
#                              Question.y                                Answer.y               Answer_trimmed
# 1 Where do I get stationary items from? Hey <firstname>, you will find it here. Hey , you will find it here.
# 2                                  <NA>                                    <NA>                         <NA>


dplyr::count(res,Answer.y)
# # A tibble: 2 x 2
#   Answer.y                                    n
#   <chr>                                   <int>
# 1 <NA>                                        1
# 2 Hey <firstname>, you will find it here.     1
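The match function above is a length-normalised edit distance compared against a cutoff. Below is a rough base R sketch of the same idea that swaps in utils::adist (generalised Levenshtein distance) for stringdist::stringdist, so the exact distances may differ from stringdist's default method. The helper name `normalized_distance` is hypothetical, and 0.7 is just the cutoff used above.

```r
# Edit distance from each element of x to the single string y,
# divided by the length of y.
normalized_distance <- function(x, y) {
  as.vector(utils::adist(x, y)) / nchar(y)
}

# Template with the placeholder removed, as in Answer_trimmed.
trimmed <- gsub("<.*?>", "", "Hey <firstname>, you will find it here.")

replies <- c("Hey Anil, you will find it at the helpdesk.",
             "Hey, Shekhar, you will find it at the helpdesk.")

ratio <- normalized_distance(replies, trimmed)
is_match <- ratio < 0.7
data.frame(reply = replies, ratio = round(ratio, 3), is_match = is_match)
```

A fixed cutoff like this is sensitive to how much extra text the reply carries, so it is worth inspecting the ratios on real data before settling on a value.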
