
Match strings by row but ignore character order or special characters

I have data like this:

library(dplyr)  

Data <- tibble(
      Name1 = c("PlaceA, PlaceB & PlaceC", "PlaceD and PlaceE", "PlaceF.", "PlaceG & PlaceH", "Place K-Place L", "Place M and Place N","PlaceP-PlaceQ"),
      Name2 = c("PlaceB, PlaceA & PlaceC", "PlaceD & PlaceE", "PlaceF","PlaceG & PlaceJ", "Place L-Place K", "Place N and Place M","PlaceP-PlaceR")) 
  

I would like to compare the two columns row by row to see if they are the same, but ignoring 1) the order of the words, 2) the characters used to separate the words, and 3) whether an '&' has been used instead of 'and'.

The desired output looks like this:

Data %>% mutate(Match = c("TRUE","TRUE","TRUE","FALSE","TRUE","TRUE","FALSE"))

I'm sure there must be a way of using stringr to do this, but I can't find it.

Edit: @akrun noticing that I had made a typo in my dummy data made me think about typos in my real data. If there is only a one-letter difference (either an additional letter or a mistyped letter in the word), then the names are probably the same and should match. If a word has the same letters but in a different order, it shouldn't match. Something like this:

Mispellings <- tibble(
      Name1 = c("Location","Place","Racecar"),
      Name2 = c("Locatione","Pluce","Carrace"),
      Match = c("TRUE", "TRUE", "FALSE"))

Can any solution for my original question also deal with this additional scenario?

One option is to split each string into a list of names, sort each one, and then compare the list elements:

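# Split each name on the separators (",", "&", ".", "-" or "and"), then sort the pieces
lst1 <- lapply(strsplit(Data$Name1, "\\s*[,&.-]\\s*|\\s*and\\s*"), sort)
lst2 <- lapply(strsplit(Data$Name2, "\\s*[,&.-]\\s*|\\s*and\\s*"), sort)
# identical() avoids the recycling that all(x == y) does when the piece counts differ
mapply(identical, lst1, lst2)
[1]  TRUE  TRUE  TRUE FALSE  TRUE  TRUE FALSE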
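
The split-and-sort idea can also be wrapped in a small helper so it is applied to both columns in one place. This is only a sketch: `split_places` and `same_places` are hypothetical names, not part of any package, and the pattern uses `\\s+and\\s+` (whitespace required around "and") on the assumption that a real place name might itself contain the letters "and".

```r
# Hypothetical helper: split each string into place names and sort them.
split_places <- function(x) {
  parts <- strsplit(x, "\\s*[,&.-]\\s*|\\s+and\\s+")
  # Drop empty pieces and stray whitespace before sorting
  lapply(parts, function(p) sort(trimws(p[nzchar(p)])))
}

# Hypothetical helper: TRUE where both strings contain the same sorted names.
same_places <- function(a, b) {
  mapply(identical, split_places(a), split_places(b))
}

name1 <- c("PlaceA, PlaceB & PlaceC", "PlaceD and PlaceE", "PlaceF.",
           "PlaceG & PlaceH", "Place K-Place L", "Place M and Place N",
           "PlaceP-PlaceQ")
name2 <- c("PlaceB, PlaceA & PlaceC", "PlaceD & PlaceE", "PlaceF",
           "PlaceG & PlaceJ", "Place L-Place K", "Place N and Place M",
           "PlaceP-PlaceR")

match <- same_places(name1, name2)
```

With the question's tibble, the same call could be used as `Data$Match <- same_places(Data$Name1, Data$Name2)`.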

Or use setequal, which compares the pieces as sets, so neither order nor repeated names matter:

do.call(mapply, c(FUN = setequal, unname(lapply(Data, 
    function(x) strsplit(x, "\\s*[,&.-]\\s*|\\s*and\\s*")))))
[1]  TRUE  TRUE  TRUE FALSE  TRUE  TRUE FALSE
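
Neither approach above handles the one-letter typos described in the edit to the question. One possible direction, offered as a sketch and not as part of the original answer, is base R's `adist()`, which computes the Levenshtein edit distance between strings. The helper name `fuzzy_match` and the threshold `max_dist = 1` are assumptions chosen to fit the example in the question.

```r
# Hypothetical helper: TRUE when two words differ by at most `max_dist`
# single-character insertions, deletions or substitutions.
fuzzy_match <- function(a, b, max_dist = 1) {
  # adist() (utils package) returns a distance matrix; compare pairwise instead
  d <- mapply(function(x, y) as.vector(adist(x, y)), a, b, USE.NAMES = FALSE)
  d <= max_dist
}

name1 <- c("Location", "Place", "Racecar")
name2 <- c("Locatione", "Pluce", "Carrace")

fuzzy <- fuzzy_match(name1, name2)
```

Combining this with the split step would mean pairing up the pieces of each row before comparing them, and sorting alone may misalign the pairs when the typo is in the first letter, so that part is left open here.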
