
String matching irrespective of word order and short forms in R: fuzzy string matching in R

I am new to R and want to compare two strings (addresses) where:

  1. Word order could be different, except for numbers (consecutive numbers need to be in the same order).
  2. Words could at times be in short form, e.g. Street could be St., North West could be North W.
  3. One string could contain an extra word or two (the rest of the words would be the same).
  4. There could sometimes be a space inside a word in one of the strings, e.g. Pitampura -> Pitam pura.

For example:

S1 = QU 23/24 Shalimar Bagh, Pitampura, Street no. 22, delhi

S2 = QU Flat 23/24 Pitam Pura, St. No. 22, Shalimar Bagh, Delhi

So far, I have removed the special characters, extra whitespace, and redundant words from the addresses.

Would a distance measure such as cosine or Levenshtein distance be a good choice? If yes, how can I apply it in R without using any package?

I am not at liberty to install any external packages.

Thanks in advance.

Not a direct answer but an idea: you could split each string into lowercase words, calculate what share of the words in each vector also occurs in the other, and establish some kind of threshold on that score. In R this could be:

S1 <- "QU 23/24 Shalimar Bagh, Pitampura, Street no. 22, delhi"
lcwords1 <- tolower(unlist(strsplit(S1, " ")))

S2 <- "QU Flat 23/24 Pitam Pura, St. No. 22, Shalimar Bagh, Delhi"
lcwords2 <- tolower(unlist(strsplit(S2, " ")))

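# Average of the two directional overlap ratios.
# The division by 2 sits inside the assignment so that `score` stores the average.
score <- (sum(lcwords1 %in% lcwords2) / length(lcwords1) +
          sum(lcwords2 %in% lcwords1) / length(lcwords2)) / 2
score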

This yields a score of

[1] 0.7070707

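where 1 would mean that both vectors contain the same words. Here 7 of the 9 words in S1 occur in S2 and 7 of the 11 words in S2 occur in S1, so the score is (7/9 + 7/11) / 2, roughly 0.707.
You'd very likely need to wrap this in a function which returns the result, see a similar post here.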
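
As a rough sketch of such a wrapper, the following uses only functions that ship with a standard R installation. The names tokenize_address, address_score and max_dist are made up for illustration and are not part of the original answer. The optional fuzzy step uses adist() from the utils package (attached by default), which computes a generalized Levenshtein distance between strings, so no external package is needed. The punctuation handling and the 0.7 threshold are assumptions to be tuned on real data.

```r
# Hypothetical helper: lower-case, turn punctuation into spaces, split into tokens.
tokenize_address <- function(x) {
  x <- tolower(x)
  x <- gsub("[^a-z0-9/]+", " ", x)   # keep letters, digits and "/" (as in 23/24)
  tokens <- strsplit(trimws(x), " +")[[1]]
  tokens[nzchar(tokens)]
}

# Symmetric overlap score, wrapped in a function (hypothetical name).
# max_dist = 0 accepts exact token matches only; a larger value also accepts
# tokens whose Levenshtein distance, computed with adist(), is <= max_dist.
address_score <- function(a, b, max_dist = 0) {
  ta <- tokenize_address(a)
  tb <- tokenize_address(b)
  if (length(ta) == 0 || length(tb) == 0) return(0)
  d <- adist(ta, tb)                      # edit-distance matrix, rows = ta, cols = tb
  hit_a <- apply(d, 1, min) <= max_dist   # tokens of a that have a match in b
  hit_b <- apply(d, 2, min) <= max_dist   # tokens of b that have a match in a
  (mean(hit_a) + mean(hit_b)) / 2
}

S1 <- "QU 23/24 Shalimar Bagh, Pitampura, Street no. 22, delhi"
S2 <- "QU Flat 23/24 Pitam Pura, St. No. 22, Shalimar Bagh, Delhi"

address_score(S1, S2)          # exact token matching
address_score(S1, S2) >= 0.7   # 0.7 is an arbitrary example threshold
```

Edit distance alone will probably not catch abbreviations such as St. for Street or split words such as Pitam pura. Those cases likely need an extra normalization step, for example a small lookup table of known short forms applied before scoring.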
