String matching irrespective of word order and short forms in R: fuzzy string matching in R
I am new to R and want to compare two strings (addresses) where:
One of the strings can sometimes contain a space inside a word, e.g. Pitampura -> Pitam pura.
For example:
S1 = QU 23/24 Shalimar Bagh, Pitampura, Street no. 22, delhi
S2 = QU Flat 23/24 Pitam Pura, St. No. 22, Shalimar Bagh, Delhi
So far, I have removed the special characters, whitespace, and redundant words from the addresses.
Would a distance measure such as cosine or Levenshtein distance be a good choice? If so, how can I apply it in R without using any package?
I am not at liberty to install any external packages.
Thanks in advance.
Not a direct answer but an idea: you could split both strings into lowercase words, calculate the share of words from each string that also occur in the other, and establish some kind of threshold on that score. In R this could be:
# Split each address on spaces and lowercase the words
S1 <- "QU 23/24 Shalimar Bagh, Pitampura, Street no. 22, delhi"
lcwords1 <- tolower(unlist(strsplit(S1, " ")))
S2 <- "QU Flat 23/24 Pitam Pura, St. No. 22, Shalimar Bagh, Delhi"
lcwords2 <- tolower(unlist(strsplit(S2, " ")))
# Average the share of words from each address that also occur in the other
score <- (sum(lcwords1 %in% lcwords2) / length(lcwords1) +
          sum(lcwords2 %in% lcwords1) / length(lcwords2)) / 2
score
This yields a score of
[1] 0.7070707
where a score of 1 would mean that both vectors contain the same words.
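On the Levenshtein part of the question: base R ships `adist()` in the bundled `utils` package, so no external package has to be installed. The following is only a sketch under some assumptions of mine: `normalise()` is a hypothetical helper, and sorting the words before comparing is one possible way to make the distance insensitive to word order, not the only one. Scaling by the longer string's length is likewise just one way to turn the distance into a 0 to 1 similarity.

```r
# adist() comes from 'utils', which is part of every standard R installation.
s1 <- "QU 23/24 Shalimar Bagh, Pitampura, Street no. 22, delhi"
s2 <- "QU Flat 23/24 Pitam Pura, St. No. 22, Shalimar Bagh, Delhi"

# Hypothetical helper: lowercase, turn punctuation into spaces,
# then sort the words so that word order no longer matters.
normalise <- function(x) {
  x <- tolower(gsub("[^[:alnum:] ]", " ", x))
  words <- strsplit(trimws(x), " +")[[1]]
  paste(sort(words), collapse = " ")
}

n1 <- normalise(s1)
n2 <- normalise(s2)

# Levenshtein distance between the two normalised strings
d <- adist(n1, n2)[1, 1]

# Scale to a similarity between 0 and 1 (1 = identical after normalising)
similarity <- 1 - d / max(nchar(n1), nchar(n2))
similarity
```

Because the comparison is character based, a split such as "pitam pura" versus "pitampura" costs only one edit, whereas the word-overlap score treats those as entirely different words. Short forms such as "St." for "Street" still count as several edits unless they are expanded beforehand.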
You will very likely need to wrap this score calculation in a function that returns the result; see a similar post here.
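A minimal sketch of such a wrapper, using base R only, might look like the following. The name `word_overlap` and the `strip_punct` argument are my own assumptions, not part of the original answer, and the 0.7 threshold at the end is arbitrary and would need tuning on real address pairs.

```r
# Hypothetical helper: symmetric word-overlap score between two strings.
# Returns a value between 0 and 1, or NA if either string has no words.
word_overlap <- function(a, b, strip_punct = FALSE) {
  tokens <- function(x) {
    # Optionally replace punctuation with spaces before splitting
    if (strip_punct) x <- gsub("[[:punct:]]", " ", x)
    w <- tolower(unlist(strsplit(x, " ")))
    w[nzchar(w)]  # drop empty tokens caused by repeated spaces
  }
  wa <- tokens(a)
  wb <- tokens(b)
  if (length(wa) == 0 || length(wb) == 0) return(NA_real_)
  # mean() of a logical vector is the share of TRUE values
  (mean(wa %in% wb) + mean(wb %in% wa)) / 2
}

S1 <- "QU 23/24 Shalimar Bagh, Pitampura, Street no. 22, delhi"
S2 <- "QU Flat 23/24 Pitam Pura, St. No. 22, Shalimar Bagh, Delhi"

word_overlap(S1, S2)                      # same calculation as above
word_overlap(S1, S2, strip_punct = TRUE)  # compares words without punctuation
word_overlap(S1, S2) >= 0.7               # example threshold check
```

With `strip_punct = FALSE` the function reproduces the calculation shown earlier; setting it to `TRUE` stops trailing commas and full stops from making otherwise identical words look different.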