R: Comparing Text Similarity between Neighbour Strings
I am trying to compare texts in a column to measure their similarity in terms of adjacent letters: how many substitutions (swaps) of two adjacent letters are necessary to make both texts the same.
Example: JANE-JNAE (1 - AN/NA), MARY-MART (0), CLERA-LCREA (2 - CL/LC & ER/RE)
I have tried the stringdist methods, but they do not solve my problem.
Since I am new to R, I could not write efficient code to show here:
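substition <- function(text1, text2){
  # identical texts need no further comparison
  if(text1 == text2){
    return(TRUE)
  }
  # texts of different length cannot be compared letter by letter
  if(nchar(text1) != nchar(text2)){
    return(FALSE)
  }
  # split each text into single characters
  vec1 <- strsplit(text1, split="")[[1]]
  vec2 <- strsplit(text2, split="")[[1]]
  # (can't go on)
}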
But to illustrate, the data is something like this:
df$NO df$names
1 JANE
2 MARY
3 CLERA
4 JNAE
5 LCREA
6 MART
and the desired output is:
df$NO df$names df$substition
1 JANE 1
2 MARY 0
3 CLERA 2
4 JNAE 1
5 LCREA 2
6 MART 0
You can use the Levenshtein distance ( https://en.wikipedia.org/wiki/Levenshtein_distance ) between strings. The distance gives the minimal number of insertions, deletions and substitutions needed to transform one string into another.
Usage
adist(
c("lazy", "lasso", "lassie"),
c("lazy", "lazier", "laser")
)
Returns a 3x3 matrix of distances:
## [,1] [,2] [,3]
## [1,] 0 3 3
## [2,] 3 4 2
## [3,] 4 3 3
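Note that Levenshtein distance does not count swaps directly: a swapped adjacent pair such as AN/NA costs 2 edits rather than 1, and MARY/MART costs 1 rather than 0. The following is a minimal base-R sketch, not part of the original answer, that counts swapped adjacent pairs instead. The helper name `count_adjacent_swaps` is hypothetical, and the rule for picking each name's neighbour (the nearest other name by `adist`) is an assumption, since the question does not say how names are paired.

```r
# Count positions where two adjacent letters of `a` appear swapped in `b`.
# Assumption: strings of different length are not comparable and give NA.
count_adjacent_swaps <- function(a, b) {
  x <- strsplit(a, split = "")[[1]]
  y <- strsplit(b, split = "")[[1]]
  if (length(x) != length(y)) {
    return(NA_integer_)
  }
  n <- length(x)
  swaps <- 0L
  i <- 1L
  while (i < n) {
    # letters i and i+1 differ and are exchanged in the other string
    if (x[i] != x[i + 1L] && x[i] == y[i + 1L] && x[i + 1L] == y[i]) {
      swaps <- swaps + 1L
      i <- i + 2L  # skip the pair so swaps do not overlap
    } else {
      i <- i + 1L
    }
  }
  swaps
}

df <- data.frame(
  NO = 1:6,
  names = c("JANE", "MARY", "CLERA", "JNAE", "LCREA", "MART"),
  stringsAsFactors = FALSE
)

# Assumption: each name is paired with its nearest other name by Levenshtein distance.
d <- adist(df$names)
diag(d) <- Inf                      # never pair a name with itself
nearest <- apply(d, 1, which.min)

df$substition <- mapply(
  count_adjacent_swaps,
  df$names, df$names[nearest],
  USE.NAMES = FALSE
)
print(df)  # substition column: 1 0 2 1 2 0
```

The scan only counts non-overlapping swapped pairs at matching positions and ignores every other difference, which is why MARY/MART yields 0. If other kinds of edits should also be penalised, the count would need to be combined with a distance such as the one from `adist`.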