Calculating similarity between two vectors/strings in R

A similar question may already have been asked on this forum, but I think my requirement is unusual. I have a data frame df1 with a variable "WrittenTerms" containing 40,000 observations, and another data frame df2 with a variable "suggestedterms" containing 17,000 observations.

I need to calculate the similarity between each written term and the suggested terms.

df1$WrittenTerms

head pain
lung cancer
abdminal pain

df2$suggestedterms

cardio attack
breast cancer
abdomen pain
head ache
lung cancer

I need output like the following:

df1$WrittenTerms   df2$suggestedterms   Similarity_percentage
head pain          head ache            50%
lung cancer        lung cancer          100%
abdminal pain      abdomen pain         80%

I wrote the code below to do this, but it is slow because it uses a for loop. Is there a way to compute the similarity using TF-IDF, or any other approach that takes less time?

library(stringdist)

df_list <- data.frame(check.names = FALSE) # Creating empty dataframe

# calculating similarity between strings.
for (i in df1$WrittenTerms) {
  scored <- df2  # work on a copy so df2 is not shrunk to one row after the first pass
  scored$oldsim <- stringdist(i, scored$suggestedterms, method = "lv")
  scored$oldsim <- 1 - scored$oldsim / nchar(as.character(scored$suggestedterms))
  # keep only the best-scoring suggested term for this written term
  df_list <- rbind(df_list, head(scored[order(scored$oldsim, decreasing = TRUE), ], 1))
}

df1 <- cbind(df1, df_list)

The adist function from base R (in the utils package) computes Levenshtein distances between two character vectors and returns a matrix with one distance per pair of entries. You could write a function that converts the Levenshtein distance into your similarity measure:

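# adist() returns a length(x) by length(y) matrix; sweep() divides column j by nchar(y[j]).
my_dist <- function(x, y) 1 - sweep(adist(x, y), 2, nchar(as.character(y)), "/")
x <- my_dist(df1$WrittenTerms, df2$suggestedterms)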
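
A small check on the sample terms from the question may make the shape of this matrix concrete. This is only a sketch: the helper name similarity_matrix is illustrative and not part of any package. Because adist returns one row per written term and one column per suggested term, the lengths of the suggested terms have to be divided column by column. sweep does that explicitly, whereas a bare `/` would recycle the lengths down the columns.

```r
# Sample terms from the question.
written   <- c("head pain", "lung cancer", "abdminal pain")
suggested <- c("cardio attack", "breast cancer", "abdomen pain",
               "head ache", "lung cancer")

# Hypothetical helper: 1 - Levenshtein distance / length of the suggested term.
similarity_matrix <- function(x, y) {
  x <- as.character(x)
  y <- as.character(y)
  # MARGIN = 2 divides column j by nchar(y[j]).
  sim <- 1 - sweep(adist(x, y), 2, nchar(y), "/")
  dimnames(sim) <- list(x, y)
  sim
}

sim <- similarity_matrix(written, suggested)
round(sim, 2)
```

Each row of sim belongs to one written term and each column to one suggested term, so identical strings such as "lung cancer" score exactly 1.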

Now find the maximum of the metric in each row of x; its position gives the best suggested term for each entry of WrittenTerms:

mx <- apply(x, 1, function(y) {mx <- which.max(y); c(y[mx], mx)})

The final data frame can then be constructed as follows:

data.frame(Written.Terms = df1$WrittenTerms, 
           suggestedterms = df2$suggestedterms[mx[2, ]], 
           Similarity_percentage = mx[1, ])
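
With 40,000 written terms and 17,000 suggested terms, the full matrix has 680 million entries, roughly 5 GB as doubles, which may not fit in memory. One possible workaround, sketched below under that assumption, is to process the written terms in chunks. The function name best_matches and the chunk_size parameter are hypothetical, and the similarity is reported on a 0 to 100 scale here rather than as a fraction.

```r
# Hypothetical helper: for each written term, find the suggested term with
# the highest similarity. Only a chunk_size x length(suggested) matrix is
# held in memory at any one time.
best_matches <- function(written, suggested, chunk_size = 1000) {
  written   <- as.character(written)
  suggested <- as.character(suggested)
  len_sug   <- nchar(suggested)

  # Split the row indices into consecutive chunks.
  chunks <- split(seq_along(written),
                  ceiling(seq_along(written) / chunk_size))

  results <- lapply(chunks, function(idx) {
    # One row per written term in the chunk, one column per suggested term.
    d   <- adist(written[idx], suggested)
    # Divide each column by the length of the corresponding suggested term.
    sim <- 1 - sweep(d, 2, len_sug, "/")
    # Column index of the best match in each row (first one on ties).
    best <- max.col(sim, ties.method = "first")
    data.frame(WrittenTerms          = written[idx],
               suggestedterms        = suggested[best],
               Similarity_percentage = 100 * sim[cbind(seq_along(idx), best)],
               stringsAsFactors      = FALSE)
  })

  out <- do.call(rbind, results)
  rownames(out) <- NULL
  out
}

# Sample data from the question; chunk_size = 2 just exercises the chunking.
df1 <- data.frame(WrittenTerms = c("head pain", "lung cancer", "abdminal pain"))
df2 <- data.frame(suggestedterms = c("cardio attack", "breast cancer",
                                     "abdomen pain", "head ache", "lung cancer"))
result <- best_matches(df1$WrittenTerms, df2$suggestedterms, chunk_size = 2)
result
```

A smaller chunk_size lowers peak memory at the cost of more calls to adist; the total number of distance computations stays the same.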
