Calculating similarity between two vectors/strings in R
A similar question may already have been asked on this forum, but I think my requirement is unusual. I have a data frame df1 with a variable "WrittenTerms" containing 40,000 observations, and another data frame df2 with a variable "suggestedterms" containing 17,000 observations.
I need to calculate the similarity between each written term and the suggested terms.
df1$WrittenTerms
head pain
lung cancer
abdminal pain

df2$suggestedterms
cardio attack
breast cancer
abdomen pain
head ache
lung cancer
I need to get output like the following:
df1$WrittenTerms   df2$suggestedterms   Similarity_percentage
head pain          head ache            50%
lung cancer        lung cancer          100%
abdminal pain      abdomen pain         80%
I wrote the code below to meet this requirement, but it is slow because it uses a for loop. Is there a way to compute the similarity with TF-IDF, or any other approach that takes less time?
library(stringdist)

df_list <- data.frame(check.names = FALSE)  # empty data frame to collect the best matches
# For each written term, compute its similarity to every suggested term.
for (i in df1$WrittenTerms) {
  df2$oldsim <- stringdist(i, df2$suggestedterms, method = "lv")
  df2$oldsim <- 1 - df2$oldsim / nchar(as.character(df2$suggestedterms))
  # Keep only the best match; use a separate variable so df2 is not overwritten.
  best <- head(df2[order(df2$oldsim, decreasing = TRUE), ], 1)
  df_list <- rbind(df_list, best)
}
df1 <- cbind(df1, df_list)
R's built-in adist function (in the utils package, which is loaded by default) gives you Levenshtein distances between two character vectors, returning a matrix of distances for each pair of entries. You could write a function that converts the Levenshtein metric into your transformation:
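# adist() returns a length(x) by length(y) matrix; sweep() divides column j by nchar(y[j]).
my_dist <- function(x, y) 1 - sweep(adist(x, y), 2, nchar(as.character(y)), "/")
x <- my_dist(df1$WrittenTerms, df2$suggestedterms)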
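To see what this similarity matrix looks like, here is a minimal sketch on the sample terms from the question. The vector names written and suggested are illustrative, not part of the original data frames, and the normalization divides each column by the length of the corresponding suggested term.

```r
# Sample terms copied from the question; the vector names are illustrative.
written   <- c("head pain", "lung cancer", "abdminal pain")
suggested <- c("cardio attack", "breast cancer", "abdomen pain",
               "head ache", "lung cancer")

d <- adist(written, suggested)   # 3 x 5 matrix of Levenshtein distances

# Divide column j by the length of suggested[j], then convert to a similarity.
sim <- 1 - sweep(d, 2, nchar(suggested), "/")
dimnames(sim) <- list(written, suggested)
round(sim, 2)
```

Identical strings score 1. Note that this metric will not reproduce the illustrative percentages in the question exactly: "head pain" against "head ache" is 4 edits over 9 characters, so it scores about 0.56 rather than 50%.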
Now obtain the maximum value of your metric for each row of x, which will be the best suggestedterms match for each WrittenTerms entry:
mx <- apply(x, 1, function(y) {mx <- which.max(y); c(y[mx], mx)})
Your final desired data frame could then be constructed as follows:
data.frame(Written.Terms = df1$WrittenTerms,
suggestedterms = df2$suggestedterms[mx[2, ]],
Similarity_percentage = mx[1, ])
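One practical caveat: with 40,000 written terms and 17,000 suggested terms, the full matrix has 680 million entries, which is roughly 5.4 GB as doubles. If that does not fit in memory, the same idea can be applied to the written terms in chunks. The sketch below is one possible way to do that; best_matches and chunk_size are hypothetical names introduced here, not part of the original answer.

```r
# Hypothetical helper: best match per written term, computed in chunks so that
# only chunk_size x length(suggested) distances are held in memory at once.
best_matches <- function(written, suggested, chunk_size = 1000) {
  written   <- as.character(written)
  suggested <- as.character(suggested)
  len_s     <- nchar(suggested)
  # Split the row indices into consecutive groups of at most chunk_size.
  chunks <- split(seq_along(written),
                  ceiling(seq_along(written) / chunk_size))
  pieces <- lapply(chunks, function(idx) {
    sim  <- 1 - sweep(adist(written[idx], suggested), 2, len_s, "/")
    best <- apply(sim, 1, which.max)   # column index of each row's maximum
    data.frame(Written.Terms = written[idx],
               suggestedterms = suggested[best],
               Similarity_percentage = sim[cbind(seq_along(idx), best)])
  })
  do.call(rbind, pieces)
}

# Usage on the sample terms from the question (chunk_size = 2 forces two chunks).
res <- best_matches(c("head pain", "lung cancer", "abdminal pain"),
                    c("cardio attack", "breast cancer", "abdomen pain",
                      "head ache", "lung cancer"),
                    chunk_size = 2)
```

The similarity column is a fraction between 0 and 1 for the best match; multiply by 100 if a percentage is wanted. Smaller chunks use less memory at the cost of more calls to adist.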