Find similarity between the addresses of each id in R/Python
The data is as follows:
id <- c(1,1,2,1,3,2)
address <- c("ABC Ret1","ABC","NY AB1","XYZ","DEL1","NY AB")
similar_address <- data.frame(id,address)
I want to find the similar addresses within each id and create a new data.frame from them.
Cosine similarity is the preferred measure of similarity between addresses.
Using the function sim.strings from the package qlcMatrix:
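library(qlcMatrix)  # provides sim.strings()

# For the addresses of one id, return c(similar, not similar).
get_count_of_similar_strings <- function(x) {
  x <- as.character(x)  # data.frame() may have stored the addresses as a factor
  # Count matrix entries with similarity >= 0.5, then subtract length(x)
  # to discount each string's match with itself.
  issim <- sum(sim.strings(x) >= 0.5) - length(x)
  isnotsim <- length(x) - issim
  c(issim = issim, isnotsim = isnotsim)
}

# Apply the function to the addresses of each id.
out <- by(similar_address$address,
          similar_address$id,
          get_count_of_similar_strings,
          simplify = TRUE)
# by() returns the groups in sorted id order, so sort the ids to match.
data.frame(id = sort(unique(similar_address$id)), t(sapply(out, I)))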
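
Since the question also allows Python, here is a minimal sketch of the same idea using only the standard library. It is not a port of sim.strings: the choice of lower-cased character bigrams as features, the helper names (bigrams, cosine, count_similar), and the rule "an address is similar if at least one other address of the same id scores >= 0.5" are all assumptions made for illustration.

```python
from collections import Counter, defaultdict
from math import sqrt


def bigrams(s):
    """Count the overlapping character bigrams of a lower-cased string."""
    s = s.lower()
    return Counter(s[i:i + 2] for i in range(len(s) - 1))


def cosine(a, b):
    """Cosine similarity between two bigram count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def count_similar(ids, addresses, threshold=0.5):
    """Per id, return (similar, not similar) address counts.

    Assumption: an address counts as similar when at least one *other*
    address of the same id reaches the threshold.
    """
    groups = defaultdict(list)
    for i, addr in zip(ids, addresses):
        groups[i].append(bigrams(addr))

    result = {}
    for i, vecs in groups.items():
        similar = sum(
            any(cosine(v, w) >= threshold for k, w in enumerate(vecs) if k != j)
            for j, v in enumerate(vecs)
        )
        result[i] = (similar, len(vecs) - similar)
    return result


ids = [1, 1, 2, 1, 3, 2]
addresses = ["ABC Ret1", "ABC", "NY AB1", "XYZ", "DEL1", "NY AB"]
counts = count_similar(ids, addresses)
print(counts)  # {1: (2, 1), 2: (2, 0), 3: (0, 1)}
```

With the sample data, "ABC Ret1" and "ABC" share two of their bigrams and score about 0.53, so id 1 gets two similar addresses and one dissimilar one ("XYZ"). An id with a single address, such as id 3, has no partner to compare against and is reported as not similar. Because this sketch counts addresses rather than matrix entries, its numbers may differ from the R version for groups with more than one similar pair.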