
Find similarity between addresses within each id in R/Python

The data is as follows:

id <- c(1,1,2,1,3,2)
address <- c("ABC Ret1","ABC","NY AB1","XYZ","DEL1","NY AB")
similar_address <- data.frame(id,address)

I want to find the similar addresses within each id and build a new data.frame from the result. (The expected output was attached as an image.)

Cosine similarity is the preferred measure of similarity among the addresses.
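Since Python is also acceptable here, the following is a minimal sketch of cosine similarity between two strings using only the standard library. It assumes each string is represented by counts of overlapping, lower-cased character bigrams; the function names `bigrams` and `cosine_similarity` are illustrative, and the values need not match those of any particular R package.

```python
from collections import Counter
from math import sqrt


def bigrams(text):
    """Count overlapping character bigrams in a lower-cased string."""
    s = text.lower()
    return Counter(s[i:i + 2] for i in range(len(s) - 1))


def cosine_similarity(a, b):
    """Cosine of the angle between the bigram-count vectors of a and b."""
    va, vb = bigrams(a), bigrams(b)
    # Only bigrams present in both strings contribute to the dot product.
    dot = sum(va[g] * vb[g] for g in va.keys() & vb.keys())
    norm_a = sqrt(sum(c * c for c in va.values()))
    norm_b = sqrt(sum(c * c for c in vb.values()))
    # Strings shorter than two characters have no bigrams; treat as dissimilar.
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


print(cosine_similarity("NY AB1", "NY AB"))
print(cosine_similarity("ABC", "XYZ"))
```

Strings that share most of their bigrams score close to 1, and strings with no bigram in common score 0, so a threshold such as 0.5 can separate "similar" from "not similar".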

Using the function sim.strings from the package qlcMatrix:

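library(qlcMatrix)  # provides sim.strings()

get_count_of_similar_strings = function(x){
  # Count ordered pairs of distinct addresses with similarity >= 0.5;
  # subtracting length(x) drops each string's comparison with itself
  issim = sum(sim.strings(as.character(x)) >= .5) - length(x)
  isnotsim = length(x) - issim
  c(issim, isnotsim)
}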

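# Apply the counting function to the addresses of each id
out = by(similar_address$address,
         similar_address$id,
         get_count_of_similar_strings,
         simplify = TRUE)

# by() orders the groups by sorted id, so take the ids from names(out)
data.frame(id = as.numeric(names(out)), t(sapply(out, I)))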
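For comparison, here is a rough Python sketch of the same per-id grouping idea. It is not a port of the R code: it assumes a bigram-based cosine similarity written by hand, and it counts, for each id, the addresses that have at least one other address in the same group with similarity of 0.5 or more. The names `bigram_cosine` and `count_similar` are hypothetical.

```python
from collections import Counter, defaultdict
from math import sqrt


def bigram_cosine(a, b):
    """Cosine similarity of lower-cased character-bigram counts (an assumed measure)."""
    va = Counter(a.lower()[i:i + 2] for i in range(len(a) - 1))
    vb = Counter(b.lower()[i:i + 2] for i in range(len(b) - 1))
    dot = sum(va[g] * vb[g] for g in va.keys() & vb.keys())
    norm = sqrt(sum(c * c for c in va.values())) * sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0


def count_similar(ids, addresses, threshold=0.5):
    """Map each id to (addresses with a similar partner, addresses without one)."""
    groups = defaultdict(list)
    for key, addr in zip(ids, addresses):
        groups[key].append(addr)

    result = {}
    for key, group in sorted(groups.items()):
        # An address counts as similar if any *other* address in its group
        # reaches the threshold.
        similar = sum(
            any(bigram_cosine(a, b) >= threshold
                for k, b in enumerate(group) if k != j)
            for j, a in enumerate(group)
        )
        result[key] = (similar, len(group) - similar)
    return result


ids = [1, 1, 2, 1, 3, 2]
addresses = ["ABC Ret1", "ABC", "NY AB1", "XYZ", "DEL1", "NY AB"]
for key, (sim, notsim) in count_similar(ids, addresses).items():
    print(key, sim, notsim)
```

Counting addresses rather than pairs keeps the "not similar" count from going negative when a group contains three or more mutually similar addresses.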
