Finding the cosine similarity between different pandas DataFrames

Question

I have three pandas DataFrames: group_1, group_2, and group_3.

import pandas as pd

group_1 = pd.DataFrame({'A': [1, 0, 1, 1, 1], 'B': [1, 1, 1, 1, 1]})
group_2 = pd.DataFrame({'A': [1, 1, 1, 1, 1], 'B': [1, 1, 0, 0, 0]})
group_3 = pd.DataFrame({'A': [1, 1, 1, 1, 1], 'B': [0, 0, 0, 0, 0]})

These are filled with dummy values; all values in the groups above are binary.

Now there is another DataFrame, a new one:

new_data_frame = pd.DataFrame({'A': [1, 1, 1, 1, 1], 'B': [0, 0, 0, 0, 0], 'mobile': ['xxxxx', 'yyyyy', 'zzzzz', 'wwwww', 'mmmmmm']})
new_data_frame.set_index('mobile')  # the column is named 'mobile', not 'mobilenumber'

        A  B
mobile      
xxxxx   1  0
yyyyy   1  0
zzzzz   1  0
wwwww   1  0
mmmmmm  1  0

For each mobile in new_data_frame, I want to calculate the mean cosine similarity against each group (sum the scores against every row of the group and divide by the number of rows in that group DataFrame). Each mobile number should then be assigned to the group for which it has the highest mean score.

So my expected output would be:

   mobile group
   xxxxx  group_1
   yyyyy  group_1
   zzzzz  group_3

Something like this:

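from sklearn.metrics.pairwise import cosine_similarity

for x in new_data_frame[['A', 'B']].to_numpy():
    score = []
    for y in group_1.to_numpy():
        # cosine_similarity expects 2-D inputs, so wrap each row in a list
        a = cosine_similarity([x], [y])[0, 0]
        score.append(a)
    # divide by the number of rows in the group, not by the length of one row
    mean_score = sum(score) / len(group_1)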
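For reference, the per-group mean that the loop above is aiming for can be written out with NumPy alone. This is only a sketch to make the definition concrete: the helper name `mean_cosine_similarity` is hypothetical, the group is passed as plain rows rather than a DataFrame, and treating zero-norm rows as similarity 0 is an assumption of this sketch rather than something stated in the question.

```python
import numpy as np

def mean_cosine_similarity(row, group_rows):
    """Mean cosine similarity between one vector and every row of a 2-D array."""
    row = np.asarray(row, dtype=float)
    group_rows = np.asarray(group_rows, dtype=float)
    dots = group_rows @ row  # dot product of the vector with each group row
    norms = np.linalg.norm(group_rows, axis=1) * np.linalg.norm(row)
    # Zero-norm rows get similarity 0 here (an assumption of this sketch)
    sims = np.divide(dots, norms, out=np.zeros_like(dots), where=norms > 0)
    return sims.mean()

group_1_rows = [[1, 1], [0, 1], [1, 1], [1, 1], [1, 1]]  # same values as group_1
score = mean_cosine_similarity([1, 0], group_1_rows)
print(round(score, 4))  # 0.5657
```

Four of the five rows in group_1 are [1, 1], each with similarity 1/sqrt(2) to [1, 0], and the remaining row [0, 1] is orthogonal to it, so the mean is 4 / (5 * sqrt(2)).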

I have added the code below. Is there a better way to achieve this?

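from sklearn.metrics.pairwise import cosine_similarity

def max_group(x, group_1, group_2, group_3):
    x_ = x.tolist()
    val = x_[:-1]  # drop the last value ('mobile'); assumes it is still a column, not the index

    group = [group_1, group_2, group_3]

    score = []
    for i in range(len(group)):
        a = cosine_similarity([val], group[i].to_numpy())
        print('<---->')
        print(a.mean())
        score.append((a.mean(), i))

    # max(score) picks the tuple with the highest mean; [1] is that group's position in the list
    return max(score)[1]

new_data_frame['group'] = new_data_frame.apply(lambda x: max_group(x, group_1, group_2, group_3), axis=1)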

Answer 1

Solution

Create a mapping from group names to group DataFrames. Then, inside a dict comprehension, calculate the mean cosine similarity of every row of new_data_frame against each group. Finally, build a new DataFrame from the computed scores and use idxmax to find, for each row, the name of the group with the highest mean similarity score.

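from sklearn.metrics.pairwise import cosine_similarity

# Assumes new_data_frame is indexed by 'mobile' (the result of set_index('mobile')
# assigned back), so that only the numeric columns A and B are compared.
grps = {'group_1': group_1, 'group_2': group_2, 'group_3': group_3}
scores = {k: cosine_similarity(new_data_frame, g).mean(1) for k, g in grps.items()}

pd.DataFrame(scores, index=new_data_frame.index).idxmax(1)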

Result

mobile
xxxxx     group_3
yyyyy     group_3
zzzzz     group_3
wwwww     group_3
mmmmmm    group_3
dtype: object
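To see why every mobile lands in group_3, it can help to look at the intermediate score table before idxmax is applied. The following is a self-contained sketch of the same approach using the data from the question; the names `score_table` and `assigned` are introduced here for illustration only, and the frame is assumed to be indexed by 'mobile'.

```python
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

group_1 = pd.DataFrame({'A': [1, 0, 1, 1, 1], 'B': [1, 1, 1, 1, 1]})
group_2 = pd.DataFrame({'A': [1, 1, 1, 1, 1], 'B': [1, 1, 0, 0, 0]})
group_3 = pd.DataFrame({'A': [1, 1, 1, 1, 1], 'B': [0, 0, 0, 0, 0]})

new_data_frame = pd.DataFrame({
    'A': [1, 1, 1, 1, 1],
    'B': [0, 0, 0, 0, 0],
    'mobile': ['xxxxx', 'yyyyy', 'zzzzz', 'wwwww', 'mmmmmm'],
}).set_index('mobile')  # index by mobile so only A and B are used as features

grps = {'group_1': group_1, 'group_2': group_2, 'group_3': group_3}

# One column per group: mean similarity of each mobile to that group's rows
score_table = pd.DataFrame(
    {name: cosine_similarity(new_data_frame, g).mean(axis=1) for name, g in grps.items()},
    index=new_data_frame.index,
)
assigned = score_table.idxmax(axis=1)  # group name with the highest mean per mobile

print(score_table.round(4))
print(assigned)
```

With this dummy data every new row is [1, 0], so each mobile gets the same three means: roughly 0.5657 for group_1, 0.8828 for group_2, and 1.0 for group_3, whose rows are all identical to the new rows.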

Answer 1 (2, accepted, 2022-09-03 07:48:58)

Solution解决方案

Result结果

查找不同 pandas dataframe 之间的余弦相似度

问题描述

1 个解决方案

解决方案1 2 已采纳 2022-09-03 07:48:58

Solution解决方案

Result结果

解决方案1
2 已采纳 2022-09-03 07:48:58