pandas: computing the Jaccard similarity of each row based on values in multiple columns

Question

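I have a dataframe like the one below, but with many more rows. For each document in the first column there is a similar label in the second column and a list of strings in the last column.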

import pandas as pd

data = {
    'First': ['First doc', 'Second doc', 'Third doc',
              'First doc', 'Second doc', 'Third doc',
              'First doc', 'Second doc', 'Third doc'],
    'second': ['First', 'Second', 'Third',
               'second', 'third', 'first',
               'third', 'first', 'second'],
    'third': [['old', 'far', 'gold', 'door'], ['old', 'view', 'bold', 'values'],
              ['new', 'view', 'sure', 'window'], ['old', 'bored', 'gold', 'door'],
              ['valued', 'this', 'bold', 'door'], ['new', 'view', 'seen', 'shirt'],
              ['old', 'bored', 'blouse', 'door'], ['valued', 'this', 'bold', 'open'],
              ['new', 'view', 'seen', 'win']],
}

df = pd.DataFrame(data, columns=['First', 'second', 'third'])
df

I came across this code for Jaccard similarity:

def lexical_overlap(doc1, doc2): 
   words_doc1 = set(doc1) 
   words_doc2 = set(doc2)

   intersection = words_doc1.intersection(words_doc2)
   union = words_doc1.union(words_doc2)

   return float(len(intersection)) / len(union) * 100
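
To make explicit what this function computes: the Jaccard similarity of two sets is the size of their intersection divided by the size of their union, here scaled to a percentage. The sketch below restates it under the hypothetical name `jaccard_percent` and adds a guard for two empty documents, which would otherwise divide by zero. Returning 0 in that case is an assumption, not something the original code specifies.

```python
def jaccard_percent(doc1, doc2):
    """Jaccard similarity of two token lists, as a percentage."""
    a, b = set(doc1), set(doc2)
    union = a | b
    if not union:
        # Assumption: two empty documents count as 0 percent similar
        return 0.0
    return len(a & b) / len(union) * 100


# First two rows of the 'third' column from the question
row0 = ['old', 'far', 'gold', 'door']
row1 = ['old', 'view', 'bold', 'values']

# One shared word ('old') out of seven distinct words: 1 / 7 * 100
print(round(jaccard_percent(row0, row1), 2))  # 14.29
```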

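What I want is to treat each row of the third column as a document, compare every pair of rows, and print a measure labelled with the corresponding values from the first and second columns, so that every combination is reported like this: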

   first doc(first) and second doc(first) are 23 percent similar

I have already asked a similar question and tried to adapt the answer, but had no luck extending it to multiple columns.

Answer 1

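This is not very elegant, but hopefully it gets the job done. I convert the 'third' column to a list. For each item in this list, I create a new dataframe new_df, which is a copy of the original dataframe df. I add a 'compared with' column to new_df to record which 'First' value the rows are being compared against. Then I use a lambda function on df to compute the lexical overlap between the two lists of strings.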

third_list = df['third'].tolist()
for i in range(len(third_list)):
    new_df = df.copy()
    new_df["compared with"] = df['First'].iloc[i]  # document that every row is compared against
    new_df["sim"] = df.apply(lambda x: lexical_overlap(x['third'], third_list[i]), axis=1)  # select the column by label rather than by position
    print("\n\n")
    print(new_df[['First', 'compared with', 'sim']])

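This produces output like the following (the first two of the nine tables are shown). A document compared with itself gets the highest similarity.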


        First compared with         sim
0   First doc     First doc  100.000000
1  Second doc     First doc   14.285714
2   Third doc     First doc    0.000000
3   First doc     First doc   60.000000
4  Second doc     First doc   14.285714
5   Third doc     First doc    0.000000
6   First doc     First doc   33.333333
7  Second doc     First doc    0.000000
8   Third doc     First doc    0.000000



        First compared with         sim
0   First doc    Second doc   14.285714
1  Second doc    Second doc  100.000000
2   Third doc    Second doc   14.285714
3   First doc    Second doc   14.285714
4  Second doc    Second doc   14.285714
5   Third doc    Second doc   14.285714
6   First doc    Second doc   14.285714
7  Second doc    Second doc   14.285714
8   Third doc    Second doc   14.285714

If you like, you can replace the print statement on line 7 (the last line of the loop) with the following:

print(new_df.apply(lambda x: f"{x['First']} and {x['compared with']} are {x['sim']:.2f} percent similar", axis=1))

This produces the output:

0    First doc and First doc are 100.00 percent sim...
1    Second doc and First doc are 14.29 percent sim...
2     Third doc and First doc are 0.00 percent similar
3    First doc and First doc are 60.00 percent similar
4    Second doc and First doc are 14.29 percent sim...
5     Third doc and First doc are 0.00 percent similar
6    First doc and First doc are 33.33 percent similar
7    Second doc and First doc are 0.00 percent similar
8     Third doc and First doc are 0.00 percent similar
dtype: object
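
As an alternative to printing one table per row, the same comparisons can be collected into a single square similarity matrix. The sketch below is a minimal illustration on a three-row subset of the question's data. The label format `"First doc (First)"` and the name `sim_matrix` are assumptions made for the example, not part of the answer above.

```python
import pandas as pd


def lexical_overlap(doc1, doc2):
    """Jaccard similarity of two token lists, as a percentage."""
    a, b = set(doc1), set(doc2)
    return len(a & b) / len(a | b) * 100


# Three rows taken from the question's dataframe, to keep the sketch short
df = pd.DataFrame({
    'First': ['First doc', 'Second doc', 'First doc'],
    'second': ['First', 'Second', 'second'],
    'third': [['old', 'far', 'gold', 'door'],
              ['old', 'view', 'bold', 'values'],
              ['old', 'bored', 'gold', 'door']],
})

# Hypothetical row labels combining the first and second columns
labels = [f"{first} ({second})" for first, second in zip(df['First'], df['second'])]

# Compare every row's word list with every other row's word list
sim_matrix = pd.DataFrame(
    [[lexical_overlap(a, b) for b in df['third']] for a in df['third']],
    index=labels,
    columns=labels,
).round(2)

print(sim_matrix)
```

The matrix is symmetric with 100 on the diagonal, so each pair appears twice. This is convenient for lookups such as `sim_matrix.loc['First doc (First)', 'First doc (second)']`, but wasteful if only the unique pairs are needed.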

Answer 2

OK, I figured out how to do this with the help of this reply from Amit Amola. What I did was improve the code to get all combinations:

from itertools import combinations

# every unordered pair of row positions
for i, j in combinations(range(len(df)), 2):
    firstlist = df.iloc[i, 2]
    secondlist = df.iloc[j, 2]

    value = round(lexical_overlap(firstlist, secondlist), 2)

    # join the first and second columns with a space to label each row
    print(f"{df.iloc[i, 0]} {df.iloc[i, 1]} and {df.iloc[j, 0]} {df.iloc[j, 1]}'s value is: {value}")

This prints the values from the first and second columns for each pair.

sample output:
First doc first and second doc first's value is 26.
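
If the pairwise results are needed for further processing rather than only printed, one option is to collect them into a dataframe. The sketch below assumes the same `lexical_overlap` function and uses a three-row subset of the question's data. The column names `doc_a`, `doc_b` and `similarity` and the label format are hypothetical choices for the example.

```python
from itertools import combinations

import pandas as pd


def lexical_overlap(doc1, doc2):
    """Jaccard similarity of two token lists, as a percentage."""
    a, b = set(doc1), set(doc2)
    return len(a & b) / len(a | b) * 100


# Three rows taken from the question's dataframe, to keep the sketch short
df = pd.DataFrame({
    'First': ['First doc', 'Second doc', 'First doc'],
    'second': ['First', 'Second', 'second'],
    'third': [['old', 'far', 'gold', 'door'],
              ['old', 'view', 'bold', 'values'],
              ['old', 'bored', 'gold', 'door']],
})

# One named tuple per row, with fields First, second and third
rows = list(df.itertuples(index=False))

# One record per unordered pair of rows
pairs = [
    (f"{a.First} ({a.second})",
     f"{b.First} ({b.second})",
     round(lexical_overlap(a.third, b.third), 2))
    for a, b in combinations(rows, 2)
]

result = pd.DataFrame(pairs, columns=['doc_a', 'doc_b', 'similarity'])
# Most similar pairs first
result = result.sort_values('similarity', ascending=False).reset_index(drop=True)

for r in result.itertuples(index=False):
    print(f"{r.doc_a} and {r.doc_b} are {r.similarity} percent similar")
```

With n rows this produces n * (n - 1) / 2 records, so for a large dataframe it may be worth filtering by a similarity threshold before storing or printing them.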

Answer 1
0 2020-12-22 19:54:28

Answer 2
0 2020-12-23 09:00:28
