pandas: calculate jaccard similarity for every row based on the value in multiple columns
I have a dataframe like the one below, but with more rows. For every document in the first column there are some similar tags in the second column and some strings in the last column.
import pandas as pd

data = {'First': ['First doc', 'Second doc', 'Third doc', 'First doc', 'Second doc',
                  'Third doc', 'First doc', 'Second doc', 'Third doc'],
        'second': ['First', 'Second', 'Third', 'second', 'third', 'first',
                   'third', 'first', 'second'],
        'third': [['old', 'far', 'gold', 'door'], ['old', 'view', 'bold', 'values'],
                  ['new', 'view', 'sure', 'window'], ['old', 'bored', 'gold', 'door'],
                  ['valued', 'this', 'bold', 'door'], ['new', 'view', 'seen', 'shirt'],
                  ['old', 'bored', 'blouse', 'door'], ['valued', 'this', 'bold', 'open'],
                  ['new', 'view', 'seen', 'win']]}
df = pd.DataFrame(data, columns=['First', 'second', 'third'])
df
I stumbled upon this piece of code for Jaccard similarity:
def lexical_overlap(doc1, doc2):
    words_doc1 = set(doc1)
    words_doc2 = set(doc2)
    intersection = words_doc1.intersection(words_doc2)
    union = words_doc1.union(words_doc2)
    return float(len(intersection)) / len(union) * 100
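As a quick sanity check of what this function computes, here it is applied to the first two word lists from the question's `third` column (they share only the word `'old'`, so the score is 1 shared word out of 7 unique words):

```python
def lexical_overlap(doc1, doc2):
    # Jaccard similarity as a percentage: |intersection| / |union| * 100
    words_doc1 = set(doc1)
    words_doc2 = set(doc2)
    intersection = words_doc1.intersection(words_doc2)
    union = words_doc1.union(words_doc2)
    return float(len(intersection)) / len(union) * 100

# First two rows of the question's 'third' column; only 'old' is shared
a = ['old', 'far', 'gold', 'door']
b = ['old', 'view', 'bold', 'values']
print(round(lexical_overlap(a, b), 2))  # 1 / 7 * 100 -> 14.29
```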
The result I would like to get is to take each row of the third column as a document, compare every pair iteratively, and output a measure together with the row names from the first and second columns, so that all combinations look like this:

first doc(first) and second doc(first) are 23 percent similar

I have already asked a similar question and tried to modify the answer, but had no luck adding the extra columns.
This isn't very elegant, but hopefully it gets the job done. I convert the 'third' column to a list. For each item in this list I create a new dataframe, new_df, which is a copy of the original dataframe df. I add a column 'compared with' to new_df to record which row of 'First' is being compared. Then I use a lambda function on df to compute the lexical overlap of the two lists of strings.
third_list = df['third'].tolist()
for i in range(0, len(third_list)):
    new_df = df.copy()
    new_df["compared with"] = df['First'].iloc[i]
    new_df["sim"] = df.apply(lambda x: lexical_overlap(x[2], df['third'].iloc[i]), axis=1)
    print("\n\n")
    print(new_df[['First', 'compared with', 'sim']])
This produces the following output. A document gets the highest similarity when compared with itself.
First compared with sim
0 First doc First doc 100.000000
1 Second doc First doc 14.285714
2 Third doc First doc 0.000000
3 First doc First doc 60.000000
4 Second doc First doc 14.285714
5 Third doc First doc 0.000000
6 First doc First doc 33.333333
7 Second doc First doc 0.000000
8 Third doc First doc 0.000000
First compared with sim
0 First doc Second doc 14.285714
1 Second doc Second doc 100.000000
2 Third doc Second doc 14.285714
3 First doc Second doc 14.285714
4 Second doc Second doc 14.285714
5 Third doc Second doc 14.285714
6 First doc Second doc 14.285714
7 Second doc Second doc 14.285714
8 Third doc Second doc 14.285714
If you like, you can replace the print on line 7 of the snippet as follows:
print(new_df.apply(lambda x:" ".join([x[0],'and',x[3], 'are', "{:.2f}".format(x[4]),'percent similar']), axis =1))
This creates the output:
0 First doc and First doc are 100.00 percent sim...
1 Second doc and First doc are 14.29 percent sim...
2 Third doc and First doc are 0.00 percent similar
3 First doc and First doc are 60.00 percent similar
4 Second doc and First doc are 14.29 percent sim...
5 Third doc and First doc are 0.00 percent similar
6 First doc and First doc are 33.33 percent similar
7 Second doc and First doc are 0.00 percent similar
8 Third doc and First doc are 0.00 percent similar
dtype: object
OK, I figured out how to do it with the help of this reply from Amit Amola. What I did was refine the code to get all combinations:
from itertools import combinations

for val in list(combinations(range(len(df)), 2)):
    firstlist = df.iloc[val[0], 2]
    secondlist = df.iloc[val[1], 2]
    value = round(lexical_overlap(firstlist, secondlist), 2)
    print(f"{df.iloc[val[0], 0] + df.iloc[val[0], 1]} and {df.iloc[val[1], 0] + df.iloc[val[1], 1]}'s value is: {value}")
This returns the values together with the first and second columns. Sample output:
First doc first and second doc first's value is 26.
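If you would rather collect the pairwise scores in a DataFrame than print them, the combinations loop above can be adapted as sketched below. This is only a sketch on a trimmed three-row version of the question's data; the column names `pair` and `sim` are my own choices, not from the original post.

```python
from itertools import combinations

import pandas as pd

# Trimmed three-row version of the question's data (assumption for brevity)
data = {'First': ['First doc', 'Second doc', 'Third doc'],
        'second': ['First', 'Second', 'Third'],
        'third': [['old', 'far', 'gold', 'door'],
                  ['old', 'view', 'bold', 'values'],
                  ['new', 'view', 'sure', 'window']]}
df = pd.DataFrame(data)

def lexical_overlap(doc1, doc2):
    # Jaccard similarity as a percentage
    words_doc1, words_doc2 = set(doc1), set(doc2)
    return len(words_doc1 & words_doc2) / len(words_doc1 | words_doc2) * 100

# Build one record per unordered pair instead of printing
rows = []
for i, j in combinations(range(len(df)), 2):
    rows.append({
        'pair': f"{df.iloc[i, 0]} ({df.iloc[i, 1]}) vs {df.iloc[j, 0]} ({df.iloc[j, 1]})",
        'sim': round(lexical_overlap(df.iloc[i, 2], df.iloc[j, 2]), 2),
    })
result = pd.DataFrame(rows)
print(result)
```

Keeping the scores in a DataFrame makes it easy to sort by `sim` or filter out the zero-overlap pairs afterwards.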