Python cosine-similarity on all possible pairs in list
I am slowly learning Python and wonder if someone could help. I have a list of IPs, occurrence_ids and vectors, called info_list:
('188.74.64.243', '1', ['0, 1, 1, 0'])
('99.229.98.18', '1', ['0, 1, 1, 1'])
('86.41.253.102', '1', ['1, 1, 1, 1'])
('188.74.64.243', '2', ['0, 1, 1, 0'])
('99.229.98.18', '2', ['0, 1, 1, 1'])
('86.41.253.102', '2', ['1, 1, 1, 1'])
I want to calculate the cosine similarity. I have the following:
import math

def cosine_similarity(v1,v2):
    # Accumulate the dot product and both squared norms in one pass.
    sumxx, sumxy, sumyy = 0, 0, 0
    for i in range(len(v1)):
        x = v1[i]; y = v2[i]
        sumxx += x*x
        sumyy += y*y
        sumxy += x*y
    return sumxy/math.sqrt(sumxx*sumyy)

v1 = [0, 1, 1, 0]
v2 = [1, 1, 1, 1]
print(v1, v2, cosine_similarity(v1,v2))
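This works fine when v1 and v2 are declared by hand. My problem is that I have hit a snag and cannot work out the next step. I would appreciate some help.
I need to iterate over info_list and calculate cosine_similarity for every pair of IPs that share the same occurrence_id.
An example of the output would be a list like this: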
('188.74.64.243', '99.229.98.18', '1', ['0, 1, 1, 0'],['0, 1, 1, 1'], 0.82 )
('188.74.64.243', '86.41.253.102', '1', ['0, 1, 1, 0'],['1, 1, 1, 1'], 0.71 )
('99.229.98.18', '86.41.253.102', '1', ['0, 1, 1, 1'],['1, 1, 1, 1'], 0.87 )
You can use Python's groupby and combinations functions from itertools as follows:
from itertools import groupby, combinations
import math

def cosine_similarity(v1,v2):
    sumxx, sumxy, sumyy = 0, 0, 0
    for i in range(len(v1)):
        x = v1[i]; y = v2[i]
        sumxx += x*x
        sumyy += y*y
        sumxy += x*y
    return sumxy/math.sqrt(sumxx * sumyy)

info_list = [
    ('188.74.64.243', '1', [0, 1, 1, 0]),
    ('99.229.98.18', '1', [0, 1, 1, 1]),
    ('86.41.253.102', '1', [1, 1, 1, 1]),
    ('188.74.64.243', '2', [0, 1, 1, 0]),
    ('99.229.98.18', '2', [0, 1, 1, 1]),
    ('86.41.253.102', '2', [1, 1, 1, 1]),
]

# Group consecutive entries by occurrence_id (x[1]), then pair up the members of each group.
for k, g in groupby(info_list, key=lambda x: x[1]):
    for x, y in combinations(g, 2):
        # Python 3: the extra parentheses print the result as a single tuple.
        print((x[0], y[0], x[1], x[2], y[2], cosine_similarity(x[2], y[2])))
This displays the following output:
('188.74.64.243', '99.229.98.18', '1', [0, 1, 1, 0], [0, 1, 1, 1], 0.8164965809277261)
('188.74.64.243', '86.41.253.102', '1', [0, 1, 1, 0], [1, 1, 1, 1], 0.7071067811865475)
('99.229.98.18', '86.41.253.102', '1', [0, 1, 1, 1], [1, 1, 1, 1], 0.8660254037844387)
('188.74.64.243', '99.229.98.18', '2', [0, 1, 1, 0], [0, 1, 1, 1], 0.8164965809277261)
('188.74.64.243', '86.41.253.102', '2', [0, 1, 1, 0], [1, 1, 1, 1], 0.7071067811865475)
('99.229.98.18', '86.41.253.102', '2', [0, 1, 1, 1], [1, 1, 1, 1], 0.8660254037844387)
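groupby only groups consecutive items, so if the list is not sorted, i.e. the IDs are not grouped together, replace the for k, g line with the following, which sorts by ID first: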
for k, g in groupby(sorted(info_list, key=lambda x: x[1]), key=lambda x: x[1]):
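Since the question asks for a list rather than printed lines, the sorted variant can be wrapped in a function that collects the tuples. This is only a minimal sketch: the names similar_pairs and unsorted_info are made up for illustration, and the input is assumed to hold vectors as lists of ints, as in the code above.

```python
from itertools import combinations, groupby
from operator import itemgetter
import math

def cosine_similarity(v1, v2):
    sumxx = sum(x * x for x in v1)
    sumyy = sum(y * y for y in v2)
    sumxy = sum(x * y for x, y in zip(v1, v2))
    return sumxy / math.sqrt(sumxx * sumyy)

def similar_pairs(info_list):
    """Hypothetical helper: one result tuple per pair of IPs sharing an occurrence_id."""
    by_id = itemgetter(1)
    results = []
    # Sort first so that groupby sees each occurrence_id as one consecutive run.
    for occurrence_id, group in groupby(sorted(info_list, key=by_id), key=by_id):
        for (ip1, _, v1), (ip2, _, v2) in combinations(group, 2):
            results.append((ip1, ip2, occurrence_id, v1, v2, cosine_similarity(v1, v2)))
    return results

# Same data as above, deliberately interleaved so the IDs are not grouped.
unsorted_info = [
    ('188.74.64.243', '1', [0, 1, 1, 0]),
    ('188.74.64.243', '2', [0, 1, 1, 0]),
    ('99.229.98.18', '1', [0, 1, 1, 1]),
    ('99.229.98.18', '2', [0, 1, 1, 1]),
    ('86.41.253.102', '1', [1, 1, 1, 1]),
    ('86.41.253.102', '2', [1, 1, 1, 1]),
]

for row in similar_pairs(unsorted_info):
    print(row)
```

Because sorted is stable, entries with the same ID keep their original relative order, so the pairs come out in the same order as in the output shown earlier.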
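Keeping the data as it is (vectors represented as strings), you can write a function that takes two tuples, unpacks the strings into int vectors, applies the similarity function, and repacks the result. Then use this function in a basic nested loop: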
import math

def cosine_similarity(v1,v2):
    sumxx, sumxy, sumyy = 0, 0, 0
    for i in range(len(v1)):
        x, y = v1[i],v2[i]
        sumxx += x*x
        sumyy += y*y
        sumxy += x*y
    return sumxy/math.sqrt(sumxx*sumyy)

def c_sim(t1,t2):
    ips1,id1,vlist1 = t1
    ips2,id2,vlist2 = t2
    # Each vector is stored as a one-element list holding a comma-separated string.
    v1 = [int(i) for i in vlist1[0].split(',')]
    v2 = [int(i) for i in vlist2[0].split(',')]
    # Implicitly returns None when the occurrence ids differ.
    if id1 == id2:
        return ips1,ips2,id1,vlist1,vlist2,cosine_similarity(v1,v2)

def process_list(data_list):
    n = len(data_list)
    ret_list = []
    # Visit every unordered pair (i, j) with i < j exactly once.
    for i in range(n-1):
        for j in range(i+1,n):
            t1,t2 = data_list[i],data_list[j]
            t = c_sim(t1,t2)
            if t: ret_list.append(t)
    return ret_list

data = [('188.74.64.243', '1', ['0, 1, 1, 0']),
        ('99.229.98.18', '1', ['0, 1, 1, 1']),
        ('86.41.253.102', '1', ['1, 1, 1, 1']),
        ('188.74.64.243', '2', ['0, 1, 1, 0']),
        ('99.229.98.18', '2', ['0, 1, 1, 1']),
        ('86.41.253.102', '2', ['1, 1, 1, 1'])]

for t in process_list(data): print(t)
Output:
('188.74.64.243', '99.229.98.18', '1', ['0, 1, 1, 0'], ['0, 1, 1, 1'], 0.8164965809277261)
('188.74.64.243', '86.41.253.102', '1', ['0, 1, 1, 0'], ['1, 1, 1, 1'], 0.7071067811865475)
('99.229.98.18', '86.41.253.102', '1', ['0, 1, 1, 1'], ['1, 1, 1, 1'], 0.8660254037844387)
('188.74.64.243', '99.229.98.18', '2', ['0, 1, 1, 0'], ['0, 1, 1, 1'], 0.8164965809277261)
('188.74.64.243', '86.41.253.102', '2', ['0, 1, 1, 0'], ['1, 1, 1, 1'], 0.7071067811865475)
('99.229.98.18', '86.41.253.102', '2', ['0, 1, 1, 1'], ['1, 1, 1, 1'], 0.8660254037844387)
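The string unpacking that c_sim repeats for every pair could also be done once up front, which turns the data into the list-of-ints form that the groupby answer expects. A rough sketch, assuming every vector is a one-element list holding a comma-separated string as in the question; parse_vector and parsed are names invented here:

```python
def parse_vector(vlist):
    """Hypothetical helper: turn ['0, 1, 1, 0'] into [0, 1, 1, 0]."""
    # int() tolerates the space left after each comma.
    return [int(part) for part in vlist[0].split(',')]

data = [('188.74.64.243', '1', ['0, 1, 1, 0']),
        ('99.229.98.18', '1', ['0, 1, 1, 1']),
        ('86.41.253.102', '1', ['1, 1, 1, 1']),
        ('188.74.64.243', '2', ['0, 1, 1, 0']),
        ('99.229.98.18', '2', ['0, 1, 1, 1']),
        ('86.41.253.102', '2', ['1, 1, 1, 1'])]

# Convert each record once instead of re-parsing inside every comparison.
parsed = [(ip, occurrence_id, parse_vector(vlist)) for ip, occurrence_id, vlist in data]

for record in parsed:
    print(record)
```

The trade-off is that the result tuples would then carry int lists rather than the original strings, so this only fits if the string form does not need to be preserved in the output.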