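How to speed up Distance Matrix calculation?
Task: I am new to Python and am currently working on a clustering task in which I compute the similarity between users' clickstreams. I use the Jaccard index to compare the click sets (clickstreams) of every pair of users, store the results in an NxN distance matrix, and then run Ward's clustering algorithm on that matrix.
Problem: I tried it with all the data from 1 day (around 85 session IDs/users) and it worked like a charm. Now, with 949 unique users, the computation takes a very long time, which is probably due to my inefficient code.
Here is a snapshot of my df stack_dataframe: 9329 rows x 2 columns
Here is my code to compute the distance matrix: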
import itertools
import pandas as pd
# Method to compute Jaccard similarity index between two sets
def compute_jaccard(session1_vals, session2_vals):
    intersection = session1_vals.intersection(session2_vals)
    union = session1_vals.union(session2_vals)
    jaccard = len(intersection)/float(len(union))
    return jaccard
ID_keys = stack_dataframe.groupby(['Session ID']).groups.keys()
print("hallo1")
New_stack_df = stack_dataframe.pivot(columns="Session ID", values="Page")
print("hallo2")
sim_df = pd.DataFrame(columns=ID_keys, index=ID_keys)
# Iterate through columns and compute metric
test = 0
print("hallo3")
for col_pair in itertools.combinations(New_stack_df.columns, 2):
    print(test)
    test += 1
    u1 = col_pair[0]
    u2 = col_pair[1]
    sim_df.loc[col_pair] = compute_jaccard(set(New_stack_df[u1].dropna()),
                                           set(New_stack_df[u2].dropna()))
print(sim_df)
Any help is much appreciated, thanks!
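Your approach is extremely inefficient. The inefficiency is mainly due to the O(n^2) loop over itertools.combinations(..), in which every pair of sessions is processed in pure Python, with a pandas lookup and fresh set construction on each iteration.
So the code is: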
from __future__ import division
import time
import pandas as pd
import numpy as np
from scipy.spatial import distance
import numba as nb
def _make_2D_array(lis):
    # Pad a list of variable-length arrays into one 2D array plus the true lengths
    n = len(lis)
    lengths = np.array([len(x) for x in lis])
    max_len = max(lengths)
    arr = np.zeros((n, max_len))
    for i in range(n):
        arr[i, :lengths[i]] = lis[i]
    return arr, lengths
@nb.jit(nopython=True, cache=True)
def compute_jaccard(session1, session2, arr, lengths):
    """Jitted function to calculate the Jaccard index (intersection over union)
    """
    session1, session2 = session1[0], session2[0]
    intersection, union = 0, 0
    # make session1 the longer of the two sessions
    if(lengths[session2] > lengths[session1]):
        session1, session2 = session2, session1
    # marked[k] becomes 1 once page k of session2 has been matched
    marked = np.zeros((lengths[session2],))
    for x in arr[session1][:lengths[session1]]:
        x_in_2 = arr[session2][:lengths[session2]] == x
        marked[x_in_2] = 1
        if(np.any(x_in_2)):
            intersection+=1
            union+=1
        else:
            union+=1
    # pages of session2 that never matched also belong to the union
    union+=np.sum(marked==0)
    jaccard = intersection/union
    return jaccard
def calculate_sim_between(stack_dataframe):
    # get integer encodings for session ids and pages
    session_encode, sessions = pd.factorize(stack_dataframe["Session ID"])
    page_encode, pages = pd.factorize(stack_dataframe["Page"])
    # take unique pages in each session
    pages_in_sessions = [np.unique(page_encode[session_encode==x]) for x in range(len(sessions))]
    # convert the list of lists to numpy array
    arr, lengths = _make_2D_array(pages_in_sessions)
    # make a dummy array like [[0], [1], [2] ...] to get the distances between every pair of sessions
    _sessions = np.arange(len(sessions))[:, np.newaxis]
    # get the pairwise values (Jaccard similarities, despite the variable name)
    distances = distance.cdist(_sessions, _sessions, compute_jaccard, arr=arr, lengths=lengths)
    sim_df = pd.DataFrame(distances, columns=sessions, index=sessions)
    return sim_df
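Since the question feeds the matrix into Ward's clustering, here is a minimal sketch of that last step. The 4x4 matrix below is made-up illustration data standing in for the real sim_df, and the choice of two clusters is arbitrary. The values are similarities, so they are first turned into distances with 1 - similarity. Keep in mind that the SciPy documentation describes the "ward" method as correctly defined only for Euclidean distances, so treat the result on Jaccard distances as a heuristic.

```python
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

# Hypothetical Jaccard similarity matrix standing in for sim_df.
labels = ["s1", "s2", "s3", "s4"]
sim_df = pd.DataFrame(
    [[1.0, 0.8, 0.1, 0.0],
     [0.8, 1.0, 0.2, 0.1],
     [0.1, 0.2, 1.0, 0.9],
     [0.0, 0.1, 0.9, 1.0]],
    index=labels, columns=labels)

# Jaccard distance is 1 - Jaccard similarity.
dist = 1.0 - sim_df.to_numpy(dtype=float)
np.fill_diagonal(dist, 0.0)

# linkage expects the condensed (upper-triangle) form of the square matrix.
condensed = squareform(dist, checks=True)
Z = linkage(condensed, method="ward")

# Cut the tree into two flat clusters (the cluster count is an assumption).
clusters = pd.Series(fcluster(Z, t=2, criterion="maxclust"), index=labels)
print(clusters)
```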
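Note that the compute_jaccard function is compiled with numba to squeeze out even the time spent in its single-level loop. If you don't want to install numba, just comment out the decorator.
On this sample data: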
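Instead of commenting the decorator out by hand, the fallback can be made automatic. The sketch below is one possible pattern, not part of the code above: maybe_jit and count_common are hypothetical names, and the toy function only illustrates that the same loop runs with or without numba installed.

```python
import numpy as np

try:
    import numba as nb
    # numba is available: compile decorated functions in nopython mode.
    maybe_jit = nb.njit
except ImportError:
    # numba is not installed: fall back to a decorator that does nothing.
    def maybe_jit(func):
        return func

@maybe_jit
def count_common(a, b):
    # Count the values of 1-D array a that also occur in 1-D array b.
    common = 0
    for x in a:
        for y in b:
            if x == y:
                common += 1
                break
    return common

print(count_common(np.array([1.0, 2.0, 3.0]), np.array([2.0, 3.0, 4.0])))
```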
from faker import Faker
fake = Faker()
stack_dataframe = pd.DataFrame({"Session ID":[fake.name() for i in range(200)], "Page":[fake.name() for i in range(200)]})
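the timings are:
Your approach: 69.465s
Without jit: 0.374s
With jit (on the second run, to discount the compile time): 0.147s
PS: Since fake data was used in order to run on a larger sample and observe the speedup, the timing profile may differ slightly on your actual data.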
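If numba is not an option at all, a different technique can remove the pairwise Python loop entirely: build a binary session-by-page matrix and get all intersections from a single matrix product. This is a sketch of that alternative, not the method timed above. The function name jaccard_matrix is hypothetical, and it assumes the session-by-page matrix fits in memory, which should hold for roughly a thousand sessions.

```python
import numpy as np
import pandas as pd

def jaccard_matrix(df, session_col="Session ID", page_col="Page"):
    # Session-by-page table; a cell is positive if the session visited the page.
    table = pd.crosstab(df[session_col], df[page_col])
    indicator = (table.to_numpy() > 0).astype(np.int64)
    # Entry (i, j) counts the pages shared by sessions i and j.
    intersection = indicator @ indicator.T
    # Number of distinct pages per session.
    sizes = indicator.sum(axis=1)
    # |A union B| = |A| + |B| - |A intersect B|
    union = sizes[:, None] + sizes[None, :] - intersection
    # Every session in the table has at least one page, so union is never 0.
    sim = intersection / union
    return pd.DataFrame(sim, index=table.index, columns=table.index)

# Tiny made-up example: a = {x, y}, b = {y, z}, c = {w}
example = pd.DataFrame({
    "Session ID": ["a", "a", "b", "b", "c"],
    "Page": ["x", "y", "y", "z", "w"],
})
sim = jaccard_matrix(example)
print(sim)
```

The result is labelled by the sorted session IDs that pd.crosstab produces, so the row order can differ from the order of first appearance used by pd.factorize.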