[英]How to speed up Distance Matrix calculation?
Task: I am new to python and currently working on a clustering task where I compute the similarity between users clickstreams. 任务:我是python的新手,目前正在执行集群任务,在该集群中,我计算用户点击流之间的相似度。 Therefore I am using the Jaccard Index to compare the click sets (clickstream) of each two users, saving the results in an NxN distance matrix and then performing Wards clustering algorithm on this distance matrix. 因此,我使用Jaccard Index比较每个两个用户的点击集(clickstream),将结果保存在NxN距离矩阵中,然后对该距离矩阵执行Wards聚类算法。
Problem: Having tried out everything with data from 1 day (around 85 Session IDs/users), it worked like a charm. 问题:从1天起就尝试了所有数据(大约85个会话ID /用户),它像一种魅力一样工作。 Now with 949 unique users, the computation takes ages, probably due to my inefficient code. 现在有949个唯一用户,计算可能要花一些时间,这可能是由于我的代码效率低下所致。
Here is a snapshot from my df stack_dataframe: 9329 rows x 2 columns 这是我的df stack_dataframe的快照:9329行x 2列
Here is my code for computing the distance matrix: 这是我计算距离矩阵的代码:
import itertools
import pandas as pd
# Method to compute Jaccard similarity index between two sets
def compute_jaccard(session1_vals, session2_vals):
intersection = session1_vals.intersection(session2_vals)
union = session1_vals.union(session2_vals)
jaccard = len(intersection)/float(len(union))
return jaccard
stID_keys = stack_dataframe.groupby(['Session ID']).groups.keys()
print("hallo1")
New_stack_df = stack_dataframe.pivot(columns="Session ID", values="Page")
print("hallo2")
sim_df = pd.DataFrame(columns=ID_keys, index=ID_keys)
# Iterate through columns and compute metric
test = 0
print("hallo3")
for col_pair in itertools.combinations(New_stack_df.columns, 2):
print(test)
test += 1
u1= col_pair[0]
u2 = col_pair[1]
sim_df.loc[col_pair] = compute_jaccard(set(New_stack_df[u1].dropna()),
set(New_stack_df[u2].dropna()))
print(sim_df)
Any help much appreciated, thanks! 任何帮助,不胜感激,谢谢!
Your method is highly inefficient. 您的方法效率极低。 The inefficiency is mainly due to two reasons : 效率低下主要是由于两个原因:
itertools.combinations(..)
itertools.combinations(..)
的O(n ^ 2)循环 We solve these by 我们通过解决这些
So the code is: 所以代码是:
from __future__ import division
import time
import pandas as pd
import numpy as np
from scipy.spatial import distance
import numba as nb
def _make_2D_array(lis):
n = len(lis)
lengths = np.array([len(x) for x in lis])
max_len = max(lengths)
arr = np.zeros((n, max_len))
for i in range(n):
arr[i, :lengths[i]] = lis[i]
return arr, lengths
@nb.jit(nopython=True, cache=True)
def compute_jaccard(session1, session2, arr, lengths):
"""Jited funciton to calculate jaccard distance
"""
session1, session2 = session1[0], session2[0]
intersection, union = 0, 0
if(lengths[session2] > lengths[session1]):
session1, session2 = session2, session1
marked = np.zeros((lengths[session2],))
for x in arr[session1][:lengths[session1]]:
x_in_2 = arr[session2][:lengths[session2]] == x
marked[x_in_2] = 1
if(np.any(x_in_2)):
intersection+=1
union+=1
else:
union+=1
union+=np.sum(marked==0)
jaccard = intersection/union
return jaccard
def calculate_sim_between(stack_dataframe):
# get integer encodings for session ids and pages
session_encode, sessions = pd.factorize(stack_dataframe["Session ID"])
page_encode, pages = pd.factorize(stack_dataframe["Page"])
# take unique pages in each session
pages_in_sessions = [np.unique(page_encode[session_encode==x]) for x in range(len(sessions))]
# convert the list of lists to numpy array
arr, lengths = _make_2D_array(pages_in_sessions)
# make a dummy array like [[0], [1], [2] ...] to get the distances between every pair of sessions
_sessions = np.arange(len(sessions))[:, np.newaxis]
# get the distances
distances = distance.cdist(_sessions, _sessions, compute_jaccard, arr=arr, lengths=lengths)
sim_df = pd.DataFrame(distances, columns=sessions, index=sessions)
return sim_df
Notice that, we numba compile compute_jaccard function to eke out event the single level loop time. 请注意,我们通过numba编译compute_jaccard函数来触发事件的单级循环时间。 If you don't want to install numba, just comment out the decorator. 如果您不想安装numba,只需注释掉装饰器即可。
On this sample data: 在此样本数据上:
from faker import Faker
fake = Faker()
stack_dataframe = pd.DataFrame({"Session ID":[fake.name() for i in range(200)], "Page":[fake.name() for i in range(200)]})
timings are 时间是
Your method : 69.465s 您的方法:69.465s
Without jit : 0.374s 无抖动:0.374s
With jit(on second run, to discount for the compile time) : 0.147s 使用jit(第二次运行时,可减少编译时间):0.147s
PS : Since we use fake data to run on a largish sample to observe the speedup, on your actual data, the timing profile may be slightly different. PS:由于我们使用假数据运行较大的样本来观察加速,因此在您的实际数据上,时序配置可能会略有不同。
声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.