
How to speed up Distance Matrix calculation?

Task: I am new to Python and currently working on a clustering task where I compute the similarity between users' clickstreams. I use the Jaccard index to compare the click sets (clickstreams) of each pair of users, save the results in an N x N distance matrix, and then run Ward's clustering algorithm on that matrix.
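As a quick refresher, the Jaccard index of two click sets is the size of their intersection divided by the size of their union; a minimal sketch (the page names are made up, not the real data):

```python
# Jaccard index = |A ∩ B| / |A ∪ B| on two users' click sets
clicks_a = {"home", "search", "product"}
clicks_b = {"home", "cart", "product"}

jaccard = len(clicks_a & clicks_b) / len(clicks_a | clicks_b)
print(jaccard)  # 2 shared pages out of 4 distinct pages -> 0.5
```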

Problem: Having tried everything out with data from one day (around 85 session IDs/users), it worked like a charm. Now, with 949 unique users, the computation takes ages, probably due to my inefficient code.

Here is a snapshot of my df stack_dataframe: 9329 rows x 2 columns

Here is my code for computing the distance matrix:

import itertools
import pandas as pd

# Method to compute Jaccard similarity index between two sets
def compute_jaccard(session1_vals, session2_vals):
    intersection = session1_vals.intersection(session2_vals)
    union = session1_vals.union(session2_vals)
    jaccard = len(intersection)/float(len(union))
    return jaccard


stID_keys = stack_dataframe.groupby(['Session ID']).groups.keys()
New_stack_df = stack_dataframe.pivot(columns="Session ID", values="Page")
sim_df = pd.DataFrame(columns=stID_keys, index=stID_keys)

# Iterate over all column pairs and compute the metric
for col_pair in itertools.combinations(New_stack_df.columns, 2):
    u1, u2 = col_pair
    sim_df.loc[col_pair] = compute_jaccard(set(New_stack_df[u1].dropna()),
                                           set(New_stack_df[u2].dropna()))

print(sim_df)

Any help is much appreciated, thanks!

Your method is highly inefficient, mainly for two reasons:

  • The O(n^2) Python-level loop over itertools.combinations(..)
  • Heavy pandas usage. Though easy to use, pandas is somewhat inefficient due to the bookkeeping it performs.

We solve these by:

  • using scipy's distance.cdist (whose source is written in C) to calculate the distances between all pairs;
  • using numpy instead of pandas;
  • JIT-compiling the Jaccard distance function, since it is called a large number of times.
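For context, distance.cdist also accepts a plain Python callable as the metric and invokes it once per pair of rows; a tiny self-contained illustration (the absolute-difference metric here is just for demonstration):

```python
import numpy as np
from scipy.spatial import distance

# cdist calls the callable once for every (row of a, row of b) pair
a = np.array([[0.0], [1.0], [2.0]])
d = distance.cdist(a, a, lambda u, v: abs(u[0] - v[0]))
print(d)  # d[i, j] == |i - j| for this toy input
```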

So the code is:

from __future__ import division
import time
import pandas as pd
import numpy as np
from scipy.spatial import distance
import numba as nb

def _make_2D_array(lis):
    n = len(lis)
    lengths = np.array([len(x) for x in lis])
    max_len = max(lengths)
    arr = np.zeros((n, max_len))
    for i in range(n):
        arr[i, :lengths[i]] = lis[i]
    return arr, lengths

@nb.jit(nopython=True, cache=True)
def compute_jaccard(session1, session2, arr, lengths):
    """JIT-compiled function to calculate the Jaccard index between two sessions.
    """
    session1, session2 = session1[0], session2[0]
    intersection, union = 0, 0

    if(lengths[session2] > lengths[session1]):
        session1, session2 = session2, session1

    marked = np.zeros((lengths[session2],))

    for x in arr[session1][:lengths[session1]]:
        x_in_2 = arr[session2][:lengths[session2]] == x
        marked[x_in_2] = 1
        if(np.any(x_in_2)):
            intersection+=1
            union+=1
        else:
            union+=1

    union+=np.sum(marked==0)

    jaccard = intersection/union

    return jaccard

def calculate_sim_between(stack_dataframe):
    # get integer encodings for session ids and pages
    session_encode, sessions = pd.factorize(stack_dataframe["Session ID"])
    page_encode, pages = pd.factorize(stack_dataframe["Page"])

    # take unique pages in each session 
    pages_in_sessions = [np.unique(page_encode[session_encode==x]) for x in range(len(sessions))]

    # convert the list of lists to numpy array
    arr, lengths = _make_2D_array(pages_in_sessions)

    # make a dummy array like [[0], [1], [2] ...] to get the distances between every pair of sessions
    _sessions = np.arange(len(sessions))[:, np.newaxis]

    # get the distances
    distances = distance.cdist(_sessions, _sessions, compute_jaccard, arr=arr, lengths=lengths)

    sim_df = pd.DataFrame(distances, columns=sessions, index=sessions)
    return sim_df
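The integer encoding at the top of calculate_sim_between relies on pd.factorize, which maps each label to a dense integer code and returns the unique labels alongside; for example (toy session IDs):

```python
import pandas as pd

# factorize returns (codes, uniques): codes[i] is the dense index of row i's label
codes, uniques = pd.factorize(pd.Series(["s1", "s2", "s1", "s3"]))
print(list(codes))    # [0, 1, 0, 2]
print(list(uniques))  # ['s1', 's2', 's3']
```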

Notice that we numba-compile the compute_jaccard function to eke out even the single-level loop time. If you don't want to install numba, just comment out the decorator.

Timings:

On this sample data:

from faker import Faker
fake = Faker()
stack_dataframe = pd.DataFrame({"Session ID":[fake.name() for i in range(200)], "Page":[fake.name() for i in range(200)]})

the timings are:

Your method: 69.465 s

Without jit: 0.374 s

With jit (on the second run, to discount the compile time): 0.147 s

PS: Since we use fake data on a largish sample to observe the speedup, the timing profile on your actual data may be slightly different.
