
High performance all-to-all comparison of vectors in Python

First, some background: several methods for comparing clusterings rely on so-called pair counting. We have two flat clustering membership vectors a and b over the same n entities. For pair counting we check, for every possible pair of entities, whether the two entities belong to the same cluster in both clusterings, to the same cluster in a but different clusters in b, the opposite, or to different clusters in both. This yields 4 counts, let's call them n11, n10, n01, n00. These are the inputs for various metrics.
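For concreteness, a direct (and slow) reference implementation of this counting could look like the following sketch; pair_counts_naive is a hypothetical helper used only to illustrate the definition:

```python
from itertools import combinations

def pair_counts_naive(a, b):
    # O(n^2) reference: classify every unordered pair of entities
    n11 = n10 = n01 = n00 = 0
    for i, j in combinations(range(len(a)), 2):
        same_a = a[i] == a[j]
        same_b = b[i] == b[j]
        if same_a and same_b:
            n11 += 1
        elif same_a:
            n10 += 1
        elif same_b:
            n01 += 1
        else:
            n00 += 1
    return n11, n10, n01, n00

# e.g. for a = [0, 0, 1], b = [0, 1, 1]:
# pair (0,1) is together only in a, (1,2) only in b, (0,2) in neither
print(pair_counts_naive([0, 0, 1], [0, 1, 1]))  # → (0, 1, 1, 1)
```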

When the number of entities is around 10,000 and there are dozens or more clusterings to compare, performance becomes an issue: each comparison involves about 10^8 pairs, and for an all-to-all comparison of the clusterings this needs to be performed about 10^4 times.

A naive Python implementation took practically forever, so I turned to Cython and numpy. This way I could push the time for one comparison down to around 0.9-3.0 seconds, which still means half a day of runtime in my case. I am wondering whether you see any possibility for a performance improvement, with some clever algorithm, a C library, or anything else.

Here are my attempts:

1) This one counts without allocating huge arrays for all the pairs. It takes 2 membership vectors a1, a2 of length n and returns the counts:

cimport cython
import numpy as np
cimport numpy as np

ctypedef np.int32_t DTYPE_t

@cython.boundscheck(False)
def pair_counts(
    np.ndarray[DTYPE_t, ndim = 1] a1,
    np.ndarray[DTYPE_t, ndim = 1] a2,
    ):

    cdef unsigned int a1s = a1.shape[0]
    cdef unsigned int a2s = a2.shape[0]

    cdef unsigned int n11, n10, n01, n00
    n11 = n10 = n01 = n00 = 0
    # type the loop variables too, otherwise they may fall back to Python objects
    cdef unsigned int i, j, j0

    for i in range(0, a1s - 1):
        j0 = i + 1
        for j in range(j0, a2s):
            if a1[i] == a1[j] and a2[i] == a2[j]:
                n11 += 1
            elif a1[i] == a1[j]:
                n10 += 1
            elif a2[i] == a2[j]:
                n01 += 1
            else:
                n00 += 1

    return n11, n10, n01, n00

2) This one first calculates a comembership vector for each of the 2 clusterings (of length n * (n-1) / 2, one element per entity pair), then calculates the counts from these vectors. It allocates ~20-40 MB at each comparison but, interestingly, is faster than the previous one. Note: c is a wrapper class around a clustering, which holds the usual membership vector, but also a c.members dict containing, for each cluster, the indices of its entities as a numpy array.

cimport cython
import numpy as np
cimport numpy as np

@cython.boundscheck(False)
def comembership(c):
    """
    Returns the comembership vector, where each value tells whether a pair
    of entities belongs to the same cluster (1) or not (0).
    """
    cdef int n = len(c.memberships)
    cdef int cnum = c.cnum
    cdef int ri, ci, i, ii, j, li

    cdef unsigned char[:] cmem = \
        np.zeros((int(n * (n - 1) / 2), ), dtype = np.uint8)

    for ci in range(cnum):
        # here we use the members dict to have the indices of entities
        # in cluster (ci), as a numpy array (mi)
        mi = c.members[ci]
        for i in range(len(mi) - 1):
            ii = mi[i]
            # convert the pair (ii, j), ii < j, of an n x n matrix into
            # the index of a 1 x (n * (n - 1) / 2) condensed vector;
            # ri is the row offset, adding j below gives the final index
            ri = n * ii - ii * (ii + 1) // 2 - ii - 1
            for j in mi[i+1:]:
                # also here, adding j only for having the correct index
                li = ri + j
                cmem[li] = 1

    return np.array(cmem)

def pair_counts(c1, c2):
    p1 = comembership(c1)
    p2 = comembership(c2)
    n = len(c1.memberships)

    a11 = p1 * p2

    n11 = a11.sum()
    n10 = (p1 - a11).sum()
    n01 = (p2 - a11).sum()
    n00 = n * (n - 1) // 2 - n11 - n10 - n01  # all pairs minus the rest

    return n11, n10, n01, n00
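The index arithmetic in comembership() amounts to mapping an unordered pair (i, j), i < j, to its position in the condensed length n*(n-1)/2 vector. A standalone sketch of that mapping (condensed_index is a hypothetical name), cross-checked against plain enumeration order:

```python
from itertools import combinations

def condensed_index(i, j, n):
    # position of the unordered pair (i, j), i < j, in a vector storing
    # the upper triangle of an n x n matrix row by row
    return n * i - i * (i + 1) // 2 + (j - i - 1)

# cross-check against the order produced by enumerating all pairs
n = 6
for k, (i, j) in enumerate(combinations(range(n), 2)):
    assert condensed_index(i, j, n) == k
```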

3) This is a pure numpy solution, creating an n x n boolean array of the comemberships of entity pairs. The inputs are the membership vectors (a1, a2).

def pair_counts(a1, a2):

    n = len(a1)
    cmem1 = a1.reshape([n,1]) == a1.reshape([1,n])
    cmem2 = a2.reshape([n,1]) == a2.reshape([1,n])

    n11 = int(((cmem1 & cmem2).sum() - n) // 2)
    n10 = int((cmem1.sum() - n) // 2) - n11
    n01 = int((cmem2.sum() - n) // 2) - n11
    n00 = n * (n - 1) // 2 - n11 - n10 - n01

    return n11, n10, n01, n00
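A tiny worked example of this comembership-matrix idea (note that the intersection has to use logical AND, not equality, to isolate n11 — comparing the matrices with == would also count pairs separated in both clusterings):

```python
import numpy as np

a1 = np.array([0, 0, 1, 1])
a2 = np.array([0, 1, 1, 1])
n = len(a1)

cmem1 = a1[:, None] == a1[None, :]
cmem2 = a2[:, None] == a2[None, :]

n11 = int(((cmem1 & cmem2).sum() - n) // 2)   # together in both: (2,3)
n10 = int((cmem1.sum() - n) // 2) - n11       # only in a1: (0,1)
n01 = int((cmem2.sum() - n) // 2) - n11       # only in a2: (1,2), (1,3)
n00 = n * (n - 1) // 2 - n11 - n10 - n01      # in neither: (0,2), (0,3)

print(n11, n10, n01, n00)  # → 1 1 2 2
```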

Edit: example data

import numpy as np

a1 = np.random.randint(0, 1868, 14440, dtype = np.int32)
a2 = np.random.randint(0, 484, 14440, dtype = np.int32)

# to have the members dicts used in example 2:
def get_cnum(a):
    """
    Returns number of clusters.
    """
    return len(np.unique(a))

def get_members(a):
    """
    Returns a dict with cluster labels as keys and member entities
    as sorted numpy arrays.
    """
    members = {label: [] for label in range(max(a) + 1)}
    for entity, label in enumerate(a):
        members[label].append(entity)
    return {
        label: np.array(sorted(entities), dtype = np.int64)
        for label, entities in members.items()
    }

members1 = get_members(a1)
members2 = get_members(a2)
cnum1 = get_cnum(a1)
cnum2 = get_cnum(a2)

The approach based on sorting has merit, but can be done a lot simpler:

def pair_counts(a, b):
    n = a.shape[0]  # also b.shape[0]

    counts_a = np.bincount(a)
    counts_b = np.bincount(b)
    sorter_a = np.argsort(a)

    n11 = 0
    same_a_offset = np.cumsum(counts_a)
    for indices in np.split(sorter_a, same_a_offset):
        b_check = b[indices]
        n11 += np.count_nonzero(b_check == b_check[:,None])

    n11 = (n11-n) // 2
    n10 = (np.sum(counts_a**2) - n) // 2 - n11
    n01 = (np.sum(counts_b**2) - n) // 2 - n11
    n00 = (n**2 - n) // 2 - n11 - n10 - n01

    return n11, n10, n01, n00
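The argsort / cumsum / np.split combination above is what groups entity indices by cluster label; a minimal standalone illustration with made-up data (a stable sort is used here so the output is deterministic):

```python
import numpy as np

a = np.array([2, 0, 1, 0, 2, 2])
counts = np.bincount(a)                      # cluster sizes: [2, 1, 3]
order = np.argsort(a, kind='stable')         # entity indices sorted by label
groups = np.split(order, np.cumsum(counts))  # one group per label + empty tail

print([g.tolist() for g in groups[:-1]])     # → [[1, 3], [2], [0, 4, 5]]
```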

If this method is coded efficiently in Cython, there's another speedup (probably ~20x) to be gained.


Edit:

I found a way to completely vectorize the procedure and lower the time complexity:

def sizes2count(a, n):
    return (np.inner(a, a) - n) // 2

def pair_counts_vec_nlogn(a, b):
    # Sizes of "11" clusters (a[i]==a[j] & b[i]==b[j]) via a unique joint label
    ab = a * (b.max() + 1) + b  # make sure max(a) * (max(b) + 1) fits the dtype!
    _, sizes = np.unique(ab, return_counts=True)

    # Calculate the counts for each type of pairing
    n = len(a)  # also len(b)
    n11 = sizes2count(sizes, n)
    n10 = sizes2count(np.bincount(a), n) - n11
    n01 = sizes2count(np.bincount(b), n) - n11
    n00 = (n**2 - n) // 2 - n11 - n10 - n01

    return n11, n10, n01, n00

def pair_counts_vec_linear(a, b):
    # Label "11" clusters (a[i]==a[j] & b[i]==b[j]) with a unique joint code
    ab = a * (b.max() + 1) + b

    # Calculate the counts for each type of pairing
    n = len(a)  # also len(b)
    n11 = sizes2count(np.bincount(ab), n)
    n10 = sizes2count(np.bincount(a), n) - n11
    n01 = sizes2count(np.bincount(b), n) - n11
    n00 = (n**2 - n) // 2 - n11 - n10 - n01

    return n11, n10, n01, n00
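The joint label ab must encode each (a, b) combination uniquely; note that the multiplier needs to be b.max() + 1 rather than b.max() (otherwise e.g. the combinations (1, 0) and (0, b.max()) collide). A standalone sketch of the encoding and of recovering n11 from the joint cluster sizes:

```python
import numpy as np

a = np.array([0, 0, 0, 1, 1])
b = np.array([0, 0, 1, 1, 1])

ab = a * (b.max() + 1) + b                 # unique code per (a, b) combination
_, sizes = np.unique(ab, return_counts=True)

# pairs in the same cluster in both = sum over joint clusters of s*(s-1)/2,
# which is (sum of s^2 - n) / 2 since the sizes sum to n
n11 = int((np.inner(sizes, sizes) - len(a)) // 2)
print(n11)  # → 2: the pairs (0, 1) and (3, 4)
```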

Sometimes the O(n log(n)) algorithm is faster than the linear one, because the linear one needs roughly max(a)*max(b) storage for the bincount. The naming can probably be improved; I'm not really familiar with the terminology.

To compare two clusterings A and B in linear time:

  1. Iterate through the clusters in A. Let the size of each cluster be a_i. The total number of pairs in the same cluster in A is the sum of all a_i*(a_i-1)/2.
  2. Partition each A-cluster according to its cluster in B. Let the size of each partition be b_j. The total number of pairs in the same cluster in both A and B is the sum of all b_j*(b_j-1)/2.
  3. The difference between the two is the total number of pairs that are in the same cluster in A but not in B.
  4. Iterate through the clusters in B to get the total number of pairs in the same cluster in B, and subtract the result of (2) from it to get the pairs in the same cluster in B but not in A.
  5. The sum of the above 3 results is the number of pairs that are in the same cluster in either A or B. Subtract it from n*(n-1)/2 to get the total number of pairs that are in different clusters in both A and B.
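The steps above can be sketched in pure Python (pair_counts_partition and same_pairs are hypothetical names; the clusterings are given as label sequences):

```python
from collections import Counter

def pair_counts_partition(a, b):
    n = len(a)
    total = n * (n - 1) // 2

    def same_pairs(labels):
        # steps 1 and 4: sum of size*(size-1)/2 over the clusters
        return sum(s * (s - 1) // 2 for s in Counter(labels).values())

    pairs_a = same_pairs(a)
    pairs_b = same_pairs(b)

    # step 2: partition by joint (A-cluster, B-cluster) label
    parts = Counter(zip(a, b))
    n11 = sum(s * (s - 1) // 2 for s in parts.values())

    n10 = pairs_a - n11              # step 3
    n01 = pairs_b - n11              # step 4
    n00 = total - n11 - n10 - n01    # step 5
    return n11, n10, n01, n00

print(pair_counts_partition([0, 0, 1], [0, 1, 1]))  # → (0, 1, 1, 1)
```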

The partitioning in step (2) is done by building a dictionary mapping item -> cluster for B and then looking up each item of each A-cluster. If you're cross-comparing lots of clusterings, you can save a lot of time by computing these maps just once per clustering and keeping them around.

You do not need to enumerate and count the pairs.

Instead, compute a confusion matrix containing the intersection sizes of each cluster of the first clustering with every cluster of the second clustering (this is one loop over all objects), then compute the number of pairs from this matrix using the equation n*(n-1)/2.

This reduces your runtime from O(n^2) to O(n), so it should give you a considerable speedup.
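A minimal numpy sketch of this confusion-matrix approach (pair_counts_confusion is a hypothetical name; labels are assumed to be small non-negative integers so they can index the table directly):

```python
import numpy as np

def pair_counts_confusion(a, b):
    n = len(a)
    # contingency table: C[i, j] = |cluster i of a  ∩  cluster j of b|
    C = np.zeros((a.max() + 1, b.max() + 1), dtype=np.int64)
    np.add.at(C, (a, b), 1)

    def pairs(x):
        # number of unordered pairs inside groups with sizes x
        return int((np.sum(x.astype(np.int64) ** 2) - np.sum(x)) // 2)

    n11 = pairs(C)                     # same cluster in both
    n10 = pairs(C.sum(axis=1)) - n11   # same in a only (row sums = a sizes)
    n01 = pairs(C.sum(axis=0)) - n11   # same in b only (col sums = b sizes)
    n00 = n * (n - 1) // 2 - n11 - n10 - n01
    return n11, n10, n01, n00

a = np.array([0, 0, 1])
b = np.array([0, 1, 1])
print(pair_counts_confusion(a, b))  # → (0, 1, 1, 1)
```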
