High performance all-to-all comparison of vectors in Python
First, some background: several methods for comparing clusterings rely on so-called pair counting. We have two flat clustering membership vectors, a and b, over the same n entities. In pair counting, for all possible pairs of entities we check whether they belong to the same cluster in both clusterings, to the same cluster in a but different clusters in b, the opposite, or to different clusters in both. This way we get 4 counts, let's call them n11, n10, n01, n00. These are the input for different metrics.
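As a tiny illustration of the definition (a pure-Python sketch with made-up vectors, not part of the benchmark):

```python
from itertools import combinations

# two flat clusterings over n = 4 entities (illustrative values)
a = [0, 0, 1, 1]
b = [0, 0, 0, 1]

n11 = n10 = n01 = n00 = 0
for i, j in combinations(range(len(a)), 2):
    same_a = a[i] == a[j]
    same_b = b[i] == b[j]
    if same_a and same_b:
        n11 += 1        # same cluster in both
    elif same_a:
        n10 += 1        # same in a, different in b
    elif same_b:
        n01 += 1        # different in a, same in b
    else:
        n00 += 1        # different in both

print(n11, n10, n01, n00)  # 1 1 2 2
```

The four counts always sum to n*(n-1)/2, the total number of pairs.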
When the number of entities is around 10,000 and the number of clusterings to compare is dozens or more, performance becomes an issue, as the number of pairs is ~10^8 for each comparison, and for an all-to-all comparison of clusterings this needs to be performed ~10^4 times.
With a naive Python implementation it took forever, so I turned to Cython and numpy. This way I could push the time for one comparison down to around 0.9-3.0 seconds, which still means half a day of runtime in my case. I am wondering if you see any possibility for a performance improvement, with some clever algorithm or C library, or whatever.
Here are my attempts:
1) This one counts without allocating huge arrays for all the pairs; it takes 2 membership vectors a1, a2 of length n, and returns the counts:
cimport cython
import numpy as np
cimport numpy as np

ctypedef np.int32_t DTYPE_t

@cython.boundscheck(False)
def pair_counts(
        np.ndarray[DTYPE_t, ndim = 1] a1,
        np.ndarray[DTYPE_t, ndim = 1] a2,
    ):
    cdef unsigned int a1s = a1.shape[0]
    cdef unsigned int a2s = a2.shape[0]
    cdef unsigned int i, j, j0
    cdef unsigned int n11, n10, n01, n00
    n11 = n10 = n01 = n00 = 0
    for i in range(0, a1s - 1):
        j0 = i + 1
        for j in range(j0, a2s):
            if a1[i] == a1[j] and a2[i] == a2[j]:
                n11 += 1
            elif a1[i] == a1[j]:
                n10 += 1
            elif a2[i] == a2[j]:
                n01 += 1
            else:
                n00 += 1
    return n11, n10, n01, n00
2) This one first calculates comembership vectors (of length n * (n-1) / 2, one element for each entity pair) for each of the 2 clusterings, then calculates the counts from these vectors. It allocates ~20-40M of memory at each comparison, but interestingly, it is faster than the previous one. Note: c is a wrapper class around a clustering, having the usual membership vector, but also a c.members dict which contains the indices of entities for each cluster as numpy arrays.
cimport cython
import numpy as np
cimport numpy as np

@cython.boundscheck(False)
def comembership(c):
    """
    Returns the comembership vector, where each value tells whether one
    pair of entities belongs to the same group (1) or not (0).
    """
    cdef int n = len(c.memberships)
    cdef int cnum = c.cnum
    cdef int ri, ci, i, ii, j, li
    cdef unsigned char[:] cmem = \
        np.zeros((int(n * (n - 1) / 2), ), dtype = np.uint8)
    for ci in range(cnum):
        # here we use the members dict to have the indices of entities
        # in cluster (ci), as a numpy array (mi)
        mi = c.members[ci]
        for i in range(len(mi) - 1):
            ii = mi[i]
            # this is only to convert the indices of an n x n matrix
            # to the indices of a 1 x (n * (n-1) / 2) vector;
            # note: ii * (ii + 3) must be halved as a whole, splitting
            # the division gives an off-by-one error for odd ii
            ri = n * ii - ii * (ii + 3) // 2 - 1
            for j in mi[i+1:]:
                # also here, adding j only to have the correct index
                li = ri + j
                cmem[li] = 1
    return np.array(cmem)

def pair_counts(c1, c2):
    p1 = comembership(c1)
    p2 = comembership(c2)
    n = len(c1.memberships)
    a11 = p1 * p2
    n11 = a11.sum()
    n10 = (p1 - a11).sum()
    n01 = (p2 - a11).sum()
    # the total number of pairs is n * (n - 1) / 2
    n00 = n * (n - 1) // 2 - n10 - n01 - n11
    return n11, n10, n01, n00
3) This is a pure numpy solution, creating an n x n boolean array of comemberships of entity pairs. The inputs are the membership vectors (a1, a2).
def pair_counts(a1, a2):
    n = len(a1)
    cmem1 = a1.reshape([n, 1]) == a1.reshape([1, n])
    cmem2 = a2.reshape([n, 1]) == a2.reshape([1, n])
    # the diagonal contributes n self-pairs, and each pair is counted twice
    n11 = int(((cmem1 & cmem2).sum() - n) / 2)
    n10 = int((cmem1.sum() - n) / 2) - n11
    n01 = int((cmem2.sum() - n) / 2) - n11
    n00 = n * (n - 1) // 2 - n11 - n10 - n01
    return n11, n10, n01, n00
Edit: example data
import numpy as np

a1 = np.random.randint(0, 1868, 14440, dtype = np.int32)
a2 = np.random.randint(0, 484, 14440, dtype = np.int32)

# to have the members dicts used in example 2:

def get_cnum(a):
    """
    Returns the number of clusters.
    """
    return len(np.unique(a))

def get_members(a):
    """
    Returns a dict with cluster numbers as keys and member entities
    as sorted numpy arrays.
    """
    members = {i: [] for i in range(a.max() + 1)}
    for idx, label in enumerate(a):
        members[label].append(idx)
    return {label: np.array(sorted(idxs), dtype = np.int64)
            for label, idxs in members.items()}

members1 = get_members(a1)
members2 = get_members(a2)
cnum1 = get_cnum(a1)
cnum2 = get_cnum(a2)
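For a quick feel of the members dict layout, here is a tiny hand-checkable example (a plain-loop equivalent of get_members; the vector is illustrative):

```python
import numpy as np

a = np.array([0, 2, 0, 1, 2], dtype=np.int32)

# build the cluster -> member-indices dict with a plain loop
members = {i: [] for i in range(a.max() + 1)}
for idx, label in enumerate(a):
    members[label].append(idx)
members = {label: np.array(sorted(idxs)) for label, idxs in members.items()}

# cluster 0 holds entities 0 and 2, cluster 1 holds 3, cluster 2 holds 1 and 4
```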
The approach based on sorting has merit, but can be done a lot simpler:
import numpy as np

def pair_counts(a, b):
    n = a.shape[0]  # also b.shape[0]
    counts_a = np.bincount(a)
    counts_b = np.bincount(b)
    sorter_a = np.argsort(a)
    n11 = 0
    same_a_offset = np.cumsum(counts_a)
    for indices in np.split(sorter_a, same_a_offset):
        b_check = b[indices]
        n11 += np.count_nonzero(b_check == b_check[:, None])
    n11 = (n11 - n) // 2
    n10 = (np.sum(counts_a**2) - n) // 2 - n11
    n01 = (np.sum(counts_b**2) - n) // 2 - n11
    # the total number of pairs is n * (n - 1) / 2
    n00 = n * (n - 1) // 2 - n11 - n10 - n01
    return n11, n10, n01, n00
If this method is coded efficiently in Cython there's another speedup (probably ~20x) to be gained.
Edit:
I found a way to completely vectorize the procedure and lower the time complexity:
def sizes2count(a, n):
    return (np.inner(a, a) - n) // 2

def pair_counts_vec_nlogn(a, b):
    # Size of "11" clusters (a[i]==a[j] & b[i]==b[j]);
    # b.max() + 1 guarantees distinct (a, b) combinations get distinct labels
    ab = a * (b.max() + 1) + b  # make sure max(a)*(max(b)+1) fits the dtype!
    _, sizes = np.unique(ab, return_counts=True)
    # Calculate the counts for each type of pairing
    n = len(a)  # also len(b)
    n11 = sizes2count(sizes, n)
    n10 = sizes2count(np.bincount(a), n) - n11
    n01 = sizes2count(np.bincount(b), n) - n11
    n00 = n * (n - 1) // 2 - n11 - n10 - n01
    return n11, n10, n01, n00

def pair_counts_vec_linear(a, b):
    # Label "11" clusters (a[i]==a[j] & b[i]==b[j])
    ab = a * (b.max() + 1) + b
    # Calculate the counts for each type of pairing
    n = len(a)  # also len(b)
    n11 = sizes2count(np.bincount(ab), n)
    n10 = sizes2count(np.bincount(a), n) - n11
    n01 = sizes2count(np.bincount(b), n) - n11
    n00 = n * (n - 1) // 2 - n11 - n10 - n01
    return n11, n10, n01, n00
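As a sanity check, a self-contained version of the n log n variant (restated here with a * (b.max() + 1) + b so that distinct label combinations cannot collide, and with n*(n-1)/2 as the total pair count) can be compared against brute-force pair enumeration:

```python
import numpy as np
from itertools import combinations

def sizes2count(sizes, n):
    # number of unordered pairs inside groups of the given sizes
    return (np.inner(sizes, sizes) - n) // 2

def pair_counts_vec_nlogn(a, b):
    ab = a * (b.max() + 1) + b        # unique label per (a, b) combination
    _, sizes = np.unique(ab, return_counts=True)
    n = len(a)
    n11 = sizes2count(sizes, n)
    n10 = sizes2count(np.bincount(a), n) - n11
    n01 = sizes2count(np.bincount(b), n) - n11
    n00 = n * (n - 1) // 2 - n11 - n10 - n01
    return n11, n10, n01, n00

def pair_counts_brute(a, b):
    counts = [0, 0, 0, 0]             # n11, n10, n01, n00
    for i, j in combinations(range(len(a)), 2):
        counts[2 * (a[i] != a[j]) + (b[i] != b[j])] += 1
    return tuple(counts)

rng = np.random.default_rng(0)
a = rng.integers(0, 10, 200)
b = rng.integers(0, 7, 200)
assert pair_counts_vec_nlogn(a, b) == pair_counts_brute(a, b)
```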
Sometimes the O(n log(n)) algorithm is faster than the linear one, because the linear one uses max(a)*max(b) storage. Naming can probably be improved; I'm not really familiar with the terminology.
To compare two clusterings A and B in linear time:

1) Iterate through the clusters of A. Let the size of each cluster be a_i. The total number of pairs in the same cluster in A is the total of all a_i*(a_i-1)/2.
2) Partition each A-cluster according to the clusters of B. Let the size of each partition be b_j. The total number of pairs in the same cluster in both A and B is the total of all b_j*(b_j-1)/2. Subtracting this from the result of (1) gives the pairs in the same cluster in A but not in B.
3) Iterate through the clusters of B to get the total number of pairs in the same cluster in B, and subtract the result of (2) to get the pairs in the same cluster in B but not in A.

The partitioning in step (2) is done by making a dictionary mapping item -> cluster for B and then looking up each item in each A-cluster. If you're cross-comparing lots of clusterings, you can save a lot of time by computing these maps just once for each clustering and keeping them around.
You do not need to enumerate and count the pairs.
Instead, compute a confusion matrix containing the intersection sizes of each cluster of the first clustering with every cluster of the second clustering (this is one loop over all objects), then compute the number of pairs from this matrix using the equation n*(n-1)/2.
This reduces your runtime from O(n^2) to O(n), so it should give you a considerable speedup.
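A sketch of this idea with numpy (the contingency-table construction is my own illustration, not code from the answer):

```python
import numpy as np

def pair_counts_confusion(a, b):
    n = len(a)
    # C[i, j] = number of entities in cluster i of a and cluster j of b;
    # filling it is a single pass over all objects
    C = np.zeros((a.max() + 1, b.max() + 1), dtype=np.int64)
    np.add.at(C, (a, b), 1)
    pairs = lambda x: np.sum(x * (x - 1) // 2)
    n11 = pairs(C)                    # same cluster in both
    n10 = pairs(C.sum(axis=1)) - n11  # same in a only (row sums = a's cluster sizes)
    n01 = pairs(C.sum(axis=0)) - n11  # same in b only
    n00 = n * (n - 1) // 2 - n11 - n10 - n01
    return n11, n10, n01, n00

a = np.array([0, 0, 1, 1])
b = np.array([0, 0, 0, 1])
pair_counts_confusion(a, b)  # (1, 1, 2, 2)
```

The matrix has one cell per pair of cluster labels rather than per pair of entities, which is why the cost drops from quadratic to linear in n.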