
Find two pairs of pairs that sum to the same value

I have a random 2d array which I make using

import numpy as np
from itertools import combinations
n = 50
A = np.random.randint(2, size=(n,n))

I would like to determine if the matrix has two pairs of pairs of rows which sum to the same row vector. I am looking for a fast method of doing this. My current method just tries all possibilities:

for pair in combinations(combinations(range(n), 2), 2):
    if np.array_equal(A[pair[0][0]] + A[pair[0][1]], A[pair[1][0]] + A[pair[1][1]]):
        print("Pair found", pair)

A method that works for n = 100 would be really great.

Based on the code in your question, and assuming you are actually looking for pairs of pairs of rows whose sums are equal to the same row vector, you could do something like this:

from itertools import combinations

def findMatchSets(A):
    B = A.transpose()
    pairs = tuple(combinations(range(len(A[0])), 2))
    # partition all row pairs by the sum of their entries in column 0 (0, 1 or 2)
    matchSets = [[i for i in pairs if B[0][i[0]] + B[0][i[1]] == z] for z in range(3)]
    for c in range(1, len(A[0])):
        # refine each equivalence set by the sums in column c
        matchSets = [[i for i in block if B[c][i[0]] + B[c][i[1]] == z] for z in range(3) for block in matchSets]
        matchSets = [block for block in matchSets if len(block) > 1]
        if not matchSets:
            return []
    return matchSets

This basically stratifies the matrix into equivalence sets that sum to the same value after considering one column, then two columns, then three, and so on, until it either reaches the last column or there is no equivalence set left with more than one member (i.e. no such pair of pairs exists). This works fine for 100x100 arrays, mostly because the odds of two pairs of rows summing to the same row vector are incredibly small when n is large (n*(n+1)/2 combinations compared to 3^n possible vector sums).
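
For instance, a minimal usage sketch (the seed and size are arbitrary, for illustration only):

import numpy as np

np.random.seed(0)  # arbitrary seed, just to make the run reproducible
A = np.random.randint(2, size=(50, 50))

sets = findMatchSets(A)
# each surviving block holds index pairs whose rows sum to the same vector
for block in sets:
    print("pairs with identical row-vector sums:", block)
if not sets:
    print("no pair of pairs found")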

UPDATE

The updated code below allows searching for pairs among all n-sized subsets of rows, as requested. The default is n=2, as per the original question:

def findMatchSets(A, n=2):
    B = A.transpose()
    pairs = tuple(combinations(range(len(A[0])), n))
    # partition all n-subsets of rows by the sum of their entries in column 0 (0..n)
    matchSets = [[i for i in pairs if sum([B[0][i[j]] for j in range(n)]) == z] for z in range(n + 1)]
    for c in range(1, len(A[0])):
        matchSets = [[i for i in block if sum([B[c][i[j]] for j in range(n)]) == z] for z in range(n + 1) for block in matchSets]
        matchSets = [block for block in matchSets if len(block) > 1]
        if not matchSets:
            return []
    return matchSets
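
For instance, a minimal sketch searching for two triples of rows with the same vector sum (the size is arbitrary, for illustration only):

import numpy as np

A = np.random.randint(2, size=(30, 30))

# two 3-subsets of rows summing to the same row vector, if any exist
triples = findMatchSets(A, n=3)
print(triples if triples else "no matching triples")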

Here is a pure numpy solution; no extensive timings, but I have to push n up to 500 before I can see my cursor blink once before it completes. It is memory intensive though, and will fail due to memory requirements for much larger n. Either way, I get the intuition that the odds of finding such a vector decline geometrically for large n anyway.

import numpy as np

n = 100
A = np.random.randint(2, size=(n,n)).astype(np.int8)

def base3(a):
    """
    pack the last axis of an array in base 3
    40 base 3 numbers per uint64
    """
    S = np.array_split(a, a.shape[-1]//40+1, axis=-1)
    R = np.zeros(shape=a.shape[:-1]+(len(S),), dtype = np.uint64)
    for i in range(len(S)):
        s = S[i]
        r = R[...,i]
        for j in range(s.shape[-1]):
            r *= 3
            r += s[...,j]
    return R

def unique_count(a):
    """returns counts of unique elements"""
    unique, inverse = np.unique(a, return_inverse=True)
    count = np.zeros(len(unique), dtype=int)
    np.add.at(count, inverse, 1)
    return unique, count

def voidview(arr):
    """view the last axis of an array as a void object. can be used as a faster form of lexsort"""
    return np.ascontiguousarray(arr).view(np.dtype((np.void, arr.dtype.itemsize * arr.shape[-1]))).reshape(arr.shape[:-1])

def has_pairs_of_pairs(A):
    #optional; convert rows to base 3
    A = base3(A)
    #precompute sums over a lower triangular set of all combinations
    rowsums = sum(A[I] for I in np.tril_indices(n,-1))
    #count the number of times each row occurs by sorting
    #note that this is not quite O(n log n), since the cost of handling each row is also a function of n
    unique, count = unique_count(voidview(rowsums))
    #print if any pairs of pairs exist;
    #computing their indices is left as an exercise for the reader (one sketch follows below)
    return np.any(count>1)

from time import perf_counter
t = perf_counter()
for i in range(100):
    print(has_pairs_of_pairs(A))
print(perf_counter() - t)

Edit: included base-3 packing; n=2000 is now feasible, taking about 2 GB of memory and a few seconds of processing.

Edit: added some timings; n=100 takes only 5 ms per call on my i7m.
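
As a sketch of that exercise, one way to recover the colliding pairs could reuse base3 and voidview from above (pair_collision_indices is a hypothetical helper; assumes numpy >= 1.9 for return_counts):

import numpy as np

def pair_collision_indices(A):
    """Sketch: return the row-index pairs whose pairwise sums collide."""
    n = len(A)
    i_idx, j_idx = np.tril_indices(n, -1)
    packed = base3(A)
    # digit-wise base-3 sums, as in has_pairs_of_pairs, viewed as hashable keys
    keys = voidview(packed[i_idx] + packed[j_idx])
    unique, inverse, counts = np.unique(keys, return_inverse=True, return_counts=True)
    dup = counts[inverse] > 1  # combinations whose sum occurs more than once
    return list(zip(i_idx[dup], j_idx[dup]))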

Your current code does not test for pairs of rows that sum to the same value.

Assuming that is actually what you want, it is best to stick with pure numpy. This generates the indices of all rows that have equal sums.

import numpy as np

n = 100
A = np.random.randint(2, size=(n,n))

rowsum = A.sum(axis=1)

unique, inverse = np.unique(rowsum, return_inverse = True)

count = np.zeros_like(unique)
np.add.at(count, inverse, 1)

for p in unique[count > 1]:
    print(p, np.nonzero(rowsum == p)[0])

Here is a 'lazy' approach, which scales up to n=10000, using 'only' 4 GB of memory, and completing in about 10 s per call. Worst-case complexity is O(n^3), but for random data the expected performance is O(n^2). At first sight, it might seem you need O(n^3) operations: each row combination needs to be produced and checked at least once. But we need not look at an entire row. Rather, we can apply an early-exit strategy to the comparison of row pairs, dropping them as soon as it is clear they are of no use to us; and for random data, we can typically draw that conclusion long before we have considered all the columns in a row.

import numpy as np
from functools import reduce

n = 10
#also works for non-square A
A = np.random.randint(2, size=(n*2,n)).astype(np.int8)
#force the inclusion of some hits, to keep our algorithm on its toes
##A[0] = A[1]


def base_pack_lazy(a, base, dtype=np.uint64):
    """
    pack the last axis of an array as minimal base representation
    lazily yields packed columns of the original matrix
    """
    a = np.ascontiguousarray( np.rollaxis(a, -1))
    init = np.zeros(a.shape[1:], dtype)
    packing = int(np.dtype(dtype).itemsize * 8 / (float(base) / 2))
    for columns in np.array_split(a, (len(a)-1)//packing+1):
        yield reduce(
            lambda acc,inc: acc*base+inc,
            columns,
            init)

def unique_count(a):
    """returns counts of unique elements"""
    unique, inverse = np.unique(a, return_inverse=True)
    count = np.zeros(len(unique), dtype=int)
    np.add.at(count, inverse, 1)        #note; this scatter operation requires numpy 1.8; use a sparse matrix otherwise!
    return unique, count, inverse

def has_identical_row_sums_lazy(A, combinations_index):
    """
    compute the existence of combinations of rows summing to the same vector,
    given an nxm matrix A and an index matrix specifying all combinations

    naively, we need to compute the sum of each row combination at least once, giving n^3 computations
    however, this isn't strictly required; we can consider the columns lazily, giving an early exit opportunity
    all nicely vectorized of course
    """

    multiplicity, combinations = combinations_index.shape
    #list of indices into combinations_index, denoting possibly interacting combinations
    active_combinations = np.arange(combinations, dtype=np.uint32)

    for packed_column in base_pack_lazy(A, base=multiplicity+1):       #loop over packed cols
        #compute rowsums only for a fixed number of columns at a time.
        #this is O(n^2) rather than O(n^3), and after considering the first column,
        #we can typically already exclude almost all rowpairs
        partial_rowsums = sum(packed_column[I[active_combinations]] for I in combinations_index)
        #find duplicates in this column
        unique, count, inverse = unique_count(partial_rowsums)
        #prune those pairs which we can exclude as having different sums, based on columns inspected thus far
        active_combinations = active_combinations[count[inverse] > 1]
        #early exit; no pairs
        if len(active_combinations)==0:
            return False
    return True

def has_identical_triple_row_sums(A):
    n = len(A)
    idx = np.array( [(i,j,k)
        for i in range(n)
            for j in range(n)
                for k in range(n)
                    if i<j and j<k], dtype=np.uint16)
    idx = np.ascontiguousarray( idx.T)
    return has_identical_row_sums_lazy(A, idx)

def has_identical_double_row_sums(A):
    n = len(A)
    idx = np.array(np.tril_indices(n,-1), dtype=np.int32)
    return has_identical_row_sums_lazy(A, idx)


from time import perf_counter
t = perf_counter()
for i in range(10):
    print(has_identical_double_row_sums(A))
    print(has_identical_triple_row_sums(A))
print(perf_counter() - t)

Extended as above to include the computation of sums over triplets of rows, as asked. For n=100, this still takes only about 0.2 seconds.
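
The same machinery generalizes to any subset size; a hedged sketch (has_identical_k_row_sums is a hypothetical name, building the index with itertools.combinations instead of the nested loops above):

from itertools import combinations
import numpy as np

def has_identical_k_row_sums(A, k):
    # build the (k x C(n, k)) index matrix that the lazy routine expects
    n = len(A)
    idx = np.array(list(combinations(range(n), k)), dtype=np.uint16).T
    return has_identical_row_sums_lazy(A, np.ascontiguousarray(idx))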

Edit: some cleanup; edit2: some more cleanup.

If all you need to do is determine whether such a pair exists (two rows sharing the same scalar sum), you can use:

exists_unique = np.unique(A.sum(axis=1)).size != A.shape[0]
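
A minimal usage sketch (note that this one-liner compares scalar row sums, not full row vectors; the seed is arbitrary, just for reproducibility):

import numpy as np

np.random.seed(42)  # arbitrary seed, illustration only
A = np.random.randint(2, size=(100, 100))

# if any two of the 100 rows share a scalar sum, unique() shrinks the result
exists_unique = np.unique(A.sum(axis=1)).size != A.shape[0]
print(exists_unique)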
