
Condensed 1D numpy array to 2D Hamming distance matrix

I am looking for a reliable way to convert a condensed Hamming distance array generated with the scipy.spatial.distance.pdist function into its corresponding 2D Hamming distance matrix. I am aware of the scipy.spatial.distance.squareform function. However, I am computing Hamming distances for up to 100,000 x 100,000 matrices, which results in a MemoryError in Python.

I am looking for a way to convert the condensed matrix into its square form on a row-by-row basis. Does anyone know of a reliable (and possibly fast) implementation using NumPy and/or related packages?

I need to perform numpy.sum computations on each row but cannot afford to store the full N x N matrix in memory.

Currently, I am using a nested loop to iterate over my input matrix and calculate the distances "manually".

import numpy
import scipy.spatial.distance

identity = 0.7
hamming_sum = numpy.zeros(msa_mat.shape[0], dtype=numpy.float64)
hamming_dist = numpy.zeros(msa_mat.shape[0], dtype=numpy.float64)
for i, row1 in enumerate(msa_mat):
    hamming_dist.fill(0)
    for j, row2 in enumerate(msa_mat):
        if i != j:
            hamming_dist[j] = scipy.spatial.distance.hamming(row1, row2)
    # count how many rows lie within the identity threshold of row i (self included)
    hamming_sum[i] = numpy.sum(numpy.where(hamming_dist < (1 - identity), 1, 0), axis=0)

Edit 1

My data looks something like the following matrix:

>>> a = numpy.array([1, 2, 3, 4, 5, 4, 5, 4, 2, 7, 9, 4, 1, 5, 6, 2, 3, 6], dtype=float).reshape(3, 6)
>>> a
array([[ 1.,  2.,  3.,  4.,  5.,  4.],
       [ 5.,  4.,  2.,  7.,  9.,  4.],
       [ 1.,  5.,  6.,  2.,  3.,  6.]])

I would like to compute the Hamming distance for this matrix. For small matrices, this can easily be done using the cdist command in SciPy, returning a result like the following:

>>> cdist(a, a, 'hamming')
array([[ 0.        ,  0.83333333,  0.83333333],
       [ 0.83333333,  0.        ,  1.        ],
       [ 0.83333333,  1.        ,  0.        ]])

However, in cases with much larger matrices, this raises a MemoryError in Python.

I am aware that in such cases I can calculate the Hamming distances using the pdist command. This returns the distances for the upper triangle in a 1D array.

>>> pdist(a, 'hamming')
array([ 0.83333333,  0.83333333,  1.        ])

My issue is that I do not know how to reconstruct the cdist matrix from the pdist result on a per-row basis.

I am aware of the squareform function but that again raises MemoryErrors for large matrices.
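
To make the goal concrete, the following is roughly the kind of per-row reconstruction I have in mind (only a rough sketch; the index arithmetic assumes the row-major upper-triangle ordering that pdist uses, where the pair (i, j) with i < j sits at position n*i - i*(i+1)//2 + (j - i - 1)):

import numpy as np
from scipy.spatial.distance import pdist

def condensed_row(dists, n, i):
    # Rebuild row i of the virtual n x n distance matrix from the condensed
    # 1D pdist output, without ever materializing the full matrix.
    row = np.zeros(n)
    j = np.arange(n)
    upper = j > i                        # entries to the right of the diagonal
    row[upper] = dists[n*i - i*(i+1)//2 + (j[upper] - i - 1)]
    lower = j < i                        # symmetric pairs (j, i) left of the diagonal
    jl = j[lower]
    row[lower] = dists[n*jl - jl*(jl+1)//2 + (i - jl - 1)]
    return row

dists = pdist(a, 'hamming')
print(condensed_row(dists, a.shape[0], 0))   # should match cdist(a, a, 'hamming')[0]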

Here's an approach using ID-based summing with np.bincount -

import numpy as np
from scipy.spatial.distance import pdist

def getdists_v1(a):
    n = a.shape[0]
    r, c = np.triu_indices(n, 1)   # row/column index of every condensed pair
    vals = pdist(a, 'hamming') < (1 - identity)
    # per-row hits from the upper and lower triangles; +1 counts the row itself
    return np.bincount(r, vals, minlength=n) + np.bincount(c, vals, minlength=n) + 1
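
As a quick sanity check (only feasible for small inputs, where squareform still fits in memory), the bincount result can be compared against a straightforward squareform-based count. This is a sketch that assumes identity is defined at module level, as in the question:

import numpy as np
from scipy.spatial.distance import pdist, squareform

identity = 0.7
a = np.random.randint(0, 4, size=(10, 6)).astype(float)

# reference: build the full square matrix and count per-row entries below the threshold
# (the zero diagonal is below the threshold, so the self-count is already included)
ref = (squareform(pdist(a, 'hamming')) < (1 - identity)).sum(axis=1)
assert np.array_equal(getdists_v1(a), ref)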

Here's another bin-based one with a focus on memory efficiency, using np.add.reduceat -

def getdists_v2(a):
    n = a.shape[0]
    nr = (n*(n-1))//2
    vals = pdist(a, 'hamming') < (1 - identity)

    # start offset of each row's block of pairs in the condensed vector
    sfidx = n*np.arange(0,n-1) - np.arange(n-1).cumsum()
    # id_arr.cumsum() yields the column index of every condensed pair
    id_arr = np.ones(nr,dtype=int)
    id_arr[sfidx[1:]] = -np.arange(n-3,-1,-1)
    c = id_arr.cumsum()

    # lower-triangle (column-wise) counts, plus 1 for the row itself
    out = np.bincount(c,vals)+1
    # upper-triangle (row-wise) counts
    out[:n-1] += np.add.reduceat(vals,sfidx)
    return out
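
A small illustration of the indexing trick, if I am reading it right: id_arr.cumsum() reproduces the column index of every condensed pair, so np.bincount(c, vals) accumulates the lower-triangle hits per row while np.add.reduceat takes care of the upper-triangle rows. For example, with n = 4:

import numpy as np

n = 4
nr = (n*(n-1))//2
sfidx = n*np.arange(0, n-1) - np.arange(n-1).cumsum()   # [0, 3, 5]: start of each pair-row
id_arr = np.ones(nr, dtype=int)
id_arr[sfidx[1:]] = -np.arange(n-3, -1, -1)
print(id_arr.cumsum())   # [1 2 3 2 3 3] -> column j of the pairs (0,1)(0,2)(0,3)(1,2)(1,3)(2,3)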

Here's another one that loops to compute the row-wise summations over the lower triangular region -

def getdists_v3(a):
    n = a.shape[0]
    r_arr = np.arange(n-1)
    cr_arr = r_arr.cumsum()
    # start offsets used to pick out row i+1's lower-triangle entries column by column
    sfidx_c = (n-1)*r_arr - cr_arr
    vals = pdist(a, 'hamming') < (1 - identity)
    out = np.zeros(n)
    for i in range(n-1):
        # lower triangle: entries (j, i+1) for j = 0..i, gathered from the condensed vector
        out[i+1] = np.count_nonzero(vals[sfidx_c[:i+1] + i])
    # upper triangle: contiguous per-row blocks summed in one shot
    out[:n-1] += np.add.reduceat(vals, n*r_arr - cr_arr)
    out[:] += 1   # count the row itself
    return out

One way to avoid the memory problem is to use cdist in batches:

import numpy as np
from scipy.spatial.distance import cdist


def count_hamming_neighbors(points, radius, batch_size=None):
    n = len(points)

    if batch_size is None:
        batch_size = min(n, 1000)

    hamming_sum = np.zeros(n, dtype=int)

    # split the rows into batches so only one (batch x n) distance block is in memory at a time
    num_full_batches, last_batch = divmod(n, batch_size)
    batches = [batch_size]*num_full_batches
    if last_batch != 0:
        batches.append(last_batch)
    for k, batch in enumerate(batches):
        i = batch_size*k
        dists = cdist(points[i:i+batch], points, metric='hamming')
        hamming_sum[i:i+batch] = (dists < radius).sum(axis=1)

    return hamming_sum

Here's a comparison to Divakar's getdists_v3(a), to ensure that we are getting the same results:

In [102]: np.random.seed(12345)

In [103]: a = np.random.randint(0, 4, size=(16, 4))

In [104]: count_hamming_neighbors(a, 0.3)
Out[104]: array([1, 1, 3, 2, 2, 1, 2, 1, 3, 2, 3, 2, 2, 1, 2, 2])

In [105]: identity = 0.7

In [106]: getdists_v3(a)
Out[106]: 
array([ 1.,  1.,  3.,  2.,  2.,  1.,  2.,  1.,  3.,  2.,  3.,  2.,  2.,
        1.,  2.,  2.])

Compare timing for a bigger array:

In [113]: np.random.seed(12345)

In [114]: a = np.random.randint(0, 4, size=(10000, 4))

In [115]: %timeit hamming_sum = count_hamming_neighbors(a, 0.3)
1 loop, best of 3: 714 ms per loop

In [116]: %timeit v3result = getdists_v3(a)
1 loop, best of 3: 1.05 s per loop

So it is a little faster. Changing the batch size affects the performance, sometimes in surprising ways:

In [117]: %timeit hamming_sum = count_hamming_neighbors(a, 0.3, batch_size=250)
1 loop, best of 3: 643 ms per loop

In [118]: %timeit hamming_sum = count_hamming_neighbors(a, 0.3, batch_size=2000)
1 loop, best of 3: 875 ms per loop

In [119]: %timeit hamming_sum = count_hamming_neighbors(a, 0.3, batch_size=125)
1 loop, best of 3: 664 ms per loop
