How to quickly count equal elements in a numpy.array?

I have a NumPy array

leafs = np.array([[1,2,3],[1,2,4],[2,3,4],[4,2,1]])

I would like to compute, for each pair of rows, the number of positions at which they hold the same element.

In this case I would get the following 4x4 matrix proximity:

proximity = array([[3, 2, 0, 1],
                   [2, 3, 1, 1],
                   [0, 1, 3, 0],
                   [1, 1, 0, 3]])

This is the code that I am currently using.

n = len(leafs)
proximity = []

for i in range(n):
    print(i)
    proximity.append(np.apply_along_axis(lambda x: sum(x == leafs[i, :]),
                                         axis=1, arr=leafs))

I need a faster solution.

EDIT: The accepted solution does not work in this example (likely because the intermediate boolean array of shape (7210, 7210, 1000) would need roughly 52 GB, and some older NumPy versions returned the scalar False instead of raising an error when such an element-wise comparison failed):

>>> type(f.leafs)
<class 'numpy.ndarray'>
>>> f.leafs.shape
(7210, 1000)
>>> f.leafs.dtype
dtype('int64')

>>> f.leafs.reshape(7210, 1, 1000) == f.leafs.reshape(1, 7210, 1000)
False
>>> f.leafs
array([[ 19,  32,  16, ..., 143, 194, 157],
       [ 19,  32,  16, ..., 143, 194, 157],
       [ 19,  32,  16, ..., 143, 194, 157],
       ..., 
       [139,  32,  16, ...,   5, 194, 157],
       [170,  32,  16, ...,   5, 194, 157],
       [170,  32,  16, ...,   5, 194, 157]])
>>> 

Here's one way, using broadcasting. Be warned: the temporary array eq has shape (nrows, nrows, ncols), so if nrows is 4000 and ncols is 1000, eq will require 16 GB of memory.

In [38]: leafs
Out[38]: 
array([[1, 2, 3],
       [1, 2, 4],
       [2, 3, 4],
       [4, 2, 1]])

In [39]: nrows, ncols = leafs.shape

In [40]: eq = leafs.reshape(nrows,1,ncols) == leafs.reshape(1,nrows,ncols)

In [41]: proximity = eq.sum(axis=-1)

In [42]: proximity
Out[42]: 
array([[3, 2, 0, 1],
       [2, 3, 1, 1],
       [0, 1, 3, 0],
       [1, 1, 0, 3]])

Also note that this solution is inefficient: proximity is symmetric and its diagonal always equals ncols, but this solution computes the full array, so it does more than twice as much work as necessary.
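Since the matrix is symmetric and the diagonal is known in advance, a sketch of a variant that computes only the upper triangle and mirrors it could look like this (proximity_upper is a hypothetical helper, not part of the original answer):

```python
import numpy as np

def proximity_upper(leafs):
    """Compute only the upper triangle, then mirror it into the lower one."""
    nrows, ncols = leafs.shape
    # the diagonal is always ncols (every row fully matches itself)
    proximity = np.full((nrows, nrows), ncols, dtype=np.intp)
    for i in range(nrows):
        # compare row i only against the rows below it
        counts = (leafs[i + 1:] == leafs[i]).sum(axis=1)
        proximity[i, i + 1:] = counts
        proximity[i + 1:, i] = counts  # mirror into the lower triangle
    return proximity

leafs = np.array([[1, 2, 3], [1, 2, 4], [2, 3, 4], [4, 2, 1]])
print(proximity_upper(leafs))
```

This roughly halves the number of row comparisons, at the cost of a Python-level loop.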

Warren Weckesser offered a very beautiful solution using broadcasting. However, even a straightforward approach using a loop can have comparable performance. np.apply_along_axis is slow in your initial solution because it does not take advantage of vectorization. The following fixes that:

def proximity_1(leafs):
    n = len(leafs)
    proximity = np.zeros((n,n))
    for i in range(n):
        proximity[i] = (leafs == leafs[i]).sum(1)  
    return proximity

You could also use a list comprehension to make the above code more concise. The difference is that np.apply_along_axis loops through the rows in a non-optimized manner, while leafs == leafs[i] takes advantage of NumPy's vectorized speed.
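For instance, the list-comprehension form of proximity_1 could be sketched as:

```python
import numpy as np

leafs = np.array([[1, 2, 3], [1, 2, 4], [2, 3, 4], [4, 2, 1]])
# each (leafs == row) comparison is vectorized across the whole array
proximity = np.array([(leafs == row).sum(axis=1) for row in leafs])
print(proximity)
```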

The solution from Warren Weckesser truly shows NumPy's beauty. However, it includes the overhead of creating an intermediate 3-D array of size nrows*nrows*ncols. So if you have large data, the simple loop might be more efficient.

Here's an example. Below is the code offered by Warren Weckesser, wrapped in a function. (I don't know what the code copyright rules are here, so I assume this reference is enough :) )

def proximity_2(leafs):
    nrows, ncols = leafs.shape    
    eq = leafs.reshape(nrows,1,ncols) == leafs.reshape(1,nrows,ncols)
    proximity = eq.sum(axis=-1)  
    return proximity

Now let's evaluate the performance on an array of random integers of size 10000 x 100.

leafs = np.random.randint(1,100,(10000,100))
%time proximity_1(leafs)
>> 28.6 s
%time proximity_2(leafs)
>> 35.4 s

I ran both examples in an IPython environment on the same machine.
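A possible middle ground between the two (a sketch, not from either original answer) is to broadcast in chunks of rows, so the temporary boolean array only has shape chunk_size x nrows x ncols instead of nrows x nrows x ncols:

```python
import numpy as np

def proximity_chunked(leafs, chunk_size=256):
    """Broadcasting applied one block of rows at a time to bound memory use."""
    nrows, ncols = leafs.shape
    proximity = np.empty((nrows, nrows), dtype=np.intp)
    for start in range(0, nrows, chunk_size):
        stop = min(start + chunk_size, nrows)
        # temporary has shape (stop - start, nrows, ncols)
        eq = leafs[start:stop, None, :] == leafs[None, :, :]
        proximity[start:stop] = eq.sum(axis=-1)
    return proximity

leafs = np.array([[1, 2, 3], [1, 2, 4], [2, 3, 4], [4, 2, 1]])
print(proximity_chunked(leafs, chunk_size=2))
```

Tuning chunk_size trades memory for fewer Python-level iterations; chunk_size=1 degenerates to the simple loop, chunk_size=nrows to the full broadcast.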
