
Element-wise comparison of numpy arrays (Python)

I have a question about the numpy array below.

I have a dataset with 50 rows and 15 columns, and I created a numpy array from it:

x=x.to_numpy()

My aim is to compare each row with every other row (element-wise, excluding itself) and count the rows for which there exists another row whose values are all smaller.

Sample table:

a b c         
1 6 2
2 6 8
4 7 12
7 9 13

For example, for rows 1 and 2 there is no such row. But for rows 3 and 4 there is a row (all values of rows 1 and 2 are smaller than theirs). So the algorithm should return the count 2 (for rows 3 and 4).

What Python code would produce this result?

I have tried a bunch of code, but could not reach a proper solution. If anyone has an idea, I would appreciate it.

Just use two loops and compare:

import numpy as np

def f(x):
    count = 0

    for i in range(x.shape[0]):
        for j in range(x.shape[0]):
            if i == j:
                continue
            if np.all(x[i] > x[j]):
                count += 1
                break

    return count

x = np.array([[1, 6, 2], [2, 6, 8], [4, 7, 12], [7, 9, 13]])
print(f(x))

Edit: Pure-numpy solution

(x.reshape(-1, 1, 3) > x.reshape(1, -1, 3)).all(axis=2).any(axis=1).sum()

Explanation

The hard part is thinking in 3D, so I start in 2D, with a simple comparison of numbers. Imagine you have x=np.array([1,2,3,4]) and you want to compare all elements of x to all other elements of x, producing a 4x4 matrix of booleans.

What you would do is reshape x as a column of values on one side, and as a row on the other. So two 2D arrays: one 4x1, the other 1x4.

Then, when performing an operation on those two arrays, broadcasting will produce a 4x4 array.

Just to visualize it, let's do an arithmetic operation instead of a comparison:

x=np.array([1,2,3,4])
x.reshape(-1,1) #is
#[[1],
# [2],
# [3],
# [4]]
x.reshape(1,-1) #is
# [ [1,2,3,4] ]
x.reshape(-1,1)*10+x.reshape(1,-1) #is therefore
# [[11, 12, 13, 14],
#  [21, 22, 23, 24],
#  [31, 32, 33, 34],
#  [41, 42, 43, 44]]

# Likewise 
x.reshape(-1,1)<x.reshape(1,-1) # is
#array([[False,  True,  True,  True],
#       [False, False,  True,  True],
#       [False, False, False,  True],
#       [False, False, False, False]])

So all we have to do is the exact same thing, but with the values being length-3 1D arrays instead of scalars:

x.reshape(-1, 1, 3) > x.reshape(1, -1, 3)

Broadcasting makes this, as in the previous example, a 2D array of all x[i]>x[j], except that x[i], x[j], and therefore x[i]>x[j] are not scalar values but 1D arrays of length 3. So our result is a 2D array of length-3 1D arrays, i.e. a 3D array.

Now we just have to apply our all, any, sum to this. For x[i] to be considered greater than x[j], we need every value of x[i] to be greater than the corresponding value of x[j]. Hence the all on axis 2 (the axis of length 3). Now we have a 2D matrix telling, for each i,j, whether x[i]>x[j].

For x[i] to have a smaller counterpart, that is, for x[i] to be greater than at least one x[j], we need at least one True in row i of that matrix. Hence the any(axis=1).

And lastly, what we have at this point is a 1D array of booleans, True where there exists at least one smaller row. We just need to count the True values. Hence the .sum().
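As an aside, the reshape calls hard-code the column count (3). A sketch of the same computation using None (i.e. np.newaxis) to insert the broadcast axes, so it works for any number of columns:

```python
import numpy as np

x = np.array([[1, 6, 2], [2, 6, 8], [4, 7, 12], [7, 9, 13]])

# x[:, None, :] has shape (n, 1, m) and x[None, :, :] has shape (1, n, m);
# broadcasting compares every row against every other row, whatever m is.
count = (x[:, None, :] > x[None, :, :]).all(axis=2).any(axis=1).sum()
print(count)  # 2
```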

Generator expression

One-liner (with one Python loop; not ideal, but better than two loops)

sum((r>x).all(axis=1).any() for r in x)

r>x is an array of booleans comparing each element of row r to each element of x . So, for example, when r is row x[2] , then r>x is

array([[ True,  True,  True],
       [ True,  True,  True],
       [False, False, False],
       [False, False, False]])

So (r>x).all(axis=1) is a shape (4,) array of booleans telling whether all booleans in each line are True (because .all with axis=1 reduces across columns only). In the previous example, that would be [True, True, False, False] . (x[1]>x).all(axis=1) would be [False, False, False, False] (the first line of x[1]>x contains 2 True , but that is not enough for .all ).

So (r>x).all(axis=1).any() tells you what you want to know: whether there is any line whose columns are all True, that is, whether there is any True in the previous array.

((r>x).all(axis=1).any() for r in x) is a generator that performs this computation for every row r of x. If you replaced the outer ( ) with [ ] , you would get a list of True and False values ([False, False, True, True], to be accurate; as you already said, False for the first two rows, True for the other two). But there is no need to build a list here, since we just want to count: a generator produces results only as the caller requires them, and here the caller is sum .

sum((r>x).all(axis=1).any() for r in x) counts the number of True results in the previous computation.

(In this case, since there are only 4 elements, I am not saving much memory by using a generator expression rather than a list comprehension. But it is a good habit to favor generators when we don't actually need to keep all the intermediate results in memory.)
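Putting it together, a quick check on the sample data, materializing the list just to show the intermediate booleans:

```python
import numpy as np

x = np.array([[1, 6, 2], [2, 6, 8], [4, 7, 12], [7, 9, 13]])

# One boolean per row: does this row dominate at least one other row?
flags = [(r > x).all(axis=1).any() for r in x]
print(flags)       # [False, False, True, True]
print(sum(flags))  # 2
```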

Timings

For your example, the computation takes 19 μs for pure numpy, 48 μs for my earlier answer, and 115 μs for di.bezrukov's.

But the difference (and the absence of a difference) shows when the number of rows grows. For 10000×3 data, the computation takes 3.9 seconds for both of my answers, while di.bezrukov's method takes 353 seconds.

The reasons behind these two facts:

  • The difference from di.bezrukov's grows because the number of inner for loops that I avoid grows, and they matter a lot.
  • The difference between my two versions disappears because my 2nd version (chronologically; first in this message, i.e. my pure-numpy version) only saves the outer loop. When the number of rows is small, that is not negligible. But when it is big, the outer loop itself (not counting its content, which is optimized by the inner loop) is just O(n) in an O(n²) computation. So, if n is big enough, we just don't care how efficient the outer loop is.
  • Even worse: memory-wise, the pure-numpy version does what I was so proud of not doing in my first version: it builds a full list of results. And that is nothing: it also builds a full 3D matrix of booleans, which is just an intermediate result. So, for n big enough (say 100000, unless you have some 50 GB of RAM), that intermediate result doesn't fit into memory. And even if you do have 50 GB of RAM, it won't be faster.
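One way to keep the pure-numpy speed without materializing the full n×n×m boolean array is to broadcast one chunk of rows at a time. A sketch (the chunk size of 1024 is an arbitrary choice, and count_dominating is a hypothetical name):

```python
import numpy as np

def count_dominating(x, chunk=1024):
    """Count rows that are element-wise greater than at least one other row.

    Processes `chunk` rows at a time, so the intermediate boolean array
    has at most chunk*n*m elements instead of n*n*m.
    """
    total = 0
    for start in range(0, x.shape[0], chunk):
        block = x[start:start + chunk]               # shape (c, m)
        greater = block[:, None, :] > x[None, :, :]  # shape (c, n, m)
        total += greater.all(axis=2).any(axis=1).sum()
    return int(total)

x = np.array([[1, 6, 2], [2, 6, 8], [4, 7, 12], [7, 9, 13]])
print(count_dominating(x))  # 2
```

No i == j exclusion is needed: a row is never strictly greater than itself, so the self-comparison can never be all True (the pure-numpy one-liner relies on the same property).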

Still, all 3 methods are O(n²); O(n²×m) even, if we call m the number of columns.

All have 3 nested loops. Di.bezrukov's has two explicit Python for loops, and one implicit loop in the .all (still a for loop, even if it runs in numpy's internal code). My generator version has 1 Python for loop and 2 implicit loops, .all and .any .
My pure-numpy version has no explicit loop, but 3 implicit numpy nested loops (in the building of the 3D array).

So they share the same loop structure; only numpy's loops are faster.

I am prouder of my pure-numpy version, because I didn't find it at first. But pragmatically, my first (generator) version is better. It is slower only when it doesn't matter (for very small arrays). It consumes hardly any extra memory. And the only loop it leaves in Python is the outer one, whose cost is negligible next to the inner loops.

tl;dr:

sum((r>x).all(axis=1).any() for r in x)

Unless you really have only 4 rows and every μs matters, or you are engaged in a contest of who can think in the purest numpy 3D chess :D, in which case

(x.reshape(-1, 1, 3) > x.reshape(1, -1, 3)).all(axis=2).any(axis=1).sum()
