I would like to ask a question for a numpy array below.
I have a dataset, which has 50 rows and 15 columns
and I created a numpy array as such:
x=x.to_numpy()
My aim is compare each row with other rows(elementwise and except itself) and find whether if there is any row which all values smaller than that row.
Sample table:
a b c
1 6 2
2 6 8
4 7 12
7 9 13
for example for row 1 and row2 there is no such a row. But rows 3,4 there is a row which all values of row 1 and row 2 are smaller than all those. So the algorithm should return the count 2 (which indicates the row 3 and 4).
Which Python code should be implemented to get this particular return.
I have tried a bunch of code, but could not reach a proper solution. So if anyone has an idea on that I would be appreciated.
Just use two loops and compare
import numpy as np
def f(x):
count = 0
for i in range(x.shape[0]):
for j in range(x.shape[0]):
if i == j:
continue
if np.all(x[i] > x[j]):
count += 1
break
return count
x = np.array([[1, 6, 2], [2, 6, 8], [4, 7, 12], [7, 9, 13]])
print(f(x))
(x.reshape(-1, 1, 3) > x.reshape(1, -1, 3)).all(axis=2).any(axis=1).sum()
The hard part is to think in 3d, so I start in 2d, with simple comparison of numbers. Imagine you have x=np.array([1,2,3,4])
and you want to compare all elements of x to all other elments of x, making a matrix 4x4 matrix of booleans.
What you would do, is to reshape x as a column of values on one side, and as a line on the other. So two 2d arrays: one 4x1, the other 1x4.
Then, when performing an operation among those two arrays, broadcasting will create a 4x4 array.
Just to visualize it, instead of comparison, let's do this
x=np.array([1,2,3,4])
x.reshape(-1,1) #is
#[[1],
# [2],
# [3],
# [4]]
x.reshape(1,-1) #is
# [ [1,2,3,4] ]
x.reshape(-1,1)*10+x.reshape(1,-1) #is therefore
# [[11, 12, 13, 14],
# [21, 22, 23, 24],
# [31, 32, 33, 34],
# [41, 42, 43, 44]]
# Likewise
x.reshape(-1,1)<x.reshape(1,-1) # is
#array([[False, True, True, True],
# [False, False, True, True],
# [False, False, False, True],
# [False, False, False, False]])
So, all we have to do is the exact same thing. But with values being length-3 1d arrays instead of scalars:
x.reshape(-1, 1, 3) > x.reshape(1, -1, 3)
Broadcasting will make this, as in previous example, a 2d array of all x[i]>x[j]
, except that x[i]
, x[j]
and therefore x[i]>x[j]
are not values, but 1d length 3 arrays. So our result is a 2d array of length 3 1d array, aka a 3d array.
Now we just have to do our all, any, sum on this. For x[i]
to be considered x[j]
, we need all the values of x[i]
to be >
to all values of x[j]
. Hence the all
on axis 2 (the axis of length 3). Now we have a 2d matrix telling for each i,j if x[i]>x[j]
.
For x[j]
to have a smaller counterpart, that is for x[j]
to be greater to at least one x[i]
, we need at least one True on x[j]
column. Hence the any(axis=1)
.
And lastly, what we have at this point is a 1d array of booleans, True if it exists at least one smaller value. We just need to count them. Hence the .sum()
One-liner (with one loop. Not ideal, but better than 2 loops)
sum((r>x).all(axis=1).any() for r in x)
r>x
is an array of booleans comparing each elemnts of row r
to each element of x
. So, for example, when r
is row x[2]
, then r>x
is
array([[ True, True, True],
[ True, True, True],
[False, False, False],
[False, False, False]])
So (r>x).all(axis=1)
is a shape (4,)
array of booleans telling if all booleans in each line (because .all
iterates through columns only, axis=1
) are True or not. In previous example, that would be [True, True, False, False]
. (x[1]>x).all(axis=1)
would be [False, False, False, False]
(first line of x[1]>x
contains 2 True
, but that is not enough for .all
)
So (r>x).all(axis=1).any()
tells what you want to know: if there is any line whose all columns are True
. That is if there is any True in previous array.
((r>x).all(axis=1).any() for r in x)
is an iterator of this computation for all rows r
of x. If you replaced the outer (
)
by [
, ]
, you would get a list of True
and False
(False, False, True, True, to be accurate, as you've alraedy said: False for 1st two rows, True for two others). But no need to build a list here, since we just want to count. A compound iterator will produce result only as the caller will require, and here, the caller is sum
.
sum((r>x).all(axis=1).any() for r in x)
counts the number of times we get True
in the previous computation.
(In this case, since there are only 4 elements in the list, it is not like I was sparing much memory by using a compound iterator rather than a compound list. But it is a good habit to try to favor compound iterator when we don't really need to build a list of all intermediary results in memory)
For your example, computation takes 19 μs for pure numpy, 48 μs for former answer and 115 μs for di.bezrukov's.
But difference (and absence of difference) shows when the number of rows grows. For 10000×3 data, then, computation takes 3.9 seconds for both my answers, and di.bezrukov's method takes 353 seconds.
Reason behind this 2 facts:
Still, all 3 methods are O(n²). O(n²×m) even, if we call m
the number of columns
All have 3 nested loops. Di.bezrukov's have two explicit python for
loop, and one implicit loop in the .all
(still a for loop, even if it is done in numpy's internal code). My compound version has 1 python compound for
loop, and 2 implicit loops .all
and .any
.
My pure numpy version have no explicit loop, but 3 implicity numpy's nested loop (in the building of the 3d array)
So same time structure. Only numpy's loop are faster.
I am prouder of my pure numpy version, because I didn't found it at first. But pragmatically, my first version (compound) is better. It is slower only when it doesn't matter (for very small arrays). It doesn't consume any memory. And it numpize only the outer loop, that is negligible before inner loop.
tl;dr:
sum((r>x).all(axis=1).any() for r in x)
Unless you really have only 4 rows and μs matter, or you are engaged in a contest of who can think in purest numpy 3d-chess:D, in which case
(x.reshape(-1, 1, 3) > x.reshape(1, -1, 3)).all(axis=2).any(axis=1).sum()
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.