Element-wise comparison of numpy arrays (Python)
I would like to ask a question about the numpy array below.
I have a dataset with 50 rows and 15 columns, and I created a numpy array from it like this:
x=x.to_numpy()
My aim is to compare each row with every other row (element-wise, excluding itself) and find out whether there is any row whose values are all smaller than the values of that row.
Sample table:
a b c
1 6 2
2 6 8
4 7 12
7 9 13
For example, for row 1 and row 2 there is no such row. But for rows 3 and 4 there is one: all values of row 1 (and, for row 4, also of row 2) are smaller than the corresponding values of those rows. So the algorithm should return the count 2 (which stands for rows 3 and 4).
Which Python code should be implemented to get this particular result?
I have tried a bunch of code, but could not reach a proper solution. So if anyone has an idea, I would appreciate it.
Just use two loops and compare:
import numpy as np

def f(x):
    count = 0
    for i in range(x.shape[0]):
        for j in range(x.shape[0]):
            if i == j:
                continue
            # row i counts as soon as one other row j is smaller in every column
            if np.all(x[i] > x[j]):
                count += 1
                break
    return count

x = np.array([[1, 6, 2], [2, 6, 8], [4, 7, 12], [7, 9, 13]])
print(f(x))
(x.reshape(-1, 1, 3) > x.reshape(1, -1, 3)).all(axis=2).any(axis=1).sum()
The hard part is to think in 3d, so I start in 2d, with a simple comparison of numbers. Imagine you have x=np.array([1,2,3,4]) and you want to compare all elements of x to all other elements of x, making a 4x4 matrix of booleans.
What you would do is reshape x as a column of values on one side, and as a row on the other. So two 2d arrays: one 4x1, the other 1x4.
Then, when performing an operation between those two arrays, broadcasting will create a 4x4 array.
Just to visualize it, instead of a comparison, let's do this:
x=np.array([1,2,3,4])
x.reshape(-1,1) #is
#[[1],
# [2],
# [3],
# [4]]
x.reshape(1,-1) #is
# [ [1,2,3,4] ]
x.reshape(-1,1)*10+x.reshape(1,-1) #is therefore
# [[11, 12, 13, 14],
# [21, 22, 23, 24],
# [31, 32, 33, 34],
# [41, 42, 43, 44]]
# Likewise
x.reshape(-1,1)<x.reshape(1,-1) # is
#array([[False, True, True, True],
# [False, False, True, True],
# [False, False, False, True],
# [False, False, False, False]])
So, all we have to do is the exact same thing, but with values being length-3 1d arrays instead of scalars:
x.reshape(-1, 1, 3) > x.reshape(1, -1, 3)
Broadcasting will make this, as in the previous example, a 2d array of all x[i]>x[j], except that x[i], x[j] and therefore x[i]>x[j] are not scalars but 1d arrays of length 3. So our result is a 2d array of length-3 1d arrays, aka a 3d array.
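To make the shapes concrete, here is a small sketch on the sample table from the question. It uses None indexing instead of reshape, so the number of columns is not hard-coded; that spelling and the name cmp3d are my own choices for illustration, not part of the one-liner above.

```python
import numpy as np

x = np.array([[1, 6, 2], [2, 6, 8], [4, 7, 12], [7, 9, 13]])

# x[:, None, :] has shape (4, 1, 3) and x[None, :, :] has shape (1, 4, 3);
# broadcasting the comparison gives shape (4, 4, 3).
cmp3d = x[:, None, :] > x[None, :, :]
print(cmp3d.shape)   # (4, 4, 3)

# cmp3d[i, j] is the length-3 boolean array x[i] > x[j]
print(cmp3d[2, 0])   # x[2] > x[0], element-wise: all three entries are True

# Same result as the reshape spelling with the column count written out
same = np.array_equal(cmp3d, x.reshape(-1, 1, 3) > x.reshape(1, -1, 3))
print(same)          # True
```

With the 50×15 dataset from the question, the reshape version would need 15 in place of 3 (or x.shape[1]), whereas the None-indexing version works unchanged.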
Now we just have to apply all, any and sum to this. For x[i] to be considered greater than x[j], we need every value of x[i] to be > the corresponding value of x[j]. Hence the all on axis 2 (the axis of length 3). Now we have a 2d matrix telling, for each i,j, whether x[i]>x[j].
For x[i] to have a smaller counterpart, that is, for x[i] to be greater than at least one x[j], we need at least one True in row i of that matrix. Hence the any(axis=1).
And lastly, what we have at this point is a 1d array of booleans, True for each row that has at least one smaller row. We just need to count them. Hence the .sum()
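The three reductions can be followed one at a time on the sample table. This is only a sketch; the intermediate names (greater, has_smaller, count) are made up for illustration.

```python
import numpy as np

x = np.array([[1, 6, 2], [2, 6, 8], [4, 7, 12], [7, 9, 13]])
m = x.shape[1]  # number of columns (3 here)

cmp3d = x.reshape(-1, 1, m) > x.reshape(1, -1, m)

# greater[i, j] is True when row i is strictly greater than row j in every column.
# The diagonal is always False (x[i] > x[i] never holds), so a row is never
# compared favourably with itself.
greater = cmp3d.all(axis=2)

# has_smaller[i] is True when row i is greater than at least one other row
has_smaller = greater.any(axis=1)

count = int(has_smaller.sum())
print(has_smaller.tolist())  # [False, False, True, True]
print(count)                 # 2
```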
One-liner (with one loop. Not ideal, but better than 2 loops)
sum((r>x).all(axis=1).any() for r in x)
r>x is an array of booleans comparing each element of row r to the corresponding element of each row of x. So, for example, when r is row x[2], then r>x is
array([[ True, True, True],
[ True, True, True],
[False, False, False],
[False, False, False]])
So (r>x).all(axis=1) is a shape (4,) array of booleans telling whether all booleans in each line are True (because .all reduces along the columns only, axis=1). In the previous example, that would be [True, True, False, False]. (x[1]>x).all(axis=1) would be [False, False, False, False] (the first line of x[1]>x contains 2 True, but that is not enough for .all).
So (r>x).all(axis=1).any() tells what you want to know: whether there is any line whose columns are all True. That is, whether there is any True in the previous array.
((r>x).all(axis=1).any() for r in x) is an iterator of this computation over all rows r of x. If you replaced the outer ( ) by [ ], you would get a list of True and False (False, False, True, True, to be accurate, as you've already said: False for the first two rows, True for the other two). But there is no need to build a list here, since we just want to count. A compound iterator (a generator expression, in Python's terms) produces each result only when the caller asks for it, and here the caller is sum.
sum((r>x).all(axis=1).any() for r in x) counts the number of times we get True in the previous computation.
(In this case, since there are only 4 elements in the list, I am not saving much memory by using a compound iterator, i.e. a generator expression, rather than a compound list, i.e. a list comprehension. But it is a good habit to favor the iterator when we don't really need to build a list of all intermediary results in memory.)
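As a sketch of the one-loop version on the sample table: the function name count_rows_with_smaller is hypothetical, and the bool(...) conversion is my addition, only there so the list prints as plain Python booleans.

```python
import numpy as np

def count_rows_with_smaller(x):
    """Count rows that are strictly greater, in every column, than at least one other row."""
    # Generator expression: each per-row flag is produced only when sum asks for it.
    return sum(bool((r > x).all(axis=1).any()) for r in x)

x = np.array([[1, 6, 2], [2, 6, 8], [4, 7, 12], [7, 9, 13]])

# Same computation with [ ] instead of ( ): the intermediate list is materialized.
flags = [bool((r > x).all(axis=1).any()) for r in x]
print(flags)                       # [False, False, True, True]
print(count_rows_with_smaller(x))  # 2
```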
For your example, computation takes 19 μs for pure numpy, 48 μs for my former answer and 115 μs for di.bezrukov's.
But the difference (and the absence of difference) shows when the number of rows grows. For 10000×3 data, computation takes 3.9 seconds for both of my answers, and di.bezrukov's method takes 353 seconds.
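A comparison of this kind can be set up roughly as follows. This is only a sketch: the function names, the random test data and the 200×3 size are assumptions chosen so that it runs quickly, and the timings it prints depend on the machine, so they will not match the figures quoted above.

```python
import timeit
import numpy as np

def two_loops(x):
    # Two explicit Python loops, as in di.bezrukov's answer
    count = 0
    n = x.shape[0]
    for i in range(n):
        for j in range(n):
            if i != j and np.all(x[i] > x[j]):
                count += 1
                break
    return count

def one_loop(x):
    # One Python loop over rows, numpy for the rest
    return sum(bool((r > x).all(axis=1).any()) for r in x)

def pure_numpy(x):
    # No Python loop: builds the full n x n x m boolean array
    return int((x[:, None, :] > x[None, :, :]).all(axis=2).any(axis=1).sum())

rng = np.random.default_rng(0)
x = rng.integers(0, 1000, size=(200, 3))  # hypothetical test data

methods = (two_loops, one_loop, pure_numpy)
results = {f.__name__: f(x) for f in methods}
print(results)  # all three counts should agree

for f in methods:
    seconds = timeit.timeit(lambda: f(x), number=3) / 3
    print(f"{f.__name__}: {seconds:.6f} s per call")
```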
Reasons behind these 2 facts:
Still, all 3 methods are O(n²). O(n²×m) even, if we call m the number of columns.
All have 3 nested loops. Di.bezrukov's has two explicit Python for loops, and one implicit loop in the .all (still a for loop, even if it is done in numpy's internal code). My compound version has 1 Python for loop (the one in the compound iterator), and 2 implicit loops, .all and .any.
My pure numpy version has no explicit loop, but 3 implicit nested numpy loops (in the building of the 3d array).
So the time structure is the same. Only numpy's loops are faster.
I am prouder of my pure numpy version, because I didn't find it at first. But pragmatically, my first version (compound) is better. It is slower only when it doesn't matter (for very small arrays). It never builds the n×n×m intermediate array, so it consumes hardly any extra memory. And the only thing the pure numpy version vectorizes on top of it is the outer loop, whose cost is negligible compared to the inner loops.
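To put a rough number on the memory point, here is the arithmetic for the 10000×3 case, assuming numpy's boolean arrays take one byte per element. Only the boolean intermediates are counted; nothing is actually allocated.

```python
import numpy as np

n, m = 10_000, 3
itemsize = np.dtype(bool).itemsize  # 1 byte per boolean

# Pure numpy version: the whole n x n x m comparison array exists at once
full_3d_bytes = n * n * m * itemsize

# One-loop version: only one n x m comparison array (r > x) exists at a time
per_row_bytes = n * m * itemsize

print(full_3d_bytes)  # 300000000 (about 300 MB)
print(per_row_bytes)  # 30000 (about 30 kB)
```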
tl;dr:
sum((r>x).all(axis=1).any() for r in x)
Unless you really have only 4 rows and μs matter, or you are engaged in a contest of who can think in the purest numpy 3d-chess :D, in which case
(x.reshape(-1, 1, 3) > x.reshape(1, -1, 3)).all(axis=2).any(axis=1).sum()