
Element-wise comparison of numpy arrays (Python)

I would like to ask a question about the numpy array below.

I have a dataset with 50 rows and 15 columns, and I created a numpy array from it like this:

x=x.to_numpy()

My aim is to compare each row with every other row (element-wise, excluding itself) and find whether there is any row whose values are all smaller than that row's.

Sample table:

a b c         
1 6 2
2 6 8
4 7 12
7 9 13

For example, for rows 1 and 2 there is no such row. But for rows 3 and 4 there is: all values of rows 1 and 2 are smaller than the corresponding values of rows 3 and 4. So the algorithm should return the count 2 (for rows 3 and 4).

Which Python code should I implement to get this result?

I have tried a bunch of code, but could not reach a proper solution. If anyone has an idea, I would appreciate it.

Just use two loops and compare:

import numpy as np

def f(x):
    count = 0

    for i in range(x.shape[0]):
        for j in range(x.shape[0]):
            if i == j:
                continue
            if np.all(x[i] > x[j]):
                count += 1
                break

    return count

x = np.array([[1, 6, 2], [2, 6, 8], [4, 7, 12], [7, 9, 13]])
print(f(x))

Edit: Pure-numpy solution

(x.reshape(-1, 1, 3) > x.reshape(1, -1, 3)).all(axis=2).any(axis=1).sum()
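The one-liner above hard-codes the column count 3. Here is a minimal sketch of the same idea for any number of columns, assuming x is a 2d numeric array. The function name count_rows_with_smaller is made up for illustration and is not part of the original answer.

```python
import numpy as np

def count_rows_with_smaller(x):
    # Hypothetical helper name, not from the original answer.
    # x[:, None, :] has shape (n, 1, m); x[None, :, :] has shape (1, n, m).
    # Broadcasting compares every row i with every row j: shape (n, n, m).
    greater = x[:, None, :] > x[None, :, :]
    # all(axis=2): row i is strictly greater than row j in every column.
    # any(axis=1): row i has at least one such row j.
    return int(greater.all(axis=2).any(axis=1).sum())

x = np.array([[1, 6, 2], [2, 6, 8], [4, 7, 12], [7, 9, 13]])
print(count_rows_with_smaller(x))  # 2
```

A row is never strictly greater than itself, so the i == j case needs no special handling.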

Explanation

The hard part is to think in 3d, so I start in 2d, with a simple comparison of numbers. Imagine you have x=np.array([1,2,3,4]) and you want to compare all elements of x to all other elements of x, producing a 4x4 matrix of booleans.

What you would do is reshape x as a column of values on one side, and as a row on the other. So two 2d arrays: one 4x1, the other 1x4.

Then, when performing an operation between those two arrays, broadcasting will create a 4x4 array.

Just to visualize it, instead of a comparison, let's do this:

x=np.array([1,2,3,4])
x.reshape(-1,1) #is
#[[1],
# [2],
# [3],
# [4]]
x.reshape(1,-1) #is
# [ [1,2,3,4] ]
x.reshape(-1,1)*10+x.reshape(1,-1) #is therefore
# [[11, 12, 13, 14],
#  [21, 22, 23, 24],
#  [31, 32, 33, 34],
#  [41, 42, 43, 44]]

# Likewise 
x.reshape(-1,1)<x.reshape(1,-1) # is
#array([[False,  True,  True,  True],
#       [False, False,  True,  True],
#       [False, False, False,  True],
#       [False, False, False, False]])

So, all we have to do is the exact same thing, but with values being length-3 1d arrays instead of scalars:
x.reshape(-1, 1, 3) > x.reshape(1, -1, 3)

As in the previous example, broadcasting makes this a 2d array of all x[i]>x[j], except that x[i], x[j] and therefore x[i]>x[j] are not scalars but length-3 1d arrays. So our result is a 2d array of length-3 1d arrays, aka a 3d array.

Now we just have to apply all, any and sum to this. For x[i] to be considered greater than x[j], we need every value of x[i] to be > the corresponding value of x[j]. Hence the all on axis 2 (the axis of length 3). Now we have a 2d matrix telling, for each i,j, whether x[i]>x[j].

For x[i] to have a smaller counterpart, that is for x[i] to be greater than at least one x[j], we need at least one True in row i of that matrix. Hence the any(axis=1).

And lastly, what we have at this point is a 1d array of booleans, True where at least one smaller row exists. We just need to count them. Hence the .sum().
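To make the three reductions concrete, here is a small sketch that prints each intermediate result for the sample table. The variable names cmp3d, dominates and has_smaller are chosen here for illustration only.

```python
import numpy as np

x = np.array([[1, 6, 2], [2, 6, 8], [4, 7, 12], [7, 9, 13]])

# Step 1: broadcast comparison of every row against every row.
cmp3d = x.reshape(-1, 1, 3) > x.reshape(1, -1, 3)
print(cmp3d.shape)              # (4, 4, 3)

# Step 2: dominates[i, j] is True when every value of x[i] > x[j].
dominates = cmp3d.all(axis=2)
print(dominates.astype(int))
# [[0 0 0 0]
#  [0 0 0 0]
#  [1 1 0 0]
#  [1 1 1 0]]

# Step 3: has_smaller[i] is True when row i of dominates has any True.
has_smaller = dominates.any(axis=1)
print(has_smaller)              # [False False  True  True]

# Step 4: count the True values.
print(has_smaller.sum())        # 2
```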

Compound iteration (generator expression)

One-liner (with one loop. Not ideal, but better than 2 loops)

sum((r>x).all(axis=1).any() for r in x)

r>x is an array of booleans comparing each element of row r to the corresponding element of each row of x. So, for example, when r is row x[2], then r>x is

array([[ True,  True,  True],
       [ True,  True,  True],
       [False, False, False],
       [False, False, False]])

So (r>x).all(axis=1) is a shape (4,) array of booleans telling whether all booleans in each row are True (with axis=1, .all reduces across the columns of each row). In the previous example, that would be [True, True, False, False]. (x[1]>x).all(axis=1) would be [False, False, False, False] (the first row of x[1]>x contains 2 True, but that is not enough for .all).

So (r>x).all(axis=1).any() tells what you want to know: whether there is any row whose columns are all True. That is, whether there is any True in the previous array.

((r>x).all(axis=1).any() for r in x) is a generator expression (a "compound iterator") running this computation for every row r of x. If you replaced the outer ( ) with [ ], you would get a list of True and False (False, False, True, True, to be accurate, as you've already said: False for the first two rows, True for the other two). But there is no need to build a list here, since we just want to count. A generator produces results only as the caller requests them, and here the caller is sum.

sum((r>x).all(axis=1).any() for r in x) counts the number of times we get True in the previous computation.

(In this case, since there are only 4 elements in the list, it is not like I am saving much memory by using a generator expression rather than a list comprehension. But it is a good habit to favor a generator when we don't really need to build a list of all intermediate results in memory.)
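A short sketch contrasting the list form and the generator form on the sample table. The bool(...) wrapper is added here only so the printed list shows plain Python booleans; it is not needed for the count.

```python
import numpy as np

x = np.array([[1, 6, 2], [2, 6, 8], [4, 7, 12], [7, 9, 13]])

# List form: every per-row result is stored before counting.
flags = [bool((r > x).all(axis=1).any()) for r in x]
print(flags)       # [False, False, True, True]
print(sum(flags))  # 2

# Generator form: sum() pulls one result at a time, no list is built.
count = sum(bool((r > x).all(axis=1).any()) for r in x)
print(count)       # 2
```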

Timings

For your example, computation takes 19 μs for pure numpy, 48 μs for my former answer and 115 μs for di.bezrukov's.

But the difference (and absence of difference) shows when the number of rows grows. For 10000×3 data, computation takes 3.9 seconds for both my answers, and di.bezrukov's method takes 353 seconds.
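For anyone who wants to reproduce this kind of comparison, here is a rough sketch using timeit. The function names, the random test data, the seed and the array size are all assumptions made for this sketch, and the numbers it prints will depend on the machine, so do not expect them to match the figures quoted above.

```python
import timeit
import numpy as np

def two_loops(x):
    # Same logic as the two-loop version above.
    count = 0
    for i in range(x.shape[0]):
        for j in range(x.shape[0]):
            if i != j and np.all(x[i] > x[j]):
                count += 1
                break
    return count

def one_loop(x):
    # Generator-expression version.
    return int(sum((r > x).all(axis=1).any() for r in x))

def pure_numpy(x):
    # Fully broadcast version, written for any number of columns.
    return int((x[:, None, :] > x[None, :, :]).all(axis=2).any(axis=1).sum())

rng = np.random.default_rng(0)            # arbitrary seed (assumption)
x = rng.integers(0, 100, size=(200, 3))   # kept small so the run is quick

# All three must agree before timing means anything.
assert two_loops(x) == one_loop(x) == pure_numpy(x)

for fn in (two_loops, one_loop, pure_numpy):
    seconds = timeit.timeit(lambda: fn(x), number=3) / 3
    print(f"{fn.__name__}: {seconds * 1e6:.0f} µs per call")
```

Increasing the row count in size=(200, 3) is the way to see how the gap between the versions evolves.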

Reasons behind these 2 facts:

  • The difference with di.bezrukov's grows bigger because the number of inner for-loop iterations that I avoid grows bigger, and they matter a lot.
  • The difference between my 2 versions disappears because my 2nd version (chronologically; it comes first in this message, aka my pure numpy version) only spares the outer loop. When the number of rows is not that big, that is not negligible. But when it is big... well, that outer loop itself (not counting its content, which is already optimized by the inner loops) is just O(n), in an O(n²) computation. So, if n is big enough, we just don't care how efficient this outer loop is.
  • Even worse: memory-wise, the pure numpy version does what I was so proud of not doing in my first version: it computes a full list of results. And that is the least of it. It also computes a full 3d matrix of booleans, which is just an intermediate result. So, for n big enough (say 100000, unless you have some 50 GB of RAM) that intermediate result doesn't fit into memory. And even if you have 50 GB of RAM, it won't be faster.
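A back-of-the-envelope sketch of that memory cost, plus one possible middle ground that broadcasts only a block of rows at a time. Both function names and the default block size are hypothetical, chosen for this illustration; the blocked variant is not one of the versions timed above.

```python
import numpy as np

def broadcast_bytes(n, m):
    # NumPy stores one byte per boolean, so the (n, n, m) intermediate
    # of the fully broadcast comparison needs n * n * m bytes.
    return n * n * m

# 3d array alone for n=100000, m=3; the (n, n) result of .all adds n * n more.
print(broadcast_bytes(100_000, 3) / 1e9)  # 30.0 (GB)

def count_in_blocks(x, block=1024):
    # Hypothetical variant: compare `block` rows against all rows at once,
    # so the largest intermediate has shape (block, n, m) instead of (n, n, m).
    count = 0
    for start in range(0, x.shape[0], block):
        chunk = x[start:start + block]                # shape (b, m)
        greater = chunk[:, None, :] > x[None, :, :]   # shape (b, n, m)
        count += int(greater.all(axis=2).any(axis=1).sum())
    return count

x = np.array([[1, 6, 2], [2, 6, 8], [4, 7, 12], [7, 9, 13]])
print(count_in_blocks(x, block=2))  # 2
```

With block=1 this behaves like the one-loop version, and with block >= n like the pure numpy one.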

Still, all 3 methods are O(n²). O(n²×m) even, if we call m the number of columns.

All have 3 nested loops. di.bezrukov's version has two explicit Python for loops, and one implicit loop in the .all (still a for loop, even if it runs in numpy's internal code). My generator version has 1 Python for loop and 2 implicit loops, .all and .any.
My pure numpy version has no explicit loop, but 3 implicit nested numpy loops (in the building of the 3d array).

So same time structure. Only numpy's loops are faster.

I am prouder of my pure numpy version, because I didn't find it at first. But pragmatically, my first version (generator) is better. It is slower only when it doesn't matter (for very small arrays). It consumes almost no extra memory. And the only loop it leaves in Python is the outer one, which is negligible next to the inner loops.

tl;dr:

sum((r>x).all(axis=1).any() for r in x)

Unless you really have only 4 rows and μs matter, or you are engaged in a contest of who can think in the purest numpy 3d-chess :D, in which case

(x.reshape(-1, 1, 3) > x.reshape(1, -1, 3)).all(axis=2).any(axis=1).sum()
