
How do I use a numpy mask efficiently?

I have this array a:

[[  1.       1.       0.      42.533   43.53   159.6652]
 [  1.       1.       0.      57.122   28.21   144.8538]
 [  1.       1.       1.      86.586   32.37   195.6714]
 [  1.       2.       1.      33.768    4.89    58.5222]
 [  1.       2.       0.      90.336   30.19   195.9074]
 [  1.       2.       0.      57.099   27.16   142.4066]
 [  2.       3.       0.      48.371   19.14   103.0763]
 [  2.       3.       1.      30.82     4.74    50.02  ]
 [  2.       3.       0.      27.147   50.98   142.3491]
 [  2.       4.       0.      27.275   43.79   127.4165]
 [  2.       4.       0.      79.439    8.79   121.7297]
 [  2.       4.       1.      21.747   44.44   121.5951]]

What I would like to do is... well, let me show you.

mask = np.array([np.where((a[:, 1]==i[1]) & (a[:, 2]==1)) for i in a])
a[:, -1] -= a[mask][:, 0, 0, -1]

What the above code does is:

Suppose for each row i, the last element is v_i. For each row i, there is a row with the same 2nd element as i and with 3rd element equal to 1. Call this row j. Now we subtract the last element of j from the last element of i. That is, v_i = v_i - v_j.
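The described operation can be checked on a tiny made-up array (a minimal sketch; the 4-row array here is invented for illustration, with the same column layout as the question):

```python
import numpy as np

# Invented example: columns are (id, group, flag, value)
a = np.array([
    [1., 1., 0., 10.],
    [1., 1., 1.,  3.],
    [1., 2., 1.,  5.],
    [1., 2., 0.,  8.],
])

# For each row i, find the row j with the same 2nd element (group)
# and 3rd element equal to 1, then subtract j's last element.
mask = np.array([np.where((a[:, 1] == i[1]) & (a[:, 2] == 1)) for i in a])
a[:, -1] -= a[mask][:, 0, 0, -1]
print(a[:, -1])  # → [7. 0. 0. 3.]
```

Each group's flag-1 row ends up at 0, and every other row is reduced by its group's reference value.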

The code I have pasted above works fine. But it takes way too long (on my actual array, which is far bigger than the example I've pasted). I am quite sure it is the list comprehension that is slowing it down. So I am looking for a way to do this faster, possibly even without a loop (or a mask).

I would also like to ask if there is a way to get the sum of the last elements of the rows, grouped by the value of the 2nd element.

So, for example, the first element of that result would be 159.6652 + 144.8538 + 195.6714 = 500.1904.

And I would have 4 such numbers. Again, I have done this using a loop, but it takes too much time to run!
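For reference, this grouped sum can also be written without a Python loop. One possible sketch, assuming (as in the example) that the rows are already sorted by the 2nd column, uses `np.add.reduceat` over the group boundaries:

```python
import numpy as np

# The 2nd and last columns of the question's array
groups = np.array([1., 1., 1., 2., 2., 2., 3., 3., 3., 4., 4., 4.])
values = np.array([159.6652, 144.8538, 195.6714,  58.5222, 195.9074,
                   142.4066, 103.0763,  50.02,   142.3491, 127.4165,
                   121.7297, 121.5951])

# A group starts at row 0 and wherever the group value changes;
# np.add.reduceat sums each contiguous run between those starts.
starts = np.flatnonzero(np.r_[True, groups[1:] != groups[:-1]])
sums = np.add.reduceat(values, starts)
print(sums)  # → [500.1904 396.8362 295.4454 370.7413]
```

This relies on the groups being contiguous; for unsorted rows a different approach (e.g. `np.bincount`) is needed.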

I am new to numpy and have just learned how important speed is when working with large datasets. I would be grateful if I can learn something new here. Thanks for taking the time to read this. Please feel free to comment if anything isn't clear.

Here is a solution using np.unique. It makes no assumption about the order of rows. If the 2nd column is already grouped and ordered as in your example, this can be simplified.

# find unique ids and
# idx such that unq[idx] would recover a[:,1]
unq, idx = np.unique(a[:,1], return_inverse=True)
unq
# array([1., 2., 3., 4.])
idx
# array([0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3])

# find indices of reference rows     
ridx = a[:,2].nonzero()[0]
ridx
# array([ 2,  3,  7, 11])

# extract reference rows (last col only) in order of unq
ref = np.empty(unq.size,a.dtype)
ref[idx[ridx]] = a[ridx,-1]
ref
# array([195.6714,  58.5222,  50.02  , 121.5951])

# subtract reference
# (replace "-" with "-=" to subtract in-place) 
a[:,-1] - ref[idx]
# array([-3.600620e+01, -5.081760e+01,  0.000000e+00,  0.000000e+00,
#         1.373852e+02,  8.388440e+01,  5.305630e+01,  0.000000e+00,
#         9.232910e+01,  5.821400e+00,  1.346000e-01,  0.000000e+00])

# group sums 
np.bincount(idx,a[:,-1])
# array([500.1904, 396.8362, 295.4454, 370.7413])
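For completeness, the steps above can be run end-to-end as one self-contained script (using the array from the question):

```python
import numpy as np

a = np.array([
    [1., 1., 0., 42.533, 43.53, 159.6652],
    [1., 1., 0., 57.122, 28.21, 144.8538],
    [1., 1., 1., 86.586, 32.37, 195.6714],
    [1., 2., 1., 33.768,  4.89,  58.5222],
    [1., 2., 0., 90.336, 30.19, 195.9074],
    [1., 2., 0., 57.099, 27.16, 142.4066],
    [2., 3., 0., 48.371, 19.14, 103.0763],
    [2., 3., 1., 30.82,   4.74,  50.02  ],
    [2., 3., 0., 27.147, 50.98, 142.3491],
    [2., 4., 0., 27.275, 43.79, 127.4165],
    [2., 4., 0., 79.439,  8.79, 121.7297],
    [2., 4., 1., 21.747, 44.44, 121.5951],
])

# group labels -> contiguous indices 0..3
unq, idx = np.unique(a[:, 1], return_inverse=True)

# reference rows are those with a 1 in the 3rd column
ridx = a[:, 2].nonzero()[0]

# reference value (last col) per group, in order of unq
ref = np.empty(unq.size, a.dtype)
ref[idx[ridx]] = a[ridx, -1]

diff = a[:, -1] - ref[idx]          # per-row subtraction
sums = np.bincount(idx, a[:, -1])   # per-group sums

print(diff)  # matches the array shown above
print(sums)  # matches the array shown above
```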
