Number of elements of array less than each element of cutoff array in Python
I've got a numpy array of strictly increasing "cutoff" values of length m, and a pandas series of values of length n (though the index isn't important and this could be cast to a numpy array). I need to come up with an efficient way of spitting out a length-m vector of counts of the number of elements in the pandas series less than the jth element of the "cutoff" array.
I could do this via a list comprehension:
output = np.array([(pan_series < cutoff_val).sum() for cutoff_val in cutoff_ar])
but I was wondering if there were any way to do this that leveraged more of numpy's magic speed, as I have to do this quite a few times inside multiple loops and it keeps crashing my computer.
Thanks!
Is this what you are looking for?
In [36]: a = np.random.random(20)
In [37]: a
Out[37]:
array([ 0.68574307, 0.15743428, 0.68006876, 0.63572484, 0.26279663,
0.14346269, 0.56267286, 0.47250091, 0.91168387, 0.98915746,
0.22174062, 0.11930722, 0.30848231, 0.1550406 , 0.60717858,
0.23805205, 0.57718675, 0.78075297, 0.17083826, 0.87301963])
In [38]: b = np.array((0.3,0.7))
In [39]: np.sum(a[:,None]<b[None,:], axis=0)
Out[39]: array([ 8, 16])
In [40]: np.sum(a[:,None]<b, axis=0) # b's new axis above is unnecessary...
Out[40]: array([ 8, 16])
In [41]: (a[:,None]<b).sum(axis=0) # even simpler
Out[41]: array([ 8, 16])
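Spelled out as a small self-contained script (the seeded generator is my addition, purely to make the sketch reproducible; it is not part of the original session), the broadcasting trick looks like this:

```python
import numpy as np

# Reproducible stand-ins for the sample and cutoff arrays above
# (the seed is an arbitrary choice, not from the original answer).
rng = np.random.default_rng(0)
a = rng.random(20)
b = np.array((0.3, 0.7))

# a[:, None] has shape (20, 1); comparing it against b (shape (2,))
# broadcasts to a (20, 2) boolean matrix.  Summing over axis 0 then
# counts, per cutoff, how many elements of a fall below it.
counts = (a[:, None] < b).sum(axis=0)

# Same result as the OP's list comprehension.
expected = np.array([(a < cutoff).sum() for cutoff in b])
assert (counts == expected).all()
```

The price of the trick is the (n, m) intermediate boolean matrix, which is why it can fall behind for a long cutoff array, as the timings below show.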
Timings are always well received (for a longish array of 2E6 elements)
In [47]: a = np.random.random(2000000)
In [48]: %timeit (a[:,None]<b).sum(axis=0)
10 loops, best of 3: 78.2 ms per loop
In [49]: %timeit np.searchsorted(a, b, 'right',sorter=a.argsort())
1 loop, best of 3: 448 ms per loop
For a smaller array
In [50]: a = np.random.random(2000)
In [51]: %timeit (a[:,None]<b).sum(axis=0)
10000 loops, best of 3: 89 µs per loop
In [52]: %timeit np.searchsorted(a, b, 'right',sorter=a.argsort())
The slowest run took 4.86 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 141 µs per loop
Edit
Divakar says that things may be different for lengthy b arrays; let's see
In [71]: a = np.random.random(2000)
In [72]: b =np.random.random(200)
In [73]: %timeit (a[:,None]<b).sum(axis=0)
1000 loops, best of 3: 1.44 ms per loop
In [74]: %timeit np.searchsorted(a, b, 'right',sorter=a.argsort())
10000 loops, best of 3: 172 µs per loop
quite different indeed! Thank you for prompting my curiosity.
Probably the OP should test for his use case: is the sample very long with respect to the cutoff sequences, or not? And where is the balance point?
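One knob worth turning in such a test: if the same sample array is reused against many cutoff arrays (as in the OP's nested loops), the sort can be paid once up front instead of re-running argsort on every call. A minimal sketch, assuming that reuse pattern (the sizes are hypothetical):

```python
import numpy as np

# Hypothetical sizes; the actual crossover depends on the use case.
rng = np.random.default_rng(1)
a = rng.random(2000)
b = rng.random(200)

# Sort the sample once; every subsequent query is then only a cheap
# binary search.  side='left' counts elements strictly below each
# cutoff, matching the OP's `<` comparison even when ties occur.
a_sorted = np.sort(a)
out = np.searchsorted(a_sorted, b, side='left')

# Agrees with the broadcasting version.
expected = (a[:, None] < b).sum(axis=0)
assert (out == expected).all()
```

With the sort amortized away, the searchsorted route should win even for short cutoff arrays, since each query is O(m log n) with no (n, m) intermediate.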
Edit #2
I made a blooper in my timings: I forgot the axis=0 argument to .sum()... I've edited the timings with the corrected statement and, of course, the corrected timing. My apologies.
You can use np.searchsorted for some NumPy magic -
# Convert to numpy array for some "magic"
pan_series_arr = np.array(pan_series)
# Let the magic begin!
sortidx = pan_series_arr.argsort()
out = np.searchsorted(pan_series_arr,cutoff_ar,'right',sorter=sortidx)
Explanation
You are performing [(pan_series < cutoff_val).sum() for cutoff_val in cutoff_ar], i.e. for each element in cutoff_ar we are counting the number of pan_series elements that are less than it. Now with np.searchsorted, we are looking for the positions at which the elements of cutoff_ar would be inserted into a sorted pan_series_arr so that each lands to the 'right' of the existing elements. These indices essentially represent the number of pan_series elements below the current cutoff_ar element, thus giving us our desired output.
Sample run
In [302]: cutoff_ar
Out[302]: array([ 1, 3, 9, 44, 63, 90])
In [303]: pan_series_arr
Out[303]: array([ 2, 8, 69, 55, 97])
In [304]: [(pan_series_arr < cutoff_val).sum() for cutoff_val in cutoff_ar]
Out[304]: [0, 1, 2, 2, 3, 4]
In [305]: sortidx = pan_series_arr.argsort()
...: out = np.searchsorted(pan_series_arr,cutoff_ar,'right',sorter=sortidx)
...:
In [306]: out
Out[306]: array([0, 1, 2, 2, 3, 4])
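The sample run above, put together as a runnable script. One caveat worth adding: side='right' counts elements less than or equal to each cutoff, which coincides with a strict "less than" only when no cutoff exactly equals a sample value (as is the case here); if exact ties are possible, side='left' is the precise match for `<`.

```python
import numpy as np

# The sample data from the run above.
pan_series_arr = np.array([2, 8, 69, 55, 97])
cutoff_ar = np.array([1, 3, 9, 44, 63, 90])

# Loop version: count elements strictly below each cutoff.
loop_out = np.array([(pan_series_arr < c).sum() for c in cutoff_ar])

# searchsorted version, as in the answer.
sortidx = pan_series_arr.argsort()
out = np.searchsorted(pan_series_arr, cutoff_ar, 'right', sorter=sortidx)

assert (loop_out == out).all()
assert list(out) == [0, 1, 2, 2, 3, 4]
```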