Number of elements of array less than each element of cutoff array in Python
I've got a numpy array of strictly increasing "cutoff" values of length m, and a pandas series of values of length n (though the index isn't important and this could be cast to a numpy array). I need to come up with an efficient way of spitting out a length-m vector of counts of the number of elements in the pandas series less than the jth element of the "cutoff" array.
I could do this via a list comprehension:
output = np.array([(pan_series < cutoff_val).sum() for cutoff_val in cutoff_ar])
but I was wondering if there were any way to do this that leveraged more of numpy's magic speed, as I have to do this quite a few times inside multiple loops and it keeps crashing my computer.
Thanks!
Is this what you are looking for?
In [36]: a = np.random.random(20)
In [37]: a
Out[37]:
array([ 0.68574307, 0.15743428, 0.68006876, 0.63572484, 0.26279663,
0.14346269, 0.56267286, 0.47250091, 0.91168387, 0.98915746,
0.22174062, 0.11930722, 0.30848231, 0.1550406 , 0.60717858,
0.23805205, 0.57718675, 0.78075297, 0.17083826, 0.87301963])
In [38]: b = np.array((0.3,0.7))
In [39]: np.sum(a[:,None]<b[None,:], axis=0)
Out[39]: array([ 8, 16])
In [40]: np.sum(a[:,None]<b, axis=0) # b's new axis above is unnecessary...
Out[40]: array([ 8, 16])
In [41]: (a[:,None]<b).sum(axis=0) # even simpler
Out[41]: array([ 8, 16])
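Spelled out as a small self-contained script (the seeded generator is my addition, purely to make the sketch reproducible; it is not part of the original session), the broadcasting trick looks like this:

```python
import numpy as np

# Reproducible stand-ins for the sample and cutoff arrays above
# (the seed is an arbitrary choice, not from the original answer).
rng = np.random.default_rng(0)
a = rng.random(20)
b = np.array((0.3, 0.7))

# a[:, None] has shape (20, 1); comparing it against b (shape (2,))
# broadcasts to a (20, 2) boolean matrix.  Summing over axis 0 then
# counts, per cutoff, how many elements of a fall below it.
counts = (a[:, None] < b).sum(axis=0)

# Same result as the OP's list comprehension.
expected = np.array([(a < cutoff).sum() for cutoff in b])
assert (counts == expected).all()
```

The price of the trick is the (n, m) intermediate boolean matrix, which is why it can fall behind for a long cutoff array, as the timings below show.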
Timings are always well received (for a longish array of 2E6 elements)
In [47]: a = np.random.random(2000000)
In [48]: %timeit (a[:,None]<b).sum(axis=0)
10 loops, best of 3: 78.2 ms per loop
In [49]: %timeit np.searchsorted(a, b, 'right',sorter=a.argsort())
1 loop, best of 3: 448 ms per loop
For a smaller array
In [50]: a = np.random.random(2000)
In [51]: %timeit (a[:,None]<b).sum(axis=0)
10000 loops, best of 3: 89 µs per loop
In [52]: %timeit np.searchsorted(a, b, 'right',sorter=a.argsort())
The slowest run took 4.86 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 141 µs per loop
Edit
Divakar says that things may be different for lengthy b arrays; let's see
In [71]: a = np.random.random(2000)
In [72]: b =np.random.random(200)
In [73]: %timeit (a[:,None]<b).sum(axis=0)
1000 loops, best of 3: 1.44 ms per loop
In [74]: %timeit np.searchsorted(a, b, 'right',sorter=a.argsort())
10000 loops, best of 3: 172 µs per loop
quite different indeed! Thank you for prompting my curiosity.
Probably the OP should test for his use case: is the sample very long with respect to the cutoff sequences, or not? And where is the balance point?
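One knob worth turning in such a test: if the same sample array is reused against many cutoff arrays (as in the OP's nested loops), the sort can be paid once up front instead of re-running argsort on every call. A minimal sketch, assuming that reuse pattern (the sizes are hypothetical):

```python
import numpy as np

# Hypothetical sizes; the actual crossover depends on the use case.
rng = np.random.default_rng(1)
a = rng.random(2000)
b = rng.random(200)

# Sort the sample once; every subsequent query is then only a cheap
# binary search.  side='left' counts elements strictly below each
# cutoff, matching the OP's `<` comparison even when ties occur.
a_sorted = np.sort(a)
out = np.searchsorted(a_sorted, b, side='left')

# Agrees with the broadcasting version.
expected = (a[:, None] < b).sum(axis=0)
assert (out == expected).all()
```

With the sort amortized away, the searchsorted route should win even for short cutoff arrays, since each query is O(m log n) with no (n, m) intermediate.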
Edit #2
I made a blooper in my timings: I forgot the axis=0 argument to .sum()... I've edited the timings with the corrected statement and, of course, the corrected timing. My apologies.
You can use np.searchsorted for some NumPy magic -
# Convert to numpy array for some "magic"
pan_series_arr = np.array(pan_series)
# Let the magic begin!
sortidx = pan_series_arr.argsort()
out = np.searchsorted(pan_series_arr,cutoff_ar,'right',sorter=sortidx)
Explanation
You are performing [(pan_series < cutoff_val).sum() for cutoff_val in cutoff_ar], i.e. for each element in cutoff_ar we are counting the number of pan_series elements that are less than it. Now with np.searchsorted, we are looking for the positions at which the elements of cutoff_ar would be inserted into a sorted pan_series_arr so that each lands to the 'right' of the existing elements. These indices essentially represent the number of pan_series elements below the current cutoff_ar element, thus giving us our desired output.
Sample run
In [302]: cutoff_ar
Out[302]: array([ 1, 3, 9, 44, 63, 90])
In [303]: pan_series_arr
Out[303]: array([ 2, 8, 69, 55, 97])
In [304]: [(pan_series_arr < cutoff_val).sum() for cutoff_val in cutoff_ar]
Out[304]: [0, 1, 2, 2, 3, 4]
In [305]: sortidx = pan_series_arr.argsort()
...: out = np.searchsorted(pan_series_arr,cutoff_ar,'right',sorter=sortidx)
...:
In [306]: out
Out[306]: array([0, 1, 2, 2, 3, 4])
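The sample run above, put together as a runnable script. One caveat worth adding: side='right' counts elements less than or equal to each cutoff, which coincides with a strict "less than" only when no cutoff exactly equals a sample value (as is the case here); if exact ties are possible, side='left' is the precise match for `<`.

```python
import numpy as np

# The sample data from the run above.
pan_series_arr = np.array([2, 8, 69, 55, 97])
cutoff_ar = np.array([1, 3, 9, 44, 63, 90])

# Loop version: count elements strictly below each cutoff.
loop_out = np.array([(pan_series_arr < c).sum() for c in cutoff_ar])

# searchsorted version, as in the answer.
sortidx = pan_series_arr.argsort()
out = np.searchsorted(pan_series_arr, cutoff_ar, 'right', sorter=sortidx)

assert (loop_out == out).all()
assert list(out) == [0, 1, 2, 2, 3, 4]
```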