Number of elements of array less than each element of cutoff array in python

I've got a numpy array of strictly increasing "cutoff" values of length m, and a pandas series of values of length n (though the index isn't important and this could be cast to a numpy array). I need to come up with an efficient way of producing a length-m vector of counts, where the jth entry is the number of elements in the pandas series less than the jth element of the "cutoff" array.

I could do this via a list comprehension:

output = array([(pan_series < cutoff_val).sum() for cutoff_val in cutoff_ar])

but I was wondering if there is a way to do this that leverages more of numpy's magic speed, as I have to do this quite a few times inside multiple loops and it keeps crashing my computer.

Thanks!

Is this what you are looking for?

In [36]: a = np.random.random(20)

In [37]: a
Out[37]: 
array([ 0.68574307,  0.15743428,  0.68006876,  0.63572484,  0.26279663,
        0.14346269,  0.56267286,  0.47250091,  0.91168387,  0.98915746,
        0.22174062,  0.11930722,  0.30848231,  0.1550406 ,  0.60717858,
        0.23805205,  0.57718675,  0.78075297,  0.17083826,  0.87301963])

In [38]: b = np.array((0.3,0.7))

In [39]: np.sum(a[:,None]<b[None,:], axis=0)
Out[39]: array([ 8, 16])

In [40]: np.sum(a[:,None]<b, axis=0) # b's new axis above is unnecessary...
Out[40]: array([ 8, 16])

In [41]: (a[:,None]<b).sum(axis=0)   # even simpler
Out[41]: array([ 8, 16])
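The broadcasting comparison above builds a temporary boolean matrix with one row per sample and one column per cutoff, which can use a lot of memory for long inputs. As a rough sketch (the function name `count_below_broadcast` and the `chunk` parameter are illustrative assumptions, not part of the original answer), the same counts can be accumulated block by block:

```python
import numpy as np

def count_below_broadcast(values, cutoffs, chunk=100_000):
    """For each cutoff, count how many values are strictly below it.

    `values` is processed in blocks so the temporary boolean matrix is at
    most chunk x len(cutoffs) rather than len(values) x len(cutoffs).
    """
    values = np.asarray(values)
    cutoffs = np.asarray(cutoffs)
    counts = np.zeros(cutoffs.shape[0], dtype=np.int64)
    for start in range(0, values.shape[0], chunk):
        block = values[start:start + chunk]
        # Same broadcasting trick as above, applied to one block at a time.
        counts += (block[:, None] < cutoffs).sum(axis=0)
    return counts

rng = np.random.default_rng(0)
a = rng.random(20)
b = np.array([0.3, 0.7])

result = count_below_broadcast(a, b, chunk=8)
# Reference: the list comprehension from the question.
expected = np.array([(a < c).sum() for c in b])
print(result, expected)
```

Whether chunking helps in practice depends on the sizes involved; for small inputs the single-expression form is simpler.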

Timings are always well received (here for a longish array of 2E6 elements):

In [47]: a = np.random.random(2000000)

In [48]: %timeit (a[:,None]<b).sum(axis=0)
10 loops, best of 3: 78.2 ms per loop

In [49]: %timeit np.searchsorted(a, b, 'right',sorter=a.argsort())
1 loop, best of 3: 448 ms per loop

For a smaller array:

In [50]: a = np.random.random(2000)

In [51]: %timeit (a[:,None]<b).sum(axis=0)
10000 loops, best of 3: 89 µs per loop

In [52]: %timeit np.searchsorted(a, b, 'right',sorter=a.argsort())
The slowest run took 4.86 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 141 µs per loop

Edit

Divakar says that things may be different for lengthy b arrays; let's see:

In [71]: a = np.random.random(2000)

In [72]: b =np.random.random(200)

In [73]: %timeit (a[:,None]<b).sum(axis=0)
1000 loops, best of 3: 1.44 ms per loop

In [74]: %timeit np.searchsorted(a, b, 'right',sorter=a.argsort())
10000 loops, best of 3: 172 µs per loop

Quite different indeed! Thank you for prompting my curiosity.

The OP should probably test this on his own use case: is the sample very long relative to the cutoff sequence or not, and where is the break-even point?
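One way to look for that break-even point is to time both expressions over a range of cutoff lengths. The sketch below is only a starting point: the helper names `by_broadcast` and `by_searchsorted`, the sizes, and the repeat count are arbitrary assumptions, and the numbers it prints will vary by machine. The broadcasting version performs one comparison per (sample, cutoff) pair, while the sorting version pays for one sort plus a binary search per cutoff, so the gap is expected to shift as the cutoff array grows.

```python
import timeit
import numpy as np

def by_broadcast(a, b):
    # One comparison per (sample, cutoff) pair.
    return (a[:, None] < b).sum(axis=0)

def by_searchsorted(a, b):
    # One argsort of the samples, then a binary search per cutoff.
    return np.searchsorted(a, b, 'right', sorter=a.argsort())

rng = np.random.default_rng(0)
a = rng.random(2000)

timings = {}
for m in (2, 20, 200):
    b = rng.random(m)
    timings[m] = (
        timeit.timeit(lambda: by_broadcast(a, b), number=20),
        timeit.timeit(lambda: by_searchsorted(a, b), number=20),
    )
    print(m, timings[m])
```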


Edit #2

I made a blooper in my timings: I forgot the axis=0 argument to .sum().

I've edited the timings above to use the corrected statement and, of course, the corrected timings. My apologies.

You can use np.searchsorted for some NumPy magic:

# Convert to numpy array for some "magic"
pan_series_arr = np.array(pan_series)

# Let the magic begin!
sortidx = pan_series_arr.argsort()
out = np.searchsorted(pan_series_arr,cutoff_ar,'right',sorter=sortidx)

Explanation

The list comprehension [(pan_series < cutoff_val).sum() for cutoff_val in cutoff_ar] counts, for each element of cutoff_ar, the number of pan_series elements that are less than it. np.searchsorted instead finds, for each element of cutoff_ar, the index at which it would have to be inserted into the sorted pan_series_arr to keep it sorted (the sorter argument supplies the sort order, so the array itself does not need to be sorted first). That insertion index equals the number of pan_series elements below the current cutoff_ar element, which is the desired output. One caveat: with 'right', elements equal to the cutoff are counted as well (a <= comparison); if ties can occur and a strict < is needed, use 'left'.

Sample run

In [302]: cutoff_ar
Out[302]: array([ 1,  3,  9, 44, 63, 90])

In [303]: pan_series_arr
Out[303]: array([ 2,  8, 69, 55, 97])

In [304]: [(pan_series_arr < cutoff_val).sum() for cutoff_val in cutoff_ar]
Out[304]: [0, 1, 2, 2, 3, 4]

In [305]: sortidx = pan_series_arr.argsort()
     ...: out = np.searchsorted(pan_series_arr,cutoff_ar,'right',sorter=sortidx)
     ...: 

In [306]: out
Out[306]: array([0, 1, 2, 2, 3, 4])
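The sample above has no value equal to a cutoff, so the choice of side makes no difference there. The following sketch uses made-up data with a tie to show where 'left' and 'right' diverge. It also sorts the values once with np.sort instead of passing a sorter, which may be convenient when the same values are checked against many cutoff arrays; that reuse pattern is a suggestion, not something stated in the answer.

```python
import numpy as np

values = np.array([2, 8, 8, 69, 55, 97])   # note the repeated 8
cutoffs = np.array([1, 8, 9, 90])          # 8 ties with two of the values

# Sort once; the sorted copy can be reused for any number of cutoff arrays.
sorted_values = np.sort(values)

# side='left'  -> number of values strictly less than each cutoff (<)
strictly_below = np.searchsorted(sorted_values, cutoffs, side='left')
# side='right' -> number of values less than or equal to each cutoff (<=)
at_or_below = np.searchsorted(sorted_values, cutoffs, side='right')

# Reference: the list comprehension from the question.
reference = np.array([(values < c).sum() for c in cutoffs])

print(strictly_below)  # [0 1 3 5]
print(at_or_below)     # [0 3 3 5]
print(reference)       # [0 1 3 5]
```

Only side='left' matches the strict comparison from the question when a cutoff coincides with a data value.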
