Fastest way to slice an ndarray

Question

I have some event data from an HDF5 file:

>>> events
<class 'h5py._hl.dataset.Dataset'>

I get the array data like so:

>>> events = events[:]

And the structure is like so:

>>> type(events)
<type 'numpy.ndarray'>
>>> events.shape
(273856,)
>>> type(events[0])
<type 'numpy.void'>
>>> events[0]
(0, 30, 3523, 5352)
>>> # More information on structure 
>>> [type(e) for e in events[0]]    
[<type 'numpy.uint64'>, 
 <type 'numpy.uint32'>, 
 <type 'numpy.float64'>, 
 <type 'numpy.float64'>]   
>>> events.dtype 
[('start', '<u8'), 
 ('length', '<u4'), 
 ('mean', '<f8'), 
 ('variance', '<f8')]

I need the largest index of an event whose first field is less than some value. The brute-force approach is:

>>> for i, e in enumerate(events):
...     if e[0] >= val:
...         break

The first field of each record is sorted, so I can use bisection to speed things up:

>>> import bisect
>>> field1 = [row[0] for row in events]
>>> index = bisect.bisect_right(field1, val)

This shows an improvement, but [row[0] for row in events] is slower than I expected. Any ideas on how to tackle this problem?

Answer 1

Yep, iterating over numpy arrays the way you're currently doing is relatively slow. Normally, you'd use slicing instead (which creates a view, rather than copying the data into a list).
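To make the view-versus-copy distinction concrete, here is a minimal sketch. The array `a` is made-up stand-in data, not the events from the question.

```python
import numpy as np

# Hypothetical stand-in data: a small 2D integer array.
a = np.arange(12).reshape(3, 4)

# Basic slicing returns a view onto the same memory; nothing is copied.
first_col = a[:, 0]

# A list comprehension visits every row in Python and builds a new list.
first_col_list = [row[0] for row in a]

# The slice aliases the original buffer...
shares = np.shares_memory(a, first_col)

# ...so a later write to `a` shows through the view but not the list.
a[0, 0] = 99
```

The view costs the same no matter how many rows there are, whereas the list comprehension does per-row work in the interpreter.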

It looks like you have an object array. This will make things even slower. Do you really need an object array? It looks like all of the values are ints. (Is this a "vlen" hdf5 dataset?)

The use case where an object array would make sense is if you have a different number of items in each element of events. If you don't, then there's no reason to use one.

If you were using a 2D array of ints instead of an object array of tuples, you'd just do:

field1 = events[:,0]

However, in that case, you could just do the following (searchsorted uses bisection):

index = np.searchsorted(events[:,0], val)

Edit

Ah! Okay, you have a structured array. In other words, it's an array (1D, in this case) where each item is a C-like struct. From:

>>> events.dtype 
[('start', '<u8'), 
 ('length', '<u4'), 
 ('mean', '<f8'), 
 ('variance', '<f8')]

...we can see that the first field is named "start".

Therefore, you just want:

index = np.searchsorted(events["start"], val)
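As a rough illustration of how this maps onto the question, the sketch below builds a tiny structured array with the same field layout as the dtype shown above. The values and `val` are invented for the example. It also shows the `side` argument, since `bisect.bisect_right` corresponds to `side="right"` while the default is `side="left"`.

```python
import bisect
import numpy as np

# Same field layout as the dtype in the question; the values are made up.
dtype = [("start", "<u8"), ("length", "<u4"), ("mean", "<f8"), ("variance", "<f8")]
events = np.array(
    [(0, 30, 3523.0, 5352.0),
     (40, 10, 3100.0, 4800.0),
     (40, 25, 2900.0, 5100.0),
     (90, 15, 3300.0, 4950.0)],
    dtype=dtype,
)
# Use the field's own integer type so no float conversion is involved.
val = np.uint64(40)

starts = events["start"]  # field access returns a view, not a copy

# side="left" (the default): number of entries strictly less than val.
n_less = np.searchsorted(starts, val, side="left")
# side="right": same convention as bisect.bisect_right.
n_less_equal = np.searchsorted(starts, val, side="right")

# Largest index whose start is < val; this is -1 when no such entry exists.
last_index_below = n_less - 1
```

For "largest index where the field is less than val", the default `side="left"` result minus one is the value to use; `side="right"` only matters if you want to reproduce `bisect_right` exactly.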

In more general terms, if you didn't know the name of the field, but knew that it was a structured array of some sort, you'd do (paring things down to just the slicing step):

events[events.dtype.names[0]]
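A small sketch of that lookup wrapped in a function, in case it is useful. `first_field` is a hypothetical helper name, and the two-field array is placeholder data.

```python
import numpy as np

def first_field(arr):
    """Return a view of the first field of a structured array (hypothetical helper)."""
    names = arr.dtype.names  # None for ordinary (non-structured) arrays
    if names is None:
        raise TypeError("expected a structured array")
    return arr[names[0]]

# Placeholder structured array with two of the fields from the question.
events = np.zeros(3, dtype=[("start", "<u8"), ("length", "<u4")])
starts = first_field(events)
```

Checking `dtype.names` for `None` first gives a clearer error than the indexing failure you would otherwise get on a plain array.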

As for whether it's a good idea to convert everything to a "normal" 2D array, that depends on your use case. For basic slicing and calling searchsorted, there's no reason to; there shouldn't be any significant speed increase (untested).

Based on what you're doing at the moment, I'd just leave it as is.

However, structured arrays are often cumbersome to deal with.

There are plenty of cases where structured arrays are very useful (e.g. reading in certain binary formats from disk), but if you want to think of it as a "table-like" array, you'll quickly hit pain points. You're often better off storing the columns as separate arrays. (Or better yet, use a pandas.DataFrame for "tabular" data.)

If you did want to convert it to a plain 2D array, do:

events = np.column_stack([events[name] for name in events.dtype.names])

This will automatically find a compatible datatype for the new array (float64 in this case, since the fields mix unsigned integers and floats) and stack the fields of the structured array into columns of a 2D array. Note that np.hstack on 1D fields would instead concatenate them end to end into a single 1D array.

Calling events = events.astype(int) will effectively just yield the first column. (This is because each item of events is a C-like struct, and astype operates element-wise, so each struct is converted to a single int.)
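For reference, a minimal sketch of converting the fields into a 2D array, assuming made-up event values with the dtype from the question. It shows two routes: stacking the fields by hand with `np.column_stack`, and `numpy.lib.recfunctions.structured_to_unstructured`, which is available in NumPy 1.16 and later.

```python
import numpy as np
from numpy.lib import recfunctions as rfn

# Same field layout as the dtype in the question; the values are made up.
dtype = [("start", "<u8"), ("length", "<u4"), ("mean", "<f8"), ("variance", "<f8")]
events = np.array(
    [(0, 30, 3523.0, 5352.0),
     (40, 10, 3100.0, 4800.0)],
    dtype=dtype,
)

# Route 1: pull out each field and stack the 1D fields as columns.
stacked = np.column_stack([events[name] for name in events.dtype.names])

# Route 2: let NumPy's helper do the conversion.
unstructured = rfn.structured_to_unstructured(events)
```

Both results have one row per event and one column per field. Because the fields mix integer and floating-point types, the common dtype is float64.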

Answer 2

You can use numpy.searchsorted:

>>> a = np.arange(10000).reshape(2500,4)
>>> np.searchsorted(a[:,0], 1000)
250

Timing comparisons:

>>> %timeit np.searchsorted(a[:,0], 1000)
100000 loops, best of 3: 11.7 µs per loop
>>> %timeit field1 = [row[0] for row in a];bisect.bisect_right(field1, 1000)
100 loops, best of 3: 2.63 ms per loop
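Outside IPython, a comparison along these lines can be sketched with the standard `timeit` module. The function names and repeat counts are arbitrary choices for the example, and the absolute numbers will vary by machine. `side="right"` is used here so both approaches return the same index; the snippet above uses the default `side="left"`, which gives 250 rather than 251.

```python
import bisect
import timeit
import numpy as np

a = np.arange(10000).reshape(2500, 4)

def with_searchsorted():
    # Bisection on a view of the first column, done in compiled code.
    return np.searchsorted(a[:, 0], 1000, side="right")

def with_bisect():
    # Builds a Python list row by row before bisecting.
    field1 = [row[0] for row in a]
    return bisect.bisect_right(field1, 1000)

t_np = timeit.timeit(with_searchsorted, number=1000)
t_py = timeit.timeit(with_bisect, number=10)
print(f"searchsorted:  {t_np / 1000 * 1e6:.1f} µs per call")
print(f"list + bisect: {t_py / 10 * 1e3:.2f} ms per call")
```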

Answer 1: 4, accepted, 2014-01-17 17:56:34

Answer 2: 2, 2014-01-17 17:55:41