Numpy first occurrence of value greater than existing value
I have a 1D array in numpy and I want to find the index of the first position where the array's value exceeds a given value.
E.g.
aa = range(-10,10)
Find the position in aa where the value 5 gets exceeded.
This is a little faster (and looks nicer):
np.argmax(aa>5)
This works because argmax stops at the first True ("In case of multiple occurrences of the maximum values, the indices corresponding to the first occurrence are returned.") and does not save another list.
In [2]: N = 10000
In [3]: aa = np.arange(-N,N)
In [4]: timeit np.argmax(aa>N/2)
100000 loops, best of 3: 52.3 us per loop
In [5]: timeit np.where(aa>N/2)[0][0]
10000 loops, best of 3: 141 us per loop
In [6]: timeit np.nonzero(aa>N/2)[0][0]
10000 loops, best of 3: 142 us per loop
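To see why this works, it can help to look at the intermediate boolean mask. The following is a minimal sketch, assuming the toy data from the question converted to a NumPy array (a plain range does not support an elementwise aa > 5):

```python
import numpy as np

aa = np.arange(-10, 10)        # toy data from the question, as a NumPy array
mask = aa > 5                  # boolean array: False up to and including the value 5
first = int(np.argmax(mask))   # True is the maximum; argmax reports its first occurrence

print(mask[14:18])             # the False -> True transition around the threshold
print(first, aa[first])        # 16 6
```

Keep in mind that the mask is still computed for the whole array; only the search for the first True is short.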
Given the sorted content of your array, there is an even faster method: searchsorted.
import numpy as np
N = 10000
aa = np.arange(-N,N)
%timeit np.searchsorted(aa, N/2, side='right')
%timeit np.argmax(aa>N/2)
%timeit np.where(aa>N/2)[0][0]
%timeit np.nonzero(aa>N/2)[0][0]
# Output
100000 loops, best of 3: 5.97 µs per loop
10000 loops, best of 3: 46.3 µs per loop
10000 loops, best of 3: 154 µs per loop
10000 loops, best of 3: 154 µs per loop
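One caveat with searchsorted: with the default side='left' it returns the index of the first element that is greater than or equal to the value, so adding 1 to that result only gives the first strictly greater element when the value itself occurs in the array. A small sketch, using the toy array from the question, of how side='right' appears to avoid that special case:

```python
import numpy as np

aa = np.arange(-10, 10)   # sorted toy array from the question

# Default side='left', then +1: fine when the value is present ...
plus_one_present = np.searchsorted(aa, 5) + 1
# ... but one position too far when the value is absent.
plus_one_absent = np.searchsorted(aa, 4.8) + 1

# side='right' yields the first index whose element is strictly greater.
right_present = np.searchsorted(aa, 5, side='right')
right_absent = np.searchsorted(aa, 4.8, side='right')

print(plus_one_present, plus_one_absent)   # 16 16
print(right_present, right_absent)         # 16 15
```

If no element is greater than the value, the result equals len(aa), which is not a valid index and has to be checked by the caller.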
I was also interested in this and I've compared all the suggested answers with perfplot. (Disclaimer: I'm the author of perfplot.)
If you know that the array you're looking through is already sorted, then
numpy.searchsorted(a, alpha)
is for you. It's an O(log(n)) operation, i.e., the speed hardly depends on the size of the array. You can't get faster than that.
If you don't know anything about your array, you're not going wrong with
numpy.argmax(a > alpha)
Already sorted: [plot]
Unsorted: [plot]
Code to reproduce the plot:
import numpy
import perfplot

alpha = 0.5
numpy.random.seed(0)

def argmax(data):
    return numpy.argmax(data > alpha)

def where(data):
    return numpy.where(data > alpha)[0][0]

def nonzero(data):
    return numpy.nonzero(data > alpha)[0][0]

def searchsorted(data):
    return numpy.searchsorted(data, alpha)

perfplot.save(
    "out.png",
    # setup=numpy.random.rand,
    setup=lambda n: numpy.sort(numpy.random.rand(n)),
    kernels=[argmax, where, nonzero, searchsorted],
    n_range=[2 ** k for k in range(2, 23)],
    xlabel="len(array)",
)
In [34]: a=np.arange(-10,10)
In [35]: a
Out[35]:
array([-10, -9, -8, -7, -6, -5, -4, -3, -2, -1, 0, 1, 2,
3, 4, 5, 6, 7, 8, 9])
In [36]: np.where(a>5)
Out[36]: (array([16, 17, 18, 19]),)
In [37]: np.where(a>5)[0][0]
Out[37]: 16
In case of a range or any other linearly increasing array you can simply calculate the index programmatically, with no need to actually iterate over the array at all:
def first_index_calculate_range_like(val, arr):
    if len(arr) == 0:
        raise ValueError('no value greater than {}'.format(val))
    elif len(arr) == 1:
        if arr[0] > val:
            return 0
        else:
            raise ValueError('no value greater than {}'.format(val))
    first_value = arr[0]
    step = arr[1] - first_value
    # For linearly decreasing arrays or constant arrays we only need to check
    # the first element, because if that does not satisfy the condition
    # no other element will.
    if step <= 0:
        if first_value > val:
            return 0
        else:
            raise ValueError('no value greater than {}'.format(val))
    calculated_position = (val - first_value) / step
    if calculated_position < 0:
        return 0
    # >= rather than >: if val equals the last element, nothing is greater
    elif calculated_position >= len(arr) - 1:
        raise ValueError('no value greater than {}'.format(val))
    return int(calculated_position) + 1
One could probably improve that a bit. I have made sure it works correctly for a few sample arrays and values, but that doesn't mean there couldn't be mistakes in there, especially considering that it uses floats...
>>> import numpy as np
>>> first_index_calculate_range_like(5, np.arange(-10, 10))
16
>>> np.arange(-10, 10)[16] # double check
6
>>> first_index_calculate_range_like(4.8, np.arange(-10, 10))
15
Given that it can calculate the position without any iteration it will be constant time (O(1)) and can probably beat all other mentioned approaches. However, it requires a constant step in the array, otherwise it will produce wrong results.
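Because the calculation is easy to get wrong by one position, it may be worth cross-checking it against a brute-force search. The following is only a sketch: first_index_constant_step is a hypothetical helper (not part of NumPy) that takes the start, step and length explicitly and assumes a strictly positive step.

```python
import math
import numpy as np

def first_index_constant_step(val, start, step, n):
    """Hypothetical helper: index of the first element > val in
    start + step * arange(n), assuming step > 0."""
    if n == 0 or step <= 0:
        raise ValueError('expects a non-empty, strictly increasing array')
    # Number of elements <= val, clamped at 0 for values below the start.
    idx = max(0, math.floor((val - start) / step) + 1)
    if idx >= n:
        raise ValueError('no value greater than {}'.format(val))
    return idx

arr = np.arange(-10, 10)
for val in [-12, -10, -0.5, 4.8, 5, 8.9]:
    expected = int(np.argmax(arr > val))   # brute-force reference
    got = first_index_constant_step(val, arr[0], arr[1] - arr[0], len(arr))
    assert got == expected
```

The division still happens in floating point, so for very large arrays or non-integer steps the rounding concerns mentioned above remain.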
A more general approach would be using a numba function:
import numba as nb

@nb.njit
def first_index_numba(val, arr):
    # stop at the first element that exceeds val
    for idx in range(len(arr)):
        if arr[idx] > val:
            return idx
    return -1
That will work for any array but it has to iterate over the array, so in the average case it will be O(n):
>>> first_index_numba(4.8, np.arange(-10, 10))
15
>>> first_index_numba(5, np.arange(-10, 10))
16
Even though Nico Schlömer already provided some benchmarks, I thought it might be useful to include my new solutions and to test for different "values".
The test setup:
import numpy as np
import math
import numba as nb
def first_index_using_argmax(val, arr):
    return np.argmax(arr > val)

def first_index_using_where(val, arr):
    return np.where(arr > val)[0][0]

def first_index_using_nonzero(val, arr):
    return np.nonzero(arr > val)[0][0]

def first_index_using_searchsorted(val, arr):
    # the + 1 is only correct when val itself occurs in arr
    return np.searchsorted(arr, val) + 1

def first_index_using_min(val, arr):
    return np.min(np.where(arr > val))

def first_index_calculate_range_like(val, arr):
    if len(arr) == 0:
        raise ValueError('empty array')
    elif len(arr) == 1:
        if arr[0] > val:
            return 0
        else:
            raise ValueError('no value greater than {}'.format(val))
    first_value = arr[0]
    step = arr[1] - first_value
    if step <= 0:
        if first_value > val:
            return 0
        else:
            raise ValueError('no value greater than {}'.format(val))
    calculated_position = (val - first_value) / step
    if calculated_position < 0:
        return 0
    # >= rather than >: if val equals the last element, nothing is greater
    elif calculated_position >= len(arr) - 1:
        raise ValueError('no value greater than {}'.format(val))
    return int(calculated_position) + 1

@nb.njit
def first_index_numba(val, arr):
    for idx in range(len(arr)):
        if arr[idx] > val:
            return idx
    return -1

funcs = [
    first_index_using_argmax,
    first_index_using_min,
    first_index_using_nonzero,
    first_index_calculate_range_like,
    first_index_numba,
    first_index_using_searchsorted,
    first_index_using_where
]
from simple_benchmark import benchmark, MultiArgument
and the plots were generated using:
%matplotlib notebook
b.plot()
b = benchmark(
    funcs,
    {2**i: MultiArgument([0, np.arange(2**i)]) for i in range(2, 20)},
    argument_name="array size")
The numba function performs best, followed by the calculate function and the searchsorted function. The other solutions perform much worse.
b = benchmark(
    funcs,
    {2**i: MultiArgument([2**i-2, np.arange(2**i)]) for i in range(2, 20)},
    argument_name="array size")
For small arrays the numba function performs amazingly fast; however, for bigger arrays it's outperformed by the calculate function and the searchsorted function.
b = benchmark(
    funcs,
    {2**i: MultiArgument([np.sqrt(2**i), np.arange(2**i)]) for i in range(2, 20)},
    argument_name="array size")
This is more interesting. Again numba and the calculate function perform great; however, this is actually triggering the worst case of searchsorted, which really doesn't work well in this case.
Another interesting point is how these functions behave if there is no value whose index should be returned:
arr = np.ones(100)
value = 2
for func in funcs:
    print(func.__name__)
    try:
        print('-->', func(value, arr))
    except Exception as e:
        print('-->', e)
With this result:
first_index_using_argmax
--> 0
first_index_using_min
--> zero-size array to reduction operation minimum which has no identity
first_index_using_nonzero
--> index 0 is out of bounds for axis 0 with size 0
first_index_calculate_range_like
--> no value greater than 2
first_index_numba
--> -1
first_index_using_searchsorted
--> 101
first_index_using_where
--> index 0 is out of bounds for axis 0 with size 0
Searchsorted, argmax, and numba simply return a wrong value. However, searchsorted and numba return an index that is not a valid index for the array.
The functions where, min, nonzero and calculate throw an exception. However, only the exception for calculate actually says anything helpful.
That means one actually has to wrap these calls in an appropriate wrapper function that catches exceptions or invalid return values and handles them appropriately, at least if you aren't sure whether the value could be in the array.
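Such a wrapper could look roughly like the following. This is only a sketch built on the argmax variant; the name first_index_greater and the choice to return None when nothing matches are assumptions made for illustration, and raising an exception would be an equally reasonable design.

```python
import numpy as np

def first_index_greater(arr, val):
    """Hypothetical wrapper: index of the first element > val, or None."""
    mask = np.asarray(arr) > val
    if mask.size == 0:          # argmax raises on empty input, so guard first
        return None
    idx = int(np.argmax(mask))
    # argmax returns 0 when no element is True, so verify the hit explicitly
    return idx if mask[idx] else None

print(first_index_greater(np.arange(-10, 10), 5))   # 16
print(first_index_greater(np.ones(100), 2))         # None
```

Checking mask[idx] costs a single element access, so it avoids a second pass over the array such as mask.any().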
Note: The calculate and searchsorted options only work in special conditions. The "calculate" function requires a constant step and searchsorted requires the array to be sorted. So these could be useful in the right circumstances but aren't general solutions for this problem. In case you're dealing with sorted Python lists you might want to take a look at the bisect module instead of using NumPy's searchsorted.
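For the sorted Python list case, a minimal sketch with the standard-library bisect module might look like this. The helper name first_index_greater_sorted_list is hypothetical, and returning None for "not found" is an assumption for illustration.

```python
import bisect

def first_index_greater_sorted_list(sorted_list, val):
    """Hypothetical helper: first index whose element is > val, or None."""
    # bisect_right returns the insertion point after any entries equal to val,
    # which is exactly the position of the first strictly greater element.
    idx = bisect.bisect_right(sorted_list, val)
    return idx if idx < len(sorted_list) else None

values = list(range(-10, 10))   # a sorted plain Python list
print(first_index_greater_sorted_list(values, 5))   # 16
print(first_index_greater_sorted_list(values, 9))   # None
```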
I'd like to propose
np.min(np.append(np.where(aa>5)[0],np.inf))
This will return the smallest index where the condition is met, while returning infinity if the condition is never met (and where returns an empty array).
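Note that appending np.inf promotes the index array to floating point, so the result is a float even when a match exists and has to be converted before it can be used as an index. A short sketch of what that looks like, assuming the toy array from the question:

```python
import numpy as np

aa = np.arange(-10, 10)

found = np.min(np.append(np.where(aa > 5)[0], np.inf))
missing = np.min(np.append(np.where(aa > 20)[0], np.inf))

print(found)               # 16.0, a float rather than an integer index
print(np.isinf(missing))   # True: no element exceeds 20

# Convert back to int before indexing, and only when a match was found.
if np.isfinite(found):
    print(aa[int(found)])  # 6
```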
I would go with
i = np.min(np.where(V >= x))
where V is a vector (1d array), x is the value and i is the resulting index.
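Be aware that >= finds the first element that is at least x, whereas the question asks for the first element that exceeds x; the two differ whenever x itself occurs in the array. A small sketch of the difference, assuming the toy array from the question:

```python
import numpy as np

V = np.arange(-10, 10)
x = 5

i_at_least = np.min(np.where(V >= x))   # first index with V[i] >= x
i_exceeds = np.min(np.where(V > x))     # first index with V[i] >  x

print(i_at_least, V[i_at_least])   # 15 5
print(i_exceeds, V[i_exceeds])     # 16 6
```

Both variants raise a ValueError when no element satisfies the condition, because np.min is then applied to an empty array.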
You should use np.where instead of np.argmax. The latter will return position 0 even if no value is found, which is not the index you expect.
>>> aa = np.array(range(-10,10))
>>> aa
array([-10, -9, -8, -7, -6, -5, -4, -3, -2, -1, 0, 1, 2,
3, 4, 5, 6, 7, 8, 9])
If the condition is met, it returns an array of the indexes.
>>> idx = np.where(aa > 5)[0]
>>> idx
array([16, 17, 18, 19], dtype=int64)
Otherwise, if not met, it returns an empty array.
>>> not_found = np.where(aa > 20)[0]
>>> not_found
array([], dtype=int64)
The point against argmax for this case is: the simpler the better, IF the solution is not ambiguous. So, to check if something fell into the condition, just do if len(np.where(aa > value_to_search)[0]) > 0.