Getting Unique 1D NumPy Array Values without Sorting
I have many large 1D arrays and I'd like to grab the unique values. Typically, one could do:
x = np.random.randint(10000, size=100000000)
np.unique(x)
However, this performs an unnecessary sort of the array. The docs for np.unique do not mention any way to retrieve the indices without sorting. Other answers with np.unique include using return_index but, as I understand it, the array is still being sorted. So, I tried using set:
set(x)
But this is way slower than sorting the array with np.unique. Is there a faster way to retrieve the unique values for this array that avoids sorting and is faster than np.unique?
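For reference, here is a minimal sketch of what return_index provides, using a small made-up array. It suggests that return_index reports where each unique value first occurs, but the values themselves still come back in sorted order, so recovering the order of first appearance is an extra step on top of the sort rather than a way around it.

```python
import numpy as np

# Small illustrative array (made up for this sketch)
x = np.array([3, 1, 3, 2, 1])

# np.unique returns the unique values in sorted order;
# first_idx[i] is the index of the first occurrence of values[i] in x
values, first_idx = np.unique(x, return_index=True)

# Reordering by first_idx recovers the order of first appearance,
# but this happens after np.unique has already sorted, not instead of it
in_order = x[np.sort(first_idx)]

print(values)     # unique values, sorted
print(first_idx)  # index of first occurrence of each unique value
print(in_order)   # unique values in order of first appearance
```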
If your values are non-negative integers in a relatively small range (e.g. 0 ... 10000), there is an alternative way to obtain the unique values using masks (see unique2() below):
import numpy as np

def unique1(x):
    return np.unique(x)

def unique2(x):
    maxVal = np.max(x)+1
    values = np.arange(maxVal)
    used = np.zeros(maxVal)
    used[x] = 1               # mark every value that occurs in x
    return values[used==1]

# optimized (with option to provide known value range)
def unique3(x,maxVal=None):
    maxVal = maxVal or np.max(x)+1
    used = np.zeros(maxVal,dtype=np.uint8)
    used[x] = 1
    return np.argwhere(used==1)[:,0]
In my tests this method is a lot faster than np.unique and it does not involve sorting:
from timeit import timeit
count = 3
x = np.random.randint(10000, size=100000000)
t = timeit(lambda:unique1(x),number=count)
print("unique1",t)
t = timeit(lambda:unique2(x),number=count)
print("unique2",t)
t = timeit(lambda:unique3(x),number=count)
print("unique3",t)
t = timeit(lambda:unique3(x,10000),number=count)
print("unique3",t, "with known value range")
# unique1 16.894681214000002
# unique2 0.8627655060000023
# unique3 0.8411087540000004
# unique3 0.5896318829999991 with known value range
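The mask approach indexes an array directly with the values, so it only works as written for non-negative integers. A possible extension, sketched below, shifts the values by their minimum so that negative integers are also covered. The helper name unique_offset is hypothetical and not part of the functions above; it assumes a non-empty integer array whose value range (max minus min) is small enough to allocate a mask for, and it creates a temporary copy of x when shifting.

```python
import numpy as np

def unique_offset(x):
    """Hypothetical helper: mask-based unique values for integer arrays
    that may contain negatives. Assumes x is non-empty and max-min is small."""
    lo = int(x.min())
    hi = int(x.max())
    # One mask slot per possible value in [lo, hi]
    used = np.zeros(hi - lo + 1, dtype=bool)
    # Shift values so the smallest one maps to slot 0, then mark them
    used[x - lo] = True
    # Indices of the marked slots, shifted back to the original values
    return np.flatnonzero(used) + lo

x = np.array([5, -2, 5, 0, -2, 3])
result = unique_offset(x)
print(result)
print(np.array_equal(result, np.unique(x)))
```

Like unique2() and unique3(), this returns the values in ascending order as a side effect of scanning the mask, without sorting x itself.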
Just in case you change your mind about dependencies, here's a dirt simple numba.njit implementation:
import numba
import numpy as np

@numba.njit
def unique(arr):
    return np.array(list(set(arr)))
%timeit unique(x) #using Alain T.'s benchmark array
2.64 s ± 799 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit np.unique(x)
5.45 s ± 233 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Not as lightning fast as the mask approach above, but it doesn't require non-negative integer inputs either.