Getting Unique 1D NumPy Array Values without Sorting

I have many large 1D arrays and I'd like to grab the unique values. Typically, one could do:

import numpy as np

x = np.random.randint(10000, size=100000000)
np.unique(x)

However, this performs an unnecessary sort of the array. The docs for np.unique do not mention any way to retrieve the indices without sorting. Other answers involving np.unique suggest using return_index but, as I understand it, the array is still sorted. So, I tried using set:

set(x)

But this is way slower than sorting the array with np.unique. Is there a way to retrieve the unique values of this array that avoids sorting and is faster than np.unique?
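To make the return_index remark concrete, here is a small sketch on a made-up five-element array (the array and variable names are illustrative only). It shows that np.unique returns sorted values even with return_index, and that the order of first appearance can be recovered from the returned indices, at the cost of the sort already performed:

```python
import numpy as np

x = np.array([3, 1, 3, 2, 1])

# np.unique always returns sorted values; return_index adds the position
# of each value's first occurrence in x.
values, first_idx = np.unique(x, return_index=True)
print(values)      # [1 2 3]
print(first_idx)   # [1 3 0]

# Sorting the first-occurrence indices recovers order of first appearance.
in_order = x[np.sort(first_idx)]
print(in_order)    # [3 1 2]
```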

If your values are non-negative integers in a relatively small range (e.g. 0 ... 10000), there is an alternative way to obtain the unique values using a mask (see unique2() below):

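import numpy as np

def unique1(x):
    # baseline: np.unique sorts internally
    return np.unique(x)

def unique2(x):
    # x must hold non-negative integers; negative values would wrap around as indices
    maxVal    = np.max(x)+1
    values    = np.arange(maxVal)
    used      = np.zeros(maxVal)
    used[x]   = 1              # flag every value that occurs in x
    return values[used==1]

# optimized (with option to provide known value range)
def unique3(x,maxVal=None):
    # maxVal, if given, must be greater than every value in x
    maxVal    = maxVal or np.max(x)+1
    used      = np.zeros(maxVal,dtype=np.uint8)
    used[x]   = 1
    return np.argwhere(used==1)[:,0]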
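The mask functions above index directly with the values, so they assume non-negative integers. As a rough sketch only, the same idea could be extended to any bounded integer range by shifting the values so the minimum maps to index 0. The name unique_mask_offset is hypothetical and not part of the original answer, and the extra pass for x.min() plus the temporary x - lo array add some cost:

```python
import numpy as np

def unique_mask_offset(x):
    """Hypothetical variant of the mask approach for integers in [min, max]."""
    lo = int(x.min())
    hi = int(x.max())
    # One flag per possible value; memory grows with (hi - lo), not len(x).
    used = np.zeros(hi - lo + 1, dtype=bool)
    used[x - lo] = True          # shift so the smallest value maps to index 0
    return np.flatnonzero(used) + lo

# Check against np.unique on data that includes negative values.
x = np.random.randint(-50, 50, size=10_000)
assert np.array_equal(unique_mask_offset(x), np.unique(x))
```

Because the flags are read back in index order, the result comes out ascending without any sort. This only pays off while the value range stays small relative to the array length.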

In my tests this method is a lot faster than np.unique, and it does not involve sorting:

from timeit import timeit
count = 3
x = np.random.randint(10000, size=100000000)

t = timeit(lambda:unique1(x),number=count)
print("unique1",t)

t = timeit(lambda:unique2(x),number=count)
print("unique2",t)

t = timeit(lambda:unique3(x),number=count)
print("unique3",t)

t = timeit(lambda:unique3(x,10000),number=count)
print("unique3",t, "with known value range")


# unique1 16.894681214000002
# unique2 0.8627655060000023
# unique3 0.8411087540000004
# unique3 0.5896318829999991 with known value range

Just in case you change your mind about dependencies, here's a dirt simple numba.njit implementation:

import numba

@numba.njit
def unique(arr):
    return np.array(list(set(arr)))


%timeit unique(x) #using Alain T.'s benchmark array
2.64 s ± 799 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit np.unique(x)
5.45 s ± 233 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Not as lightning fast as the mask approach above, but it doesn't require non-negative integer inputs, either.
