Get unique values of a 1D NumPy array without sorting

Question

I have many large 1D arrays and I'd like to grab the unique values. Typically, one could do:

import numpy as np

x = np.random.randint(10000, size=100000000)
np.unique(x)

However, this performs an unnecessary sort of the array. The docs for np.unique do not mention any way to retrieve the indices without sorting. Other answers involving np.unique suggest using return_index but, as I understand it, the array is still being sorted. So, I tried using set:

set(x)

But this is way slower than sorting the array with np.unique. Is there a way to retrieve the unique values of this array that avoids sorting and is faster than np.unique?

Answer 1

If your values are non-negative integers in a relatively small range (e.g. 0 ... 10000), there is an alternative way to obtain the unique values using a mask (see unique2() below):

import numpy as np

def unique1(x):
    return np.unique(x)

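def unique2(x):
    maxVal    = np.max(x)+1          # one slot per possible value 0..max
    values    = np.arange(maxVal)
    used      = np.zeros(maxVal)
    used[x]   = 1                    # flag every value that occurs in x
    return values[used==1]           # flagged values, in ascending order

# optimized (with option to provide known value range)
def unique3(x,maxVal=None):
    maxVal    = maxVal or np.max(x)+1   # skip the max() scan if the range is known
    used      = np.zeros(maxVal,dtype=np.uint8)
    used[x]   = 1
    return np.argwhere(used==1)[:,0]    # indices of the flagged slots are the unique values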

In my tests this method is a lot faster than np.unique, and it does not involve sorting:

from timeit import timeit
count = 3
x = np.random.randint(10000, size=100000000)

t = timeit(lambda:unique1(x),number=count)
print("unique1",t)

t = timeit(lambda:unique2(x),number=count)
print("unique2",t)

t = timeit(lambda:unique3(x),number=count)
print("unique3",t)

t = timeit(lambda:unique3(x,10000),number=count)
print("unique3",t, "with known value range")


# unique1 16.894681214000002
# unique2 0.8627655060000023
# unique3 0.8411087540000004
# unique3 0.5896318829999991 with known value range
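
The mask approach above indexes directly with the values, so it only works for non-negative integers. As a rough sketch (not benchmarked here), the same idea could be extended to integer arrays containing negative values by shifting everything by the minimum. The function name unique_with_offset is made up for this illustration, and memory use is still proportional to the value range:

```python
import numpy as np

def unique_with_offset(x):
    """Mask-based unique for integer arrays that may contain negatives.

    Hypothetical helper for illustration; assumes x is a non-empty
    integer array whose value range (max - min) is small enough to
    allocate a mask for.
    """
    lo = int(x.min())
    hi = int(x.max())
    used = np.zeros(hi - lo + 1, dtype=bool)   # one flag per value in [lo, hi]
    used[x - lo] = True                        # shift values so the smallest maps to slot 0
    return np.flatnonzero(used) + lo           # shift the flagged slots back

x = np.array([3, -2, 7, 3, -2, 0])
print(unique_with_offset(x))   # [-2  0  3  7]
```

Note that x - lo creates a temporary array the same size as x, so for very large inputs this costs one extra pass and one extra allocation compared with unique3.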

Answer 2

Just in case you change your mind about dependencies, here's a dirt simple numba.njit implementation:

import numba

@numba.njit
def unique(arr):
    return np.array(list(set(arr)))


%timeit unique(x) #using Alain T.'s benchmark array
2.64 s ± 799 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit np.unique(x)
5.45 s ± 233 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Not as lightning fast as the mask approach above, but it doesn't require non-negative integer inputs, either.
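
One thing to keep in mind with a set-based approach is that the result comes back in no particular order. If the order of first appearance matters, a plain-Python sketch along the following lines might do, with no numba required. The name unique_first_seen is invented for this example, and it is likely much slower than either answer on very large arrays because it goes through Python objects:

```python
import numpy as np

def unique_first_seen(arr):
    """Unique values in order of first appearance (illustrative helper).

    Relies on dict keys preserving insertion order (Python 3.7+).
    """
    return np.array(list(dict.fromkeys(arr.tolist())))

x = np.array([2.5, -1.0, 2.5, 7.0, -1.0])
print(unique_first_seen(x))   # [ 2.5 -1.   7. ]
```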
