Number of unique elements per row in a NumPy array

Question

For example, for

a = np.array([[1, 0, 0], [1, 0, 0], [2, 3, 4]])

I would like to get

[2, 2, 3]

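Is there a way to do this without a for loop or np.vectorize?

Edit: The actual data consists of 1000 rows of 100 elements each, with each element ranging from 1 to 365. The ultimate goal is to determine the percentage of rows that contain duplicates. This is a homework problem that I have already solved (with a for loop), but I was wondering whether there is a better way to do it with NumPy.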

Answer 1

Approach #1

A vectorized approach with sorting -

In [8]: b = np.sort(a,axis=1)

In [9]: (b[:,1:] != b[:,:-1]).sum(axis=1)+1
Out[9]: array([2, 2, 3])
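The question's stated end goal is the percentage of rows that contain a duplicate. A minimal sketch of how the sorted-row count could feed into that, assuming a row "has a duplicate" exactly when its unique count is smaller than its length. The helper names nunique_per_row and percent_rows_with_duplicates are made up for illustration and are not part of the original answer:

```python
import numpy as np

def nunique_per_row(a):
    # Sort each row so equal values sit next to each other,
    # then count the positions where the value changes.
    b = np.sort(a, axis=1)
    return (b[:, 1:] != b[:, :-1]).sum(axis=1) + 1

def percent_rows_with_duplicates(a):
    # Assumption: a row contains a duplicate exactly when it has
    # fewer unique values than it has columns.
    has_dup = nunique_per_row(a) < a.shape[1]
    return 100.0 * has_dup.mean()

a = np.array([[1, 0, 0], [1, 0, 0], [2, 3, 4]])
pct = percent_rows_with_duplicates(a)  # two of the three rows repeat a value
```

For data shaped like the question's, something like np.random.randint(1, 366, (1000, 100)) could stand in for a.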

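Approach #2

For ints that are not very large, another approach is to offset each row by an amount that separates its elements from those of every other row, then do a binned summation and count the number of non-zero bins per row -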

n = a.max()+1
a_off = a+(np.arange(a.shape[0])[:,None])*n
M = a.shape[0]*n
out = (np.bincount(a_off.ravel(), minlength=M).reshape(-1,n)!=0).sum(1)
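To make the offsetting step easier to follow, here is the same computation on the question's sample array with the intermediates kept as separate variables. This is only an illustrative walk-through. The names row_offsets and counts are introduced here, and the approach relies on the values being non-negative integers, since np.bincount requires that:

```python
import numpy as np

a = np.array([[1, 0, 0], [1, 0, 0], [2, 3, 4]])

n = a.max() + 1                                    # 5 possible values: 0..4
row_offsets = np.arange(a.shape[0])[:, None] * n   # 0, 5, 10 as a column
a_off = a + row_offsets                            # row i now lives in [i*n, (i+1)*n)

# One histogram over the flattened array, then one row of n bins per input row.
counts = np.bincount(a_off.ravel(), minlength=a.shape[0] * n).reshape(-1, n)

# A non-zero bin means that value occurred at least once in that row.
out = (counts != 0).sum(axis=1)
```

Here a_off is [[1, 0, 0], [6, 5, 5], [12, 13, 14]], so no two rows share a bin, and out comes to [2, 2, 3].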

Runtime test

Approaches as functions -

import numpy as np
import pandas as pd
from toolz import compose

def sorting(a):
    b = np.sort(a,axis=1)
    return (b[:,1:] != b[:,:-1]).sum(axis=1)+1

def bincount(a):
    n = a.max()+1
    a_off = a+(np.arange(a.shape[0])[:,None])*n
    M = a.shape[0]*n
    return (np.bincount(a_off.ravel(), minlength=M).reshape(-1,n)!=0).sum(1)

# From @wim's post   
def pandas(a):
    df = pd.DataFrame(a.T)
    return df.nunique()

# @jp_data_analysis's soln
def numpy_apply(a):
    return np.apply_along_axis(compose(len, np.unique), 1, a)

Case #1: Square array

In [164]: np.random.seed(0)

In [165]: a = np.random.randint(0,5,(10000,10000))

In [166]: %timeit numpy_apply(a)
     ...: %timeit sorting(a)
     ...: %timeit bincount(a)
     ...: %timeit pandas(a)
1 loop, best of 3: 1.82 s per loop
1 loop, best of 3: 1.93 s per loop
1 loop, best of 3: 354 ms per loop
1 loop, best of 3: 879 ms per loop

Case #2: Large number of rows

In [167]: np.random.seed(0)

In [168]: a = np.random.randint(0,5,(1000000,10))

In [169]: %timeit numpy_apply(a)
     ...: %timeit sorting(a)
     ...: %timeit bincount(a)
     ...: %timeit pandas(a)
1 loop, best of 3: 8.42 s per loop
10 loops, best of 3: 153 ms per loop
10 loops, best of 3: 66.8 ms per loop
1 loop, best of 3: 53.6 s per loop

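Extending to the number of unique elements per column

To extend, we just need to do the slicing and ufunc operations along the other axis for the two proposed approaches, like so -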

def nunique_percol_sort(a):
    b = np.sort(a,axis=0)
    return (b[1:] != b[:-1]).sum(axis=0)+1

def nunique_percol_bincount(a):
    n = a.max()+1
    a_off = a+(np.arange(a.shape[1]))*n
    M = a.shape[1]*n
    return (np.bincount(a_off.ravel(), minlength=M).reshape(-1,n)!=0).sum(1)

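Generic ndarray with a generic axis

Let's see how we could extend this to ndarrays of generic dimensions and get the unique counts along a generic axis. We will use np.diff with its axis parameter to get the consecutive differences and hence make it generic, like so -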

def nunique(a, axis):
    return (np.diff(np.sort(a,axis=axis),axis=axis)!=0).sum(axis=axis)+1

Sample run -

In [77]: a
Out[77]: 
array([[1, 0, 2, 2, 0],
       [1, 0, 1, 2, 0],
       [0, 0, 0, 0, 2],
       [1, 2, 1, 0, 1],
       [2, 0, 1, 0, 0]])

In [78]: nunique(a, axis=0)
Out[78]: array([3, 2, 3, 2, 3])

In [79]: nunique(a, axis=1)
Out[79]: array([3, 3, 2, 3, 3])
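The sample run above is 2-D. As a rough sanity check that the same function behaves on higher-dimensional input, the sketch below compares it against a loop-based reference built from np.apply_along_axis and np.unique on a small random 3-D array. The array a3 and the seed are arbitrary choices made for this illustration:

```python
import numpy as np

def nunique(a, axis):
    return (np.diff(np.sort(a, axis=axis), axis=axis) != 0).sum(axis=axis) + 1

rng = np.random.RandomState(0)
a3 = rng.randint(0, 4, (3, 4, 5))

for axis in range(a3.ndim):
    # Reference: count unique values in every 1-D slice along `axis`.
    expected = np.apply_along_axis(lambda v: len(np.unique(v)), axis, a3)
    assert np.array_equal(nunique(a3, axis), expected)
```

In each case the result has the shape of the input with the chosen axis removed, for example (4, 5) for axis=0.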

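If you are working with floating-point numbers and want to decide uniqueness based on some tolerance value rather than an exact match, we can use np.isclose. Two such options would be -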

(~np.isclose(np.diff(np.sort(a,axis=axis),axis=axis),0)).sum(axis)+1
a.shape[axis]-np.isclose(np.diff(np.sort(a,axis=axis),axis=axis),0).sum(axis)

For a custom tolerance value, pass it to np.isclose (through its rtol and atol parameters).
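As a sketch of what a custom tolerance could look like, the second expression can be wrapped in a small function. The name nunique_tol and its atol parameter are hypothetical. Only atol is exposed because the gaps are compared against 0, so the relative term in np.isclose (rtol times the absolute value of the second argument) contributes nothing:

```python
import numpy as np

def nunique_tol(a, axis, atol=1e-8):
    # Hypothetical helper: neighbours in sorted order whose gap is
    # within `atol` are treated as the same value.
    gaps = np.diff(np.sort(a, axis=axis), axis=axis)
    # The comparison target is 0, so only the absolute tolerance matters.
    return a.shape[axis] - np.isclose(gaps, 0, atol=atol).sum(axis=axis)

x = np.array([[0.1 + 0.2, 0.3, 1.0],
              [1.0, 1.05, 2.0]])

default_counts = nunique_tol(x, axis=1)           # 0.1 + 0.2 merges with 0.3
loose_counts = nunique_tol(x, axis=1, atol=0.1)   # 1.0 and 1.05 merge as well
```

Because only neighbouring values are compared, a chain of values that are each within atol of the next is counted as a single value, even if its two ends are further apart than atol.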

Answer 2

This solution via np.apply_along_axis isn't vectorized and involves a Python-level loop. But it is fairly intuitive, using the len and np.unique functions.

import numpy as np
from toolz import compose

a = np.array([[1, 0, 0], [1, 0, 0], [2, 3, 4]])

np.apply_along_axis(compose(len, np.unique), 1, a)    # [2, 2, 3]

Answer 3

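Would you be willing to consider pandas? DataFrames have a dedicated method for this

>>> import numpy as np
>>> import pandas as pd
>>> a = np.array([[1, 0, 0], [1, 0, 0], [2, 3, 4]])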
>>> df = pd.DataFrame(a.T)
>>> print(*df.nunique())
2 2 3

Answer 4

A one-liner using sort:

In [6]: np.count_nonzero(np.diff(np.sort(a)), axis=1)+1
Out[6]: array([2, 2, 3])
