Python equivalent of MATLAB's "ismember" function

Question

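After many attempts at optimizing my code, it seems that one last resort would be to run the code below on multiple cores. I don't know how to convert or restructure my code so that it runs faster using multiple cores, and I would appreciate guidance towards the end goal. The end goal is to run this code as fast as possible for arrays A and B, where each array holds about 700,000 elements. Here is the code using small arrays; the 700k-element arrays are commented out.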

import numpy as np

def ismember(a,b):
    for i in a:
        index = np.where(b==i)[0]
        if index.size == 0:
            yield 0
        else:
            yield index[0]  # lowest matching index in B


def f(A, gen_obj):
    my_array = np.arange(len(A))
    for i in my_array:
        my_array[i] = next(gen_obj)
    return my_array


#A = np.arange(700000)
#B = np.arange(700000)
A = np.array([3,4,4,3,6])
B = np.array([2,5,2,6,3])

gen_obj = ismember(A,B)

f(A, gen_obj)

print('done')
# if we print f(A, gen_obj) the output will be: [4 0 0 4 3]
# notice that the output array needs to be kept the same size as array A.

What I am trying to do is mimic a MATLAB function called ismember (of the form [Lia,Locb] = ismember(A,B)). I am only interested in the Locb part.

From MATLAB: Locb contains the lowest index in B for each value in A that is a member of B. The output array, Locb, contains 0 wherever A is not a member of B.

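One of the main problems is that I need to perform this operation as efficiently as possible. For testing I have two arrays of 700k elements. Creating a generator and stepping through its values doesn't seem to get the job done fast.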

Answer 1

Before worrying about multiple cores, I would eliminate the linear scan in your ismember function by using a dictionary:

def ismember(a, b):
    bind = {}
    for i, elt in enumerate(b):
        if elt not in bind:
            bind[elt] = i
    return [bind.get(itm, None) for itm in a]  # None can be replaced by any other "not in b" value

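Your original implementation requires a full scan of the elements in B for each element in A, making it O(len(A)*len(B)). The code above requires one full scan of B to build the dict bind. With the dict, looking up each element of A in B takes effectively constant time, which makes the whole operation O(len(A)+len(B)). If this is still too slow, then worry about making the above function run on multiple cores.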

Edit: I have also modified your indexing slightly. MATLAB uses 0 because all of its arrays start at index 1. Python/numpy arrays start at 0, so if your data set looks like this

A = [2378, 2378, 2378, 2378]
B = [2378, 2379]

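and you return 0 to mean "no element", then your results will exclude all elements of A, because every match here is at index 0. The above routine returns None for no index instead of 0. Returning -1 is an option, but Python will interpret that as the last element in the array. None will raise an exception if it is used as an index into the array. If you would like different behavior, change the second argument in the bind.get(itm, None) expression to the value you want returned.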
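As a minimal sketch of the sentinel discussion above: one way to keep the result in a NumPy integer array is to pair a -1 sentinel with a boolean mask, so the -1 entries are never used as indices. The helper name `ismember_masked` and the choice of -1 are illustrative assumptions, not part of the answer's code.

```python
import numpy as np

def ismember_masked(a, b):
    """Return (found, loc): loc[i] is the lowest 0-based index of a[i] in b, or -1."""
    first_index = {}
    for i, elt in enumerate(np.asarray(b).tolist()):
        # setdefault keeps the first (lowest) index seen for each value
        first_index.setdefault(elt, i)
    loc = np.array([first_index.get(itm, -1) for itm in np.asarray(a).tolist()], dtype=int)
    found = loc >= 0  # True where a[i] occurs somewhere in b
    return found, loc

A = np.array([2378, 2378, 2378, 2378])
B = np.array([2378, 2379])
found, loc = ismember_masked(A, B)
print(loc)             # [0 0 0 0]
print(B[loc[found]])   # [2378 2378 2378 2378]
```

Indexing with `loc[found]` rather than `loc` is what keeps the -1 entries from silently selecting the last element of B.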

Answer 2

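sfstewman's excellent answer most likely solved the issue for you.

I'd just like to add how you can achieve the same thing exclusively in numpy.

I make use of numpy's unique and in1d functions.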

B_unique_sorted, B_idx = np.unique(B, return_index=True)
B_in_A_bool = np.in1d(B_unique_sorted, A, assume_unique=True)

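B_unique_sorted contains the unique values in B, sorted.
B_idx holds, for these values, the indices into the original B.
B_in_A_bool is a boolean array the size of B_unique_sorted that stores whether each value in B_unique_sorted is in A.
Note: I need to look for (unique values from B) in A because I need the output to be returned with respect to B_idx.
Note: I assume that A is already unique.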

Now you can use B_in_A_bool to get the common values

B_unique_sorted[B_in_A_bool]

and their respective indices in the original B

B_idx[B_in_A_bool]

Finally, I assume that this is significantly faster than the pure Python for loop, although I didn't test it.
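The snippets above yield the common values and their indices in B, but not one entry per element of A. A hedged sketch of one way to get that, without assuming A is unique, is to combine np.unique with np.searchsorted. The function name `first_index_in` and the -1 sentinel for "not found" are assumptions made for illustration.

```python
import numpy as np

def first_index_in(a, b):
    """For each a[i], return the lowest 0-based index j with b[j] == a[i], or -1 if absent."""
    a = np.asarray(a)
    b = np.asarray(b)
    if b.size == 0:
        return np.full(a.shape, -1, dtype=int)
    # return_index gives the index of the first occurrence of each unique value
    b_unique_sorted, b_idx = np.unique(b, return_index=True)
    # Position where each element of a would be inserted into the sorted unique values
    pos = np.searchsorted(b_unique_sorted, a)
    # Clip so that values larger than max(b) do not index out of bounds
    pos = np.clip(pos, 0, len(b_unique_sorted) - 1)
    found = b_unique_sorted[pos] == a
    return np.where(found, b_idx[pos], -1)

A = np.array([3, 4, 4, 3, 6])
B = np.array([2, 5, 2, 6, 3])
print(first_index_in(A, B))  # [ 4 -1 -1  4  3]
```

The work is dominated by sorting B once and a binary search per element of A, so it stays vectorized for arrays of the size mentioned in the question.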

Answer 3

Try the ismember library.

pip install ismember

Simple example:

# Import library
from ismember import ismember
import numpy as np

# data
A = np.array([3,4,4,3,6])
B = np.array([2,5,2,6,3])

# Lookup
Iloc,idx = ismember(A, B)
 
# Iloc is a boolean array marking which elements of A exist in B
print(Iloc)
# [ True False False  True  True]

# idx holds the indices into B of the elements of A that exist in B
print(idx)
# [4 4 3]

print(B[idx])
# [3 3 6]

print(A[Iloc])
# [3 3 6]

# These vectors will match
A[Iloc]==B[idx]

Speed check:

from ismember import ismember
from datetime import datetime
import numpy as np

# Dict-based version from Answer 1, defined before the loop that calls it
def ismember_stack(a, b):
    bind = {}
    for i, elt in enumerate(b):
        if elt not in bind:
            bind[elt] = i
    return [bind.get(itm, None) for itm in a]  # None can be replaced by any other "not in b" value

t1=[]
t2=[]
# Create some random vectors
ns = np.random.randint(10,10000,1000)

for n in ns:
    a_vec = np.random.randint(0,100,n)
    b_vec = np.random.randint(0,100,n)

    # Run stack version
    start = datetime.now()
    out1=ismember_stack(a_vec, b_vec)
    end = datetime.now()
    t1.append(end - start)

    # Run ismember
    start = datetime.now()
    out2=ismember(a_vec, b_vec)
    end = datetime.now()
    t2.append(end - start)


print(np.sum(t1))
# 0:00:07.778331

print(np.sum(t2))
# 0:00:04.609801

The ismember function from PyPI is almost 2x faster.

Large vectors, e.g. 700,000 elements:

from ismember import ismember
from datetime import datetime
import numpy as np

A = np.random.randint(0,100,700000)
B = np.random.randint(0,100,700000)

# Lookup
start = datetime.now()
Iloc,idx = ismember(A, B)
end = datetime.now()

# Print time
print(end-start)
# 0:00:01.194801

Answer 4

Try using a list comprehension;

In [1]: import numpy as np

In [2]: A = np.array([3,4,4,3,6])

In [3]: B = np.array([2,5,2,6,3])

In [4]: [x for x in A if not x in B]
Out[4]: [4, 4]

In general, list comprehensions are faster than the equivalent for loops.

To get a list of equal length;

In [19]: list(map(lambda x: x if x not in B else False, A))
Out[19]: [False, 4, 4, False, False]

This is quite fast for small datasets:

In [20]: C = np.arange(10000)

In [21]: D = np.arange(15000, 25000)

In [22]: %timeit list(map(lambda x: x if x not in D else False, C))
1 loops, best of 3: 756 ms per loop

For large datasets, you could try using multiprocessing.Pool.map() to speed up the operation.
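One thing that may be worth checking before parallelising: `x not in D` on a NumPy array scans the whole array on every test, so the loop above does work proportional to len(C)*len(D). The sketch below, which assumes the values are hashable scalars, builds a Python set once so that each membership test is roughly constant time. `mask_members` is a hypothetical helper name.

```python
import numpy as np

def mask_members(a, b):
    """Same-length list as a: keep a[i] if it is absent from b, else False."""
    b_set = set(np.asarray(b).tolist())  # built once, O(len(b))
    return [x if x not in b_set else False for x in np.asarray(a).tolist()]

A = np.array([3, 4, 4, 3, 6])
B = np.array([2, 5, 2, 6, 3])
print(mask_members(A, B))  # [False, 4, 4, False, False]
```

Because False compares equal to 0 in Python, a different placeholder would be needed if 0 can occur as a real value in A.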

Answer 5

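Here is the exact MATLAB equivalent, returning both output arguments [Lia, Locb] as MATLAB does, except that in Python 0 is also a valid index. Hence, this function does not return the 0s. It essentially returns Locb(Locb>0). The performance is also comparable to MATLAB.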

import numpy as np

def ismember(a_vec, b_vec):
    """ MATLAB equivalent ismember function """

    bool_ind = np.isin(a_vec,b_vec)
    common = a_vec[bool_ind]  # elements of a_vec that also occur in b_vec
    common_unique, common_inv  = np.unique(common, return_inverse=True)     # common = common_unique[common_inv]
    b_unique, b_ind = np.unique(b_vec, return_index=True)  # b_unique = b_vec[b_ind]
    common_ind = b_ind[np.isin(b_unique, common_unique, assume_unique=True)]
    return bool_ind, common_ind[common_inv]

An alternative implementation that is somewhat (~5x) slower but does not use the unique function is here:

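import numpy as np

def ismember(a_vec, b_vec):
    ''' MATLAB equivalent ismember function. Slower than above implementation'''
    # Iterate b_vec in reverse so that, for repeated values, the lowest index is the one kept
    b_dict = {b_vec[i]: i for i in range(len(b_vec) - 1, -1, -1)}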
    indices = [b_dict.get(x) for x in a_vec if b_dict.get(x) is not None]
    booleans = np.in1d(a_vec, b_vec)
    return booleans, np.array(indices, dtype=int)
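The question asks for an output the same size as A, while the functions above return only the indices of the matches. As a small sketch, a compact (Lia, idx) pair can be expanded back into a full-length, MATLAB-style Locb, shifting to 1-based indices so that 0 is free to mean "not found". The helper name `to_matlab_locb` is an assumption made for illustration.

```python
import numpy as np

def to_matlab_locb(lia, idx):
    """Expand a compact (Lia, idx) pair into a full-length, 1-based Locb with 0 for 'not found'."""
    lia = np.asarray(lia, dtype=bool)
    locb = np.zeros(lia.shape, dtype=int)
    locb[lia] = np.asarray(idx) + 1  # shift 0-based Python indices to 1-based
    return locb

# Values taken from the A and B used in the question
lia = np.array([True, False, False, True, True])
idx = np.array([4, 4, 3])
print(to_matlab_locb(lia, idx))  # [5 0 0 5 4]
```

Subtract 1 from the nonzero entries before using them to index a NumPy array.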

5 answers

Answer 1: 18, accepted, 2013-04-07 15:59:41
Answer 2: 15, 2013-04-07 19:30:59
Answer 3: 2, 2020-06-21 22:07:50
Answer 4: 1, 2013-04-07 15:57:50
Answer 5: 1, 2017-09-17 20:23:27
