Converting to an index array in NumPy

Question

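Similar to return_inverse in numpy.unique:

If I have a NumPy array a: [['a' 'b'] ['b' 'c'] ['c' 'c'] ['c' 'b']],

I want to convert an array b: [['b' 'c'] ['a' 'b'] ['c' 'c'] ['a' 'b'] ['c' 'c']] into [1 0 2 0 2], i.e. the index in a of each row of b.

Is there a smart way to do this conversion?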

Answer 1

Perhaps this is easier to do with plain lists (which you can obtain from NumPy arrays with the .tolist() method):

a = [['a', 'b'], ['b', 'c'], ['c', 'c'], ['c', 'b']]
b = [['b', 'c'], ['a', 'b'], ['c', 'c'], ['a', 'b'], ['c', 'c']]

print([a.index(x) for x in b])
# [1, 0, 2, 0, 2]

Alternatively, written as a function that takes and returns NumPy arrays and handles the case where a needle is not in the haystack:

import numpy as np


def find_by_list(haystack, needles):
    haystack = haystack.tolist()
    result = []
    for needle in needles.tolist():
        try:
            result.append(haystack.index(needle))
        except ValueError:
            result.append(-1)
    return np.array(result)
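
As a quick check of the fallback, the sketch below repeats the function above and looks up a row that is absent from the haystack. The row ['z', 'z'] is made up purely for illustration.

```python
import numpy as np


def find_by_list(haystack, needles):
    # Same logic as above: list.index() raises ValueError on a miss,
    # which is turned into the sentinel value -1.
    haystack = haystack.tolist()
    result = []
    for needle in needles.tolist():
        try:
            result.append(haystack.index(needle))
        except ValueError:
            result.append(-1)
    return np.array(result)


a = np.array([['a', 'b'], ['b', 'c'], ['c', 'c'], ['c', 'b']])
# ['z', 'z'] is a hypothetical row that does not occur in `a`.
b = np.array([['b', 'c'], ['z', 'z'], ['c', 'c']])

found = find_by_list(a, b)
print(found)  # [ 1 -1  2]
```

Since -1 is also a valid (last-element) index in Python, the result should be checked for -1 before it is used to index back into the haystack.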

This should be approximately as fast as a NumPy-aware solution based on np.where() (provided that the np.all() reduction can be done over the first axis), e.g.:

import numpy as np


def find_by_np(haystack, needles, haystack_axis=-1, needles_axis=-1, keepdims=False):
    if haystack_axis:
        haystack = haystack.swapaxes(0, haystack_axis)
    if needles_axis:
        needles = needles.swapaxes(0, needles_axis)
    n = needles.shape[0]
    m = haystack.ndim - 1
    shape = haystack.shape[1:]
    result = np.full((m,) + needles.shape[1:], -1)
    haystack = haystack.reshape(n, -1)
    needles = needles.reshape(n, -1)
    _, match, index = np.nonzero(np.all(
        haystack[:, None, :] == needles[:, :, None],
        axis=0, keepdims=True))
    result.reshape(m, -1)[:, match] = np.unravel_index(index, shape)
    if not keepdims and result.shape[0] == 1:
        result = np.squeeze(result, 0)
    return result
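
Stripped of the axis handling, the core of the approach above is a broadcast comparison of every needle against every haystack row. The following is a minimal, 2-D-only sketch of that idea rather than a replacement for the function above. The name find_by_broadcast is hypothetical, and it assumes a non-empty haystack.

```python
import numpy as np


def find_by_broadcast(haystack, needles):
    # Hypothetical helper: 2-D inputs only, rows are the items to match.
    # eq[i, j] is True when needles[i] equals haystack[j] in every column.
    eq = (needles[:, None, :] == haystack[None, :, :]).all(axis=-1)
    # argmax returns the position of the first True per row
    # (and 0 when the row contains no True at all) ...
    first = eq.argmax(axis=1)
    # ... so rows without any match are replaced by -1.
    return np.where(eq.any(axis=1), first, -1)


a = np.array([['a', 'b'], ['b', 'c'], ['c', 'c'], ['c', 'b']])
b = np.array([['b', 'c'], ['a', 'b'], ['c', 'c'], ['a', 'b'], ['c', 'c']])

idx = find_by_broadcast(a, b)
print(idx)  # [1 0 2 0 2]
```

The intermediate comparison has shape (number of needles, number of haystack rows, number of columns), so memory use grows with the product of the input sizes.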

But both are slower than a simple loop accelerated with the Numba JIT, e.g.:

import numpy as np
import numba as nb


def find_by_loop(haystack, needles):
    n, m = haystack.shape
    l, m_ = needles.shape
    result = np.full(l, -1)
    if m != m_:
        return result
    for i in range(l):
        is_equal = False  # stays False if haystack has no rows
        for j in range(n):
            is_equal = True
            for k in range(m):
                if haystack[j, k] != needles[i, k]:
                    is_equal = False
                    break
            if is_equal:
                break
        if is_equal:
            result[i] = j
    return result


find_by_nb = nb.jit(find_by_loop)
find_by_nb.__name__ = 'find_by_nb'

They all give the same result:

funcs = find_by_list, find_by_np, find_by_loop, find_by_nb


a = np.array([['a', 'b'], ['b', 'c'], ['c', 'c'], ['c', 'b']])
b = np.array([['b', 'c'], ['a', 'b'], ['c', 'c'], ['a', 'b'], ['c', 'c']])
print(a.shape, b.shape)
for func in funcs:
    print(f'{func.__name__:>12s}(a, b) = {func(a, b)}')
# find_by_list(a, b) = [1 0 2 0 2]
#   find_by_np(a, b) = [1 0 2 0 2]
# find_by_loop(a, b) = [1 0 2 0 2]
#   find_by_nb(a, b) = [1 0 2 0 2]

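Timings, where n is the number of columns, m the number of needles, and k the number of haystack rows (the %timeit magic requires IPython or Jupyter):

import itertools

print(f'({"n":<4s}, {"m":<4s}, {"k":<4s})', end='  ')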
for func in funcs:
    print(f'{func.__name__:>15s}', end='    ')
print()
for n, m, k in itertools.product((5, 50, 500), repeat=3):
    a = np.random.randint(0, 100, (k, n))
    b = np.random.randint(0, 100, (m, n))
    print(f'({n:<4d}, {m:<4d}, {k:<4d})', end='  ')
    for func in funcs:
        result = %timeit -n3 -r10 -q -o func(a, b)
        print(f'{result.best * 1e3:12.3f} ms', end='    ')
    print()
# (n   , m   , k   )     find_by_list         find_by_np       find_by_loop         find_by_nb    
# (5   , 5   , 5   )         0.008 ms           0.048 ms           0.021 ms           0.001 ms    
# (5   , 5   , 50  )         0.018 ms           0.031 ms           0.176 ms           0.001 ms    
# (5   , 5   , 500 )         0.132 ms           0.092 ms           1.754 ms           0.006 ms    
# (5   , 50  , 5   )         0.065 ms           0.031 ms           0.184 ms           0.001 ms    
# (5   , 50  , 50  )         0.139 ms           0.093 ms           1.756 ms           0.006 ms    
# (5   , 50  , 500 )         1.096 ms           0.684 ms          17.546 ms           0.049 ms    
# (5   , 500 , 5   )         0.658 ms           0.093 ms           1.871 ms           0.006 ms    
# (5   , 500 , 50  )         1.383 ms           0.699 ms          17.504 ms           0.051 ms    
# (5   , 500 , 500 )         9.102 ms           7.752 ms         177.754 ms           0.491 ms    
# (50  , 5   , 5   )         0.026 ms           0.061 ms           0.022 ms           0.001 ms    
# (50  , 5   , 50  )         0.054 ms           0.042 ms           0.174 ms           0.002 ms    
# (50  , 5   , 500 )         0.356 ms           0.203 ms           1.759 ms           0.006 ms    
# (50  , 50  , 5   )         0.232 ms           0.042 ms           0.185 ms           0.001 ms    
# (50  , 50  , 50  )         0.331 ms           0.205 ms           1.744 ms           0.006 ms    
# (50  , 50  , 500 )         1.332 ms           2.422 ms          17.492 ms           0.051 ms    
# (50  , 500 , 5   )         2.328 ms           0.197 ms           1.882 ms           0.006 ms    
# (50  , 500 , 50  )         3.092 ms           2.405 ms          17.618 ms           0.052 ms    
# (50  , 500 , 500 )        11.088 ms          18.989 ms         175.568 ms           0.479 ms    
# (500 , 5   , 5   )         0.205 ms           0.035 ms           0.023 ms           0.001 ms    
# (500 , 5   , 50  )         0.410 ms           0.137 ms           0.187 ms           0.001 ms    
# (500 , 5   , 500 )         2.800 ms           1.914 ms           1.894 ms           0.006 ms    
# (500 , 50  , 5   )         1.868 ms           0.138 ms           0.201 ms           0.001 ms    
# (500 , 50  , 50  )         2.154 ms           1.814 ms           1.902 ms           0.006 ms    
# (500 , 50  , 500 )         6.352 ms          16.343 ms          19.108 ms           0.050 ms    
# (500 , 500 , 5   )        19.798 ms           1.957 ms           2.020 ms           0.006 ms    
# (500 , 500 , 50  )        20.922 ms          13.571 ms          18.850 ms           0.052 ms    
# (500 , 500 , 500 )        35.947 ms         139.923 ms         189.747 ms           0.481 ms

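The timings show that Numba gives the fastest (and most memory-efficient) solution, while its non-JIT-accelerated counterpart is the slowest in most of the measured cases. The NumPy-based and list-based solutions fall in between, with varying relative speed. For larger inputs, however, the list-based one should be the faster of the two on average, because it short-circuits better.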
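
The short-circuiting mentioned above can be made visible by counting element comparisons. The sketch below is only an illustration: Counted is a hypothetical wrapper class, and the haystack contents are made up so that each row is unique.

```python
class Counted:
    """Hypothetical wrapper that counts how often == is evaluated."""
    calls = 0

    def __init__(self, value):
        self.value = value

    def __eq__(self, other):
        Counted.calls += 1
        return self.value == other.value


def count_comparisons(haystack, needle):
    # Reset the counter, run list.index(), and report how many
    # element comparisons the search needed.
    Counted.calls = 0
    position = haystack.index(needle)
    return position, Counted.calls


# Row i is [i, i + 1], so every row is distinct.
haystack = [[Counted(i), Counted(i + 1)] for i in range(1000)]

pos_first, calls_first = count_comparisons(haystack, [Counted(0), Counted(1)])
pos_last, calls_last = count_comparisons(haystack, [Counted(999), Counted(1000)])

# A match near the front ends the search early, so it needs far fewer
# comparisons than a match at the very end.
print(pos_first, calls_first)
print(pos_last, calls_last)
```

A full broadcast comparison, by contrast, evaluates every needle against every haystack row no matter where the match sits.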

Answer 2

Not the most elegant solution, but it works:

Setup (in future, please show the code that generates your example; it makes answering quicker):

import numpy as np
a = np.array([['a', 'b'], ['b', 'c'], ['c', 'c'], ['c', 'b']])
b = np.array([['b', 'c'], ['a', 'b'], ['c', 'c'], ['a', 'b'], ['c', 'c']])
desired_output = [1, 0, 2, 0, 2]

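Using the numpy.where function (as in this related question: Is there a NumPy function to return the first index of something in an array?)

We compare each item of a row of b against the corresponding column of a, multiply the boolean results, pass the product to np.where, and use a list comprehension to go through b row by row: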

output = [np.where((x[0]==a[:,0]) * (x[1]==a[:,1]))[0][0] for x in b]

This returns the result you want.
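
A possible variation on the line above: comparing whole rows with .all(axis=1) avoids spelling out each column, and an explicit check covers rows of b that do not occur in a (the original expression raises IndexError in that case). The function name first_row_index and the -1 convention are assumptions made for this sketch.

```python
import numpy as np


def first_row_index(a, row, missing=-1):
    # Hypothetical helper: positions of the rows of `a` equal to `row`.
    hits = np.where((a == row).all(axis=1))[0]
    # np.where yields an empty index array when nothing matches.
    return int(hits[0]) if hits.size else missing


a = np.array([['a', 'b'], ['b', 'c'], ['c', 'c'], ['c', 'b']])
b = np.array([['b', 'c'], ['a', 'b'], ['c', 'c'], ['a', 'b'], ['c', 'c']])

output = [first_row_index(a, x) for x in b]
print(output)  # [1, 0, 2, 0, 2]
print(first_row_index(a, np.array(['z', 'z'])))  # -1
```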

Answer 3

Perhaps a fun way of doing it?

# a and b are plain Python lists here, not NumPy arrays
a.append(None)
aa = np.array(a, dtype=object)[:-1]  # Note 1

b.append(None)
bb = np.array(b, dtype=object)[:-1]

ind_arr = bb[:, None] == aa          # Note 2
np.nonzero(ind_arr)[1]

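Note 1: The first step is mostly overhead to obtain a 1-D array of object dtype. Otherwise, NumPy forces a 2-D array of str dtype, which does not help for this application. Read more about it in this answer, which also describes some alternatives.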

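Note 2: This creates a 2-D boolean mask in which each element of bb is compared for equality with each element of aa, like so: ind_arr[i, j] = (bb[i] == aa[j]).
The next line takes this mask and extracts the axis-1 indices of the True values (the positions where the comparison evaluated to True). This works because the aa values lie along axis 1 of the comparison mask.
Another discussion for a better understanding of this.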
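
One possible alternative to the append(None) trick, sketched under the assumption that a and b are plain lists of lists: allocate an empty object array and fill it row by row. The helper name to_object_rows is made up for this sketch.

```python
import numpy as np


def to_object_rows(rows):
    # Hypothetical helper: 1-D object array whose elements are the rows.
    out = np.empty(len(rows), dtype=object)
    for i, row in enumerate(rows):
        out[i] = row
    return out


a = [['a', 'b'], ['b', 'c'], ['c', 'c'], ['c', 'b']]
b = [['b', 'c'], ['a', 'b'], ['c', 'c'], ['a', 'b'], ['c', 'c']]

aa = to_object_rows(a)
bb = to_object_rows(b)

# ind_arr[i, j] is True when bb[i] == aa[j] (plain list equality).
ind_arr = bb[:, None] == aa
print(ind_arr.shape)           # (5, 4)
print(np.nonzero(ind_arr)[1])  # [1 0 2 0 2]
```

If a row of bb does not occur in aa, np.nonzero simply returns fewer entries, so the output can be shorter than bb rather than containing a marker such as -1.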

However, if you are after speed, norok2's answer using lists is much faster. This approach may still have creative applications. Cheers!

3 answers

Answer 1
1 Accepted 2020-05-21 10:50:15

Answer 2
0 2020-05-21 10:43:45

Answer 3
0 2020-05-21 13:37:10
