
Numpy - how to convert an array of vector indices to a mask?

Given an np.ndarray named indices with n rows and a variable-length vector in each row, I want to create a boolean mask with n rows and m columns, where m is a value known in advance that is large enough to cover the greatest value possible in indices. Note that the values in indices are per-row indices, not global matrix indices.

For example, given:

import numpy as np

indices = np.array([
    [2, 0],
    [0],
    [4, 7, 1]
], dtype=object)  # ragged rows need dtype=object on recent NumPy versions

# Expected output
print(mask)
[[ True False  True False False False False False]
 [ True False False False False False False False]
 [False  True False False  True False False  True]]

m is known beforehand (it is the length of each row in mask) and doesn't need to be inferred from indices.

Notice: This is different from converting an array of indices to a mask where the indices refer to the resulting matrix indices.
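To make the distinction concrete, here is a small sketch on a hypothetical 2 x 4 example (the data is not from the question). It builds the same mask once from per-row column indices and once from the equivalent flat indices, computed as row * m + col.

```python
import numpy as np

# Hypothetical 2 x 4 example, not taken from the question.
n_rows, m = 2, 4
per_row = [[1, 3], [0]]  # column indices, one list per row

# Per-row interpretation: row i gets True at the columns listed in per_row[i].
mask = np.zeros((n_rows, m), dtype=bool)
for i, cols in enumerate(per_row):
    mask[i, cols] = True

# Equivalent flat ("global") indices into the raveled matrix: row * m + col.
flat = [i * m + c for i, cols in enumerate(per_row) for c in cols]
mask_from_flat = np.zeros(n_rows * m, dtype=bool)
mask_from_flat[flat] = True
mask_from_flat = mask_from_flat.reshape(n_rows, m)

print(flat)                                  # [1, 3, 4]
print(np.array_equal(mask, mask_from_flat))  # True
```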

Here's one way -

def mask_from_indices(indices, ncols=None):
    # Extract column indices
    col_idx = np.concatenate(indices)

    # If number of cols is not given, infer it based on max column index
    if ncols is None:
        ncols = col_idx.max()+1

    # Length of indices, to be used as no. of rows in o/p
    n = len(indices)

    # Initialize o/p array
    out = np.zeros((n,ncols), dtype=bool)

    # Lengths of each index element that represents each group of col indices
    lens = np.array(list(map(len,indices)))

    # Use np.repeat to generate all row indices
    row_idx = np.repeat(np.arange(len(lens)),lens)

    # Finally use row, col indices to set True values
    out[row_idx,col_idx] = True
    return out

Sample run -

In [89]: mask_from_indices(indices)
Out[89]: 
array([[ True, False,  True, False, False, False, False, False],
       [ True, False, False, False, False, False, False, False],
       [False,  True, False, False,  True, False, False,  True]])
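To see what the np.repeat step produces, the sketch below prints the intermediate arrays for the sample data. It keeps the rows as plain Python lists (an assumption made here to sidestep ragged-array construction) and passes 8 as the number of columns.

```python
import numpy as np

# The question's sample data, kept as plain Python lists.
indices = [[2, 0], [0], [4, 7, 1]]

# One length per row, then repeat each row number that many times.
lens = np.array([len(row) for row in indices])      # [2 1 3]
row_idx = np.repeat(np.arange(len(indices)), lens)  # [0 0 1 2 2 2]
col_idx = np.concatenate(indices)                   # [2 0 0 4 7 1]

# row_idx[k], col_idx[k] together name one True cell of the mask.
out = np.zeros((len(indices), 8), dtype=bool)
out[row_idx, col_idx] = True

print(row_idx)
print(col_idx)
```

Each pair (row_idx[k], col_idx[k]) addresses one cell, so a single fancy-indexing assignment sets all six True values at once.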

Here is a variant:

def create_mask(indices, m):
    mask = np.zeros((len(indices), m), dtype=bool)
    for i, idx in enumerate(indices):
        mask[i, idx] = True
    return mask

Usage:

>>> create_mask(indices, 8)
array([[ True, False,  True, False, False, False, False, False],
       [ True, False, False, False, False, False, False, False],
       [False,  True, False, False,  True, False, False,  True]])

While there is no direct way of doing this in a fully vectorized way, for larger inputs a single application of mask[full_row_indices, full_col_indices] with the pre-computed full list of indices is faster than multiple applications of mask[partial_row_indices, partial_col_indices]. On the other hand, the multiple applications are less demanding memory-wise, because no intermediate full_row_indices / full_col_indices need to be built. Of course, this would generally depend on the length of indices.
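The trade-off can be tried out with a rough sketch like the one below. The input sizes, the random data and the function names are assumptions chosen for illustration; timings will vary by machine, so no expected numbers are shown.

```python
import timeit
import numpy as np

rng = np.random.default_rng(0)
n, m = 200, 200
# Hypothetical input: each row holds between 1 and m unique column indices.
indices = [rng.choice(m, size=rng.integers(1, m + 1), replace=False) for _ in range(n)]


def many_small_assignments():
    # One fancy-indexing assignment per row; no extra index arrays are built.
    mask = np.zeros((n, m), dtype=bool)
    for i, cols in enumerate(indices):
        mask[i, cols] = True
    return mask


def one_big_assignment():
    # Two intermediate arrays of sum(lens) integers each, then one assignment.
    lens = [len(cols) for cols in indices]
    rows = np.repeat(np.arange(n), lens)
    cols = np.concatenate(indices)
    mask = np.zeros((n, m), dtype=bool)
    mask[rows, cols] = True
    return mask


# Both strategies must produce the same mask before timing them.
assert np.array_equal(many_small_assignments(), one_big_assignment())
for func in (many_small_assignments, one_big_assignment):
    print(func.__name__, timeit.timeit(func, number=10))
```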

Just to get some feeling for how much faster the different possible solutions could be, the following functions have been tested:

import numpy as np
import random


def gen_mask_direct(col_indices, cols=None):
    if cols is None:
        cols = np.max(np.concatenate(col_indices)) + 1
    rows = len(col_indices)
    mask = np.zeros((rows, cols), dtype=bool)
    for row_index, col_index in enumerate(col_indices):
        mask[row_index, col_index] = True
    return mask 


def gen_mask_loops(col_indices, cols=None):
    rows = len(col_indices)
    row_indices = tuple(i for i, j in enumerate(col_indices) for _ in j)
    col_indices = tuple(sum(col_indices, ()))
    if cols is None:
        cols = np.max(col_indices) + 1
    mask = np.zeros((rows, cols), dtype=bool)
    mask[row_indices, col_indices] = True
    return mask


def gen_mask_np_repeat(col_indices, cols=None):
    rows = len(col_indices)
    lengths = list(map(len, col_indices))
    row_indices = np.repeat(np.arange(rows), lengths)
    col_indices = np.concatenate(col_indices)
    if cols is None:
        cols = np.max(col_indices) + 1
    mask = np.zeros((rows, cols), dtype=bool)
    mask[row_indices, col_indices] = True
    return mask


def gen_mask_np_concatenate(col_indices, cols=None):
    rows = len(col_indices)
    row_indices = tuple(np.full(len(col_index), i) for i, col_index in enumerate(col_indices))
    row_indices = np.concatenate(row_indices)
    col_indices = np.concatenate(col_indices)
    if cols is None:
        cols = np.max(col_indices) + 1
    mask = np.zeros((rows, cols), dtype=bool)
    mask[row_indices, col_indices] = True
    return mask

gen_mask_direct() is basically @Derlin's answer and implements multiple applications of mask[partial_row_indices, partial_col_indices]. All the others implement a single application of mask[full_row_indices, full_col_indices], with different ways of preparing full_row_indices and full_col_indices:

  • gen_mask_loops() uses direct looping
  • gen_mask_np_repeat() uses np.repeat() (and it is substantially the same as @Divakar's answer)
  • gen_mask_np_concatenate() uses a combination of np.full() and np.concatenate()

A quick sanity check indicates that all these are equivalent:

funcs = gen_mask_direct, gen_mask_loops, gen_mask_np_repeat, gen_mask_np_concatenate

random.seed(0)
test_inputs = [
    (tuple(
        tuple(sorted(set([random.randint(0, n - 1) for _ in range(random.randint(1, n - 1))])))
                for _ in range(random.randint(1, n - 1))))
    for n in range(5, 6)
    ]
print(test_inputs)
# [((0, 2, 3, 4), (2, 3, 4), (1, 4), (0, 1, 4))]

for func in funcs:
    print('Func:', func.__name__)
    for test_input in test_inputs:    
        print(func(test_input).astype(int))
Func: gen_mask_direct
[[1 0 1 1 1]
 [0 0 1 1 1]
 [0 1 0 0 1]
 [1 1 0 0 1]]
Func: gen_mask_loops
[[1 0 1 1 1]
 [0 0 1 1 1]
 [0 1 0 0 1]
 [1 1 0 0 1]]
Func: gen_mask_np_repeat
[[1 0 1 1 1]
 [0 0 1 1 1]
 [0 1 0 0 1]
 [1 1 0 0 1]]
Func: gen_mask_np_concatenate
[[1 0 1 1 1]
 [0 0 1 1 1]
 [0 1 0 0 1]
 [1 1 0 0 1]]

Here are some benchmarks (using the code from here):

[Figure: benchmark_full]

and zooming to the fastest:

[Figure: benchmark_zoom]

supporting the overall statement that, typically, a single application of mask[...] for full indices is faster than multiple applications of mask[...] for partial indices.


For completeness, the following code was used to generate the inputs, compare the outputs, run the benchmarks and prepare the plots:

def gen_input(n):
    random.seed(0)
    return tuple(
        tuple(sorted(set([random.randint(0, n - 1) for _ in range(random.randint(n // 2, n - 1))])))
        for _ in range(random.randint(n // 2, n - 1)))


def equal_output(a, b):
    return np.all(a == b)


input_sizes = tuple(int(2 ** (2 + (3 * i) / 4)) for i in range(13))
print('Input Sizes:\n', input_sizes, '\n')


runtimes, input_sizes, labels, results = benchmark(
    funcs, gen_input=gen_input, equal_output=equal_output,
    input_sizes=input_sizes)


plot_benchmarks(runtimes, input_sizes, labels, units='ms')
plot_benchmarks(runtimes, input_sizes, labels, units='ms', zoom_fastest=2)
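benchmark() and plot_benchmarks() come from the linked code and are not reproduced in this post. As a rough stand-in for the timing part only, a minimal sketch could look like the following; simple_benchmark, toy_input, mask_direct and mask_repeat are hypothetical names introduced here, and the return value is simplified to a dict of per-call times rather than the four values returned by the original helper.

```python
import random
import timeit
import numpy as np


def simple_benchmark(funcs, gen_input, equal_output, input_sizes, repeat=3, number=5):
    """Hypothetical stand-in: best-of-`repeat` time in seconds per call."""
    runtimes = {func.__name__: [] for func in funcs}
    for size in input_sizes:
        data = gen_input(size)
        reference = funcs[0](data)
        for func in funcs:
            # Every candidate must agree with the first function's output.
            assert equal_output(reference, func(data)), func.__name__
            best = min(timeit.repeat(lambda: func(data), repeat=repeat, number=number))
            runtimes[func.__name__].append(best / number)
    return runtimes


def toy_input(n):
    # n rows, each with 1..n sorted unique column indices below n.
    rnd = random.Random(0)
    return [sorted(rnd.sample(range(n), rnd.randint(1, n))) for _ in range(n)]


def mask_direct(col_indices):
    cols = max(max(row) for row in col_indices) + 1
    mask = np.zeros((len(col_indices), cols), dtype=bool)
    for i, row in enumerate(col_indices):
        mask[i, row] = True
    return mask


def mask_repeat(col_indices):
    rows = np.repeat(np.arange(len(col_indices)), [len(r) for r in col_indices])
    cols = np.concatenate(col_indices)
    mask = np.zeros((len(col_indices), cols.max() + 1), dtype=bool)
    mask[rows, cols] = True
    return mask


sizes = (4, 16, 64)
runtimes = simple_benchmark(
    (mask_direct, mask_repeat), gen_input=toy_input,
    equal_output=np.array_equal, input_sizes=sizes)
for name, times in runtimes.items():
    print(name, times)
```

Plotting is left out; the per-size timings in runtimes could be passed to any plotting library.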
