
Broadcasting N-dim array to (N+1)-dim array and summing on all but 1 dim

Assume you have a numpy array with shape (a,b,c) and a boolean mask of shape (a,b,c,d). I would like to apply the mask to the array, iterating over the last axis, sum the masked array along the first three axes, and obtain a list (or an array) of length/shape (d,). I tried to do this with a list comprehension:

Result = [np.sum(Array[Mask[:,:,:,i]], axis=(0,1,2)) for i in range(d)]

It works, but it does not look very pythonic, and it is a bit slow as well. I also tried something like

Array = Array[:,:,:,np.newaxis]
Result = np.sum(Array[Mask], axis=(0,1,2))

but of course this doesn't work, since the dimension of the Mask along its last axis, d, is larger than the dimension of the last axis of the Array, 1. Also, consider that each axis could have a size of order 100 or 200, so repeating the Array d times along a new last axis with np.repeat would be really memory-intensive, and I would like to avoid this. Are there any faster and more pythonic alternatives to the list comprehension?

How about

Array.reshape(-1)@Mask.reshape(-1,d)

Since you are summing over the first three axes anyway, you may as well merge them, after which it is easy to see that the operation can be written as a matrix-vector product:

Example:

a,b,c,d = 4,5,6,7
Mask = np.random.randint(0,2,(a,b,c,d),bool)
Array = np.random.randint(0,10,(a,b,c))
[np.sum(Array[Mask[:,:,:,i]]) for i in range(d)]
# [310, 237, 253, 261, 229, 268, 184]    
Array.reshape(-1)@Mask.reshape(-1,d)
# array([310, 237, 253, 261, 229, 268, 184])
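The same contraction can also be written with np.tensordot, which sums over the three shared axes directly without the explicit reshape (an equivalent formulation, not from the original answer):

```python
import numpy as np

a, b, c, d = 4, 5, 6, 7
rng = np.random.default_rng(0)
Mask = rng.integers(0, 2, (a, b, c, d)).astype(bool)
Array = rng.integers(0, 10, (a, b, c))

# matrix-vector product after merging the first three axes:
via_matmul = Array.reshape(-1) @ Mask.reshape(-1, d)
# tensordot contracts axes (0, 1, 2) of both operands, leaving shape (d,):
via_tensordot = np.tensordot(Array, Mask, axes=([0, 1, 2], [0, 1, 2]))
print(np.array_equal(via_matmul, via_tensordot))  # True
```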

The most straightforward way of broadcasting an N-dimensional array to a matching (N+1)-dimensional array is to use np.broadcast_to():

import numpy as np


arr = np.random.randint(0, 100, (2, 3))
mask = np.random.randint(0, 2, (2, 3, 4), dtype=bool)
b_arr = np.broadcast_to(arr[..., None], mask.shape)
print(mask.shape == b_arr.shape)
# True
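This is also why broadcasting is preferable to np.repeat for the memory concern raised in the question: np.broadcast_to() returns a read-only view with stride 0 along the new axis, so no data is copied. A small sketch illustrating the difference (sizes are arbitrary):

```python
import numpy as np

arr = np.zeros((10, 10, 10))
d = 5

# np.repeat materializes d full copies of the data:
repeated = np.repeat(arr[..., None], d, axis=-1)
print(repeated.nbytes)   # 40000 -> d times the original 8000 bytes

# np.broadcast_to returns a view; the new axis has stride 0,
# so no data is duplicated in memory:
view = np.broadcast_to(arr[..., None], (10, 10, 10, d))
print(view.strides[-1])  # 0
```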

However, as @hpaulj already pointed out, you cannot use mask for slicing b_arr without losing the dimensions.
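A quick sketch of why: boolean indexing always returns a flat 1-D array of the selected elements, so the per-slice structure along the last axis is gone.

```python
import numpy as np

arr = np.arange(6).reshape(2, 3)
mask = np.random.randint(0, 2, (2, 3, 4), dtype=bool)
b_arr = np.broadcast_to(arr[..., None], mask.shape)

# boolean indexing flattens the result to one dimension:
print(b_arr[mask].ndim)   # 1
print(b_arr[mask].shape)  # (mask.sum(),)
```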


Given that you just want to sum the elements together and summing zeroes "does not hurt", you can simply multiply your array and your mask element-wise: this keeps the correct dimensions, and the elements that are False in the mask contribute nothing to the subsequent sum of the corresponding array elements:

result = np.sum(b_arr * mask, axis=tuple(range(mask.ndim - 1)))

or, since * will do the broadcasting automatically:

result = np.sum(arr[..., None] * mask, axis=tuple(range(mask.ndim - 1)))

without the need to use np.broadcast_to() in the first place (but you still need to match the number of dimensions, i.e. use arr[..., None] and not just arr).
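A quick check (with small, arbitrary sizes) that this multiply-then-sum approach matches the list comprehension from the question:

```python
import numpy as np

rng = np.random.default_rng(42)
arr = rng.integers(0, 100, (2, 3, 4))
mask = rng.integers(0, 2, (2, 3, 4, 5)).astype(bool)

# element-wise multiplication, then sum over all but the last axis:
multiplied = np.sum(arr[..., None] * mask, axis=tuple(range(mask.ndim - 1)))
# the original per-slice list comprehension:
baseline = np.array([np.sum(arr[mask[..., i]]) for i in range(mask.shape[-1])])
print(np.array_equal(multiplied, baseline))  # True
```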


As @PaulPanzer already pointed out, since you want to sum over all but one dimension, this can be further simplified using np.matmul() / @:

result2 = arr.ravel() @ mask.reshape(-1, mask.shape[-1])
print(np.all(result == result2))
# True

For fancier operations involving the summation, have a look at np.einsum().
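For this particular contraction, the einsum spelling makes the axis bookkeeping explicit (equivalent to the @ formulation above):

```python
import numpy as np

rng = np.random.default_rng(0)
arr = rng.integers(0, 100, (2, 3, 4))
mask = rng.integers(0, 2, (2, 3, 4, 5)).astype(bool)

# 'abc,abcd->d' contracts the three shared axes and keeps only
# the last axis of the mask:
result = np.einsum('abc,abcd->d', arr, mask)
check = arr.ravel() @ mask.reshape(-1, mask.shape[-1])
print(np.array_equal(result, check))  # True
```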


EDIT

The catch with broadcasting is that it will create temporary arrays during the evaluation of your expressions.

With the numbers you seem to be dealing with, I simply cannot use the broadcast arrays, as I run into MemoryError, but time-wise the element-wise multiplication may still be a better approach than what you originally proposed.

Alternatively, if you are after speed, you could do this at a somewhat lower level with explicit looping in Cython or Numba.

Below you can find a couple of Numba-based solutions (working on ravel()-ed data):

  • _vector_matrix_product(): does not use any temporary arrays
  • _vector_matrix_product_mp(): same as above, but using parallel execution
  • _vector_matrix_product_sum(): uses np.sum() and parallel execution
import numpy as np
import numba as nb


@nb.jit(nopython=True)
def _vector_matrix_product(
        vect_arr,
        mat_arr,
        result_arr):
    rows, cols = mat_arr.shape
    if vect_arr.shape == result_arr.shape:
        for i in range(rows):
            for j in range(cols):
                result_arr[i] += vect_arr[j] * mat_arr[i, j]
    else:
        for i in range(rows):
            for j in range(cols):            
                result_arr[j] += vect_arr[i] * mat_arr[i, j]


@nb.jit(nopython=True, parallel=True)
def _vector_matrix_product_mp(
        vect_arr,
        mat_arr,
        result_arr):
    rows, cols = mat_arr.shape
    if vect_arr.shape == result_arr.shape:
        for i in nb.prange(rows):
            for j in nb.prange(cols):
                result_arr[i] += vect_arr[j] * mat_arr[i, j]
    else:
        for i in nb.prange(rows):
            for j in nb.prange(cols):        
                result_arr[j] += vect_arr[i] * mat_arr[i, j]


@nb.jit(nopython=True, parallel=True)
def _vector_matrix_product_sum(
        vect_arr,
        mat_arr,
        result_arr):
    rows, cols = mat_arr.shape
    if vect_arr.shape == result_arr.shape:
        for i in nb.prange(rows):
            result_arr[i] = np.sum(vect_arr * mat_arr[i, :])
    else:
        for j in nb.prange(cols):
            result_arr[j] = np.sum(vect_arr * mat_arr[:, j])


def vector_matrix_product(
        vect_arr,
        mat_arr,
        swap=False,
        dtype=None,
        mode=None):
    # Allocate the result array and dispatch to one of the Numba kernels above.
    rows, cols = mat_arr.shape
    if not dtype:
        # Infer the result dtype from a sample element-wise product.
        dtype = (vect_arr[0] * mat_arr[0, 0]).dtype
    if not swap:
        result_arr = np.zeros(cols, dtype=dtype)
    else:
        result_arr = np.zeros(rows, dtype=dtype)
    if mode == 'sum':
        _vector_matrix_product_sum(vect_arr, mat_arr, result_arr)
    elif mode == 'mp':
        _vector_matrix_product_mp(vect_arr, mat_arr, result_arr)
    else:
        _vector_matrix_product(vect_arr, mat_arr, result_arr)
    return result_arr


np.random.seed(0)
arr = np.random.randint(0, 100, (2, 3, 4))
mask = np.random.randint(0, 2, (2, 3, 4, 5), dtype=bool)
target = arr.ravel() @ mask.reshape(-1, mask.shape[-1])
print(target)
# [820 723 861 486 408]
result1 = vector_matrix_product(arr.ravel(), mask.reshape(-1, mask.shape[-1]))
print(result1)
# [820 723 861 486 408]
result2 = vector_matrix_product(arr.ravel(), mask.reshape(-1, mask.shape[-1]), mode='mp')
print(result2)
# [820 723 861 486 408]
result3 = vector_matrix_product(arr.ravel(), mask.reshape(-1, mask.shape[-1]), mode='sum')
print(result3)
# [820 723 861 486 408]

with improved timing over any list-comprehension-based solution:

arr = np.random.randint(0, 100, (256, 256, 256))
mask = np.random.randint(0, 2, (256, 256, 256, 128), dtype=bool)


%timeit np.sum(arr[..., None] * mask, axis=tuple(range(mask.ndim - 1)))
# MemoryError

%timeit arr.ravel() @ mask.reshape(-1, mask.shape[-1])
# MemoryError

%timeit np.array([np.sum(arr * mask[..., i], axis=tuple(range(mask.ndim - 1))) for i in range(mask.shape[-1])])
# 24.1 s ± 105 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit np.array([np.sum(arr[mask[..., i]]) for i in range(mask.shape[-1])])
# 46 s ± 119 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit vector_matrix_product(arr.ravel(), mask.reshape(-1, mask.shape[-1]))
# 408 ms ± 2.12 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit vector_matrix_product(arr.ravel(), mask.reshape(-1, mask.shape[-1]), mode='mp')
# 1.63 s ± 3.58 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit vector_matrix_product(arr.ravel(), mask.reshape(-1, mask.shape[-1]), mode='sum')
# 7.17 s ± 258 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

As expected, the JIT-accelerated version is the fastest, and enforcing parallelism on the code does not result in improved speed-ups. Note also that the approach with element-wise multiplication is faster than slicing (approx. twice as fast in these benchmarks).


EDIT 2

Following @max9111's suggestion, looping first over rows and then over columns causes the most time-consuming loop to run on contiguous data, resulting in a significant speed-up. Without this trick, _vector_matrix_product_sum() and _vector_matrix_product_mp() would run at essentially the same speed.
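The contiguity point can be illustrated without Numba. NumPy arrays are C-ordered (row-major) by default, so elements within a row sit next to each other in memory, while elements within a column are far apart; keeping the hot inner loop on rows therefore keeps it on contiguous data (a small illustrative sketch, sizes are arbitrary):

```python
import numpy as np

mat = np.zeros((1000, 128))              # C-ordered (row-major) by default
print(mat.strides)                       # (1024, 8): adjacent columns are 8 bytes apart,
                                         # adjacent rows 1024 bytes apart
print(mat[0].flags['C_CONTIGUOUS'])      # True:  a row slice is contiguous in memory
print(mat[:, 0].flags['C_CONTIGUOUS'])   # False: a column slice is strided
```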
