
Numpy batch dot product

Suppose I have two vectors and wish to take their dot product; this is simple,

import numpy as np

a = np.random.rand(3)
b = np.random.rand(3)

result = np.dot(a,b)

If I have stacks of vectors and I want each pair dotted, the most naive code is

# 5 = number of vectors
a = np.random.rand(5,3)
b = np.random.rand(5,3)
result = [np.dot(aa,bb) for aa, bb in zip(a,b)]

Two ways to batch this computation are an elementwise multiply followed by a sum, and einsum,

result = np.sum(a*b, axis=1)

# or
result = np.einsum('ij,ij->i', a, b)

However, neither of these dispatches to the BLAS backend, and so both use only a single core. This is not super great when N is very large, say 1 million.
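As a quick, illustrative sanity check, all three forms agree on the same inputs:

import numpy as np

a = np.random.rand(5,3)
b = np.random.rand(5,3)

naive = np.array([np.dot(aa, bb) for aa, bb in zip(a, b)])
assert np.allclose(naive, np.sum(a*b, axis=1))
assert np.allclose(naive, np.einsum('ij,ij->i', a, b))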

tensordot does dispatch to the BLAS backend. A terrible way to do this computation with tensordot is

np.diag(np.tensordot(a, b, axes=[1,1]))

This is terrible because it allocates an N*N matrix of which only the diagonal is needed; the off-diagonal elements are wasted work. For N of 1 million, the intermediate alone would hold 10^12 float64 values, i.e. about 8 TB.

Another (brilliantly fast) approach is the hidden inner1d function

from numpy.core.umath_tests import inner1d

result = inner1d(a,b)

but it seems this isn't going to be viable, since the issue that might export it publicly has gone stale. And this still boils down to writing the loop in C, instead of using multiple cores.

Is there a way to get dot, matmul, or tensordot to do all these dot products at once, on multiple cores?
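For what it's worth, matmul's stacking semantics can at least express the whole computation in one call (a sketch; as far as I can tell NumPy still loops over the stack in C rather than using multiple cores):

# Each row becomes a 1x3 @ 3x1 matrix product; the (n,1,1) result is flattened
result = np.matmul(a[:, None, :], b[:, :, None]).ravel()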

First of all, there is no direct BLAS function to do that. Using many level-1 BLAS calls is not very efficient, since using multiple threads for a very short computation tends to introduce a pretty big overhead, while not using multiple threads may be sub-optimal. Still, such a computation is mainly memory-bound, so it scales poorly on platforms with many cores (a few cores are often enough to saturate the memory bandwidth).
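To illustrate the approach being ruled out, here is a sketch using SciPy's low-level BLAS wrappers; each ddot call touches only 3 elements, so per-call overhead dominates and BLAS never gets a chance to use multiple threads:

import numpy as np
from scipy.linalg.blas import ddot

a = np.random.rand(1000000, 3)
b = np.random.rand(1000000, 3)

# One level-1 BLAS call per row: correct, but dominated by call overhead
result = np.array([ddot(a[i], b[i]) for i in range(a.shape[0])])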

One simple solution is to use the Numexpr package, which should do this quite efficiently (it should avoid the creation of temporary arrays and should also use multiple threads). However, its performance is somewhat disappointing for big arrays in this case.

The best solution appears to be Numba (or Cython). Numba can generate fast code for both small and big input arrays, and it is easy to parallelize the code. Please note, however, that managing threads introduces an overhead that can be quite big for small arrays (up to a few ms on some many-core platforms).

Here is a Numexpr implementation:

import numexpr as ne

# Compile the expression once, then evaluate it on the input arrays
expr = ne.NumExpr('sum(a * b, axis=1)')
result = expr.run(a, b)
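Equivalently, the one-shot evaluate API compiles and caches the expression internally (a small usage note; evaluate picks up a and b from the calling frame):

result = ne.evaluate('sum(a * b, axis=1)')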

Here is a (sequential) Numba implementation:

import numba as nb
import numpy as np

# Use `parallel=True` for a parallel implementation
@nb.njit('float64[:](float64[:,::1], float64[:,::1])')
def multiDots(a, b):
    assert a.shape == b.shape
    n, m = a.shape
    res = np.empty(n, dtype=np.float64)

    # Use `nb.prange` instead of `range` to run the loop in parallel
    for i in range(n):
        s = 0.0
        for j in range(m):
            s += a[i,j] * b[i,j]
        res[i] = s

    return res

result = multiDots(a, b)

Here are some benchmarks on an (old) 2-core machine:

On small 5x3 arrays:
    np.einsum('ij,ij->i', a, b, optimize=True):  45.2 us
    Numba (parallel):                            12.1 us
    np.sum(a*b, axis=1):                          9.5 us
    np.einsum('ij,ij->i', a, b):                  6.5 us
    Numexpr:                                      3.2 us
    Numba (sequential):                           1.3 us

On big 1000000x3 arrays:
    np.sum(a*b, axis=1):                         27.8 ms
    Numexpr:                                     15.3 ms
    np.einsum('ij,ij->i', a, b, optimize=True):   9.0 ms
    np.einsum('ij,ij->i', a, b):                  8.8 ms
    Numba (sequential):                           6.8 ms
    Numba (parallel):                             5.3 ms

The sequential Numba implementation gives a good trade-off. If you really want the best performance, you can use a switch that picks the sequential or parallel implementation based on the input size, as sketched below. Choosing the best n threshold in a platform-independent way is not so easy though.
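Here is a minimal sketch of such a switch, reusing multiDots from above plus a parallel variant. The N_SWITCH threshold is a made-up value that would need tuning per machine:

import numba as nb
import numpy as np

# Made-up threshold: below it, thread management costs more than it saves
N_SWITCH = 100_000

@nb.njit('float64[:](float64[:,::1], float64[:,::1])', parallel=True)
def multiDotsParallel(a, b):
    assert a.shape == b.shape
    n, m = a.shape
    res = np.empty(n, dtype=np.float64)

    # nb.prange splits the outer loop across the available cores
    for i in nb.prange(n):
        s = 0.0
        for j in range(m):
            s += a[i,j] * b[i,j]
        res[i] = s

    return res

def batchDot(a, b):
    # Dispatch to the parallel kernel only when the input is big enough
    # to amortize the thread-management overhead
    if a.shape[0] >= N_SWITCH:
        return multiDotsParallel(a, b)
    return multiDots(a, b)

result = batchDot(a, b)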
