NUMBA CUDA slower than parallel CPU even for giant matrices

There are only a few examples online of using CUDA through numba, and I find them all to be slower than the parallel CPU method. Vectorize with the CUDA target and stencils are even worse, so I tried to create a custom kernel. The one blog post you find everywhere is https://gist.github.com/mrocklin/9272bf84a8faffdbbe2cd44b4bc4ce3c. This example is a simple blur filter:

import numpy as np
import time
import math  # needed for math.ceil below
from numba import njit, prange, cuda
import numba.cuda


@numba.cuda.jit
def smooth_gpu(x, out):
    i, j = cuda.grid(2)
    n, m = x.shape

    if 1 <= i < n - 1 and 1 <= j < m - 1:
        out[i, j] = (x[i - 1, j - 1] + x[i - 1, j] + x[i - 1, j + 1] +
                    x[i    , j - 1] + x[i    , j] + x[i    , j + 1] +
                    x[i + 1, j - 1] + x[i + 1, j] + x[i + 1, j + 1]) / 9

x_gpu = np.ones((10000, 10000), dtype='float32')
out_gpu = np.zeros((10000, 10000), dtype='float32')

threadsperblock = (16, 16)
blockspergrid_x = math.ceil(x_gpu.shape[0] / threadsperblock[0])
blockspergrid_y = math.ceil(x_gpu.shape[1] / threadsperblock[1])
blockspergrid = (blockspergrid_x, blockspergrid_y)

# run on gpu
smooth_gpu[blockspergrid, threadsperblock](x_gpu, out_gpu) # compile before measuring time
start_time = time.time()
smooth_gpu[blockspergrid, threadsperblock](x_gpu, out_gpu)
print("GPU Time: {0:1.6f}s ".format(time.time() - start_time))

and the CPU version:

x_cpu = np.ones((10000, 10000), dtype='float32')
out_cpu = np.zeros((10000, 10000), dtype='float32')


@njit(parallel=True)  # njit already implies nopython=True
def smooth_cpu(x, out_cpu):

    for i in prange(1,np.shape(x)[0]-1):
        for j in range(1,np.shape(x)[1]-1):
            out_cpu[i, j] =  (x[i - 1, j - 1] + x[i - 1, j] + x[i - 1, j + 1] + x[i    , j - 1] + x[i    , j] + x[i    , j + 1] +x[i + 1, j - 1] + x[i + 1, j] + x[i + 1, j + 1]) / 9

# run on cpu
smooth_cpu(x_cpu, out_cpu) # compile before measuring time
start_time = time.time()
smooth_cpu(x_cpu, out_cpu)
print("CPU Time: {0:1.6f}s ".format(time.time() - start_time))

I get ~500 ms for the GPU version and 50 ms for the CPU one. What is going on?
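For reference, a numba @stencil version of the same blur looks roughly like this (a sketch of my own; it may not be exactly the code behind the "even worse" timing mentioned above, and boundary cells are left at numba's default of zero):

from numba import njit, stencil

@stencil
def _mean9(x):
    # numba stencils use relative indexing; x[0, 0] is the centre element
    return (x[-1, -1] + x[-1, 0] + x[-1, 1] +
            x[0, -1]  + x[0, 0]  + x[0, 1]  +
            x[1, -1]  + x[1, 0]  + x[1, 1]) / 9

@njit(parallel=True)
def smooth_stencil(x):
    return _mean9(x)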

There are two things I would point out:

  1. You are including in your timing of the GPU version the time it takes to transfer the input array from host to device, and the results from device to host. If this is the intent of your comparison, then so be it; the conclusion is that the GPU is not suited to this task (in an interesting way). (See the sketch after this list for how to time those pieces separately.)

  2. The GPU code, while giving correct results, is not organized for good performance. The issue lies here:

     i, j = cuda.grid(2)

    coupled with the order in which those indices are being used to access data:

     out[i, j] = (x[i - 1, j - 1]...

    this results in inefficient (uncoalesced) access on the GPU. We can fix that by reversing one of the two orders depicted above.
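To see point 1 directly, here is a minimal sketch (not part of the original code) that times the host-to-device copy, the kernel, and the device-to-host copy separately. It assumes the smooth_gpu kernel and the blockspergrid/threadsperblock configuration from the question, with the kernel already compiled by a prior call:

import time
import numpy as np
from numba import cuda

x = np.ones((10000, 10000), dtype='float32')

t0 = time.time()
x_gpu = cuda.to_device(x)                  # host -> device copy
out_gpu = cuda.device_array_like(x_gpu)
cuda.synchronize()
t1 = time.time()
smooth_gpu[blockspergrid, threadsperblock](x_gpu, out_gpu)
cuda.synchronize()                         # kernel launches are asynchronous
t2 = time.time()
out = out_gpu.copy_to_host()               # device -> host copy
t3 = time.time()
print("H2D: {0:1.6f}s  kernel: {1:1.6f}s  D2H: {2:1.6f}s".format(t1 - t0, t2 - t1, t3 - t2))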

Here are your codes tweaked slightly, taking into account both issues above:

$ cat t29a.py
import numpy as np
import time
from numba import njit, prange,cuda
import timeit
import numba.cuda


x_cpu = np.ones((10000, 10000), dtype='float32')
out_cpu = np.zeros((10000, 10000), dtype='float32')


@njit(parallel=True)
def smooth_cpu(x, out_cpu):

    for i in prange(1,x.shape[0]-1):
        for j in range(1,x.shape[1]-1):
            out_cpu[i, j] =  (x[i - 1, j - 1] + x[i - 1, j] + x[i - 1, j + 1] + x[i    , j - 1] + x[i    , j] + x[i    , j + 1] +x[i + 1, j - 1] + x[i + 1, j] + x[i + 1, j + 1]) / 9

# run on cpu
smooth_cpu(x_cpu, out_cpu) # compile before measuring time
start_time = time.time()
smooth_cpu(x_cpu, out_cpu)
print("CPU Time: {0:1.6f}s ".format(time.time() - start_time))
$ python t29a.py
CPU Time: 0.161944s

$ cat t29.py
import numpy as np
import time
from numba import njit, prange,cuda
import timeit
import numba.cuda
import math

@numba.cuda.jit
def smooth_gpu(x, out):
    j, i = cuda.grid(2)
    m, n = x.shape

    if 1 <= i < n - 1 and 1 <= j < m - 1:
        out[i, j] = (x[i - 1, j - 1] + x[i - 1, j] + x[i - 1, j + 1] +
                    x[i    , j - 1] + x[i    , j] + x[i    , j + 1] +
                    x[i + 1, j - 1] + x[i + 1, j] + x[i + 1, j + 1]) / 9

x = np.ones((10000, 10000), dtype='float32')
out = np.zeros((10000, 10000), dtype='float32')
x_gpu = cuda.to_device(x)
out_gpu = cuda.device_array_like(out)
threadsperblock = (16, 16)
blockspergrid_x = math.ceil(x_gpu.shape[0] / threadsperblock[0])
blockspergrid_y = math.ceil(x_gpu.shape[1] / threadsperblock[1])
blockspergrid = (blockspergrid_x, blockspergrid_y)

# run on gpu
smooth_gpu[blockspergrid, threadsperblock](x_gpu, out_gpu) # compile before measuring time
cuda.synchronize()
start_time = time.time()
smooth_gpu[blockspergrid, threadsperblock](x_gpu, out_gpu)
cuda.synchronize()
print("GPU Time: {0:1.6f}s ".format(time.time() - start_time))
$ python t29.py
GPU Time: 0.021776s
$

So we see that if we adjust for both issues indicated, the GPU (a GTX 960 in my case) is about 8x faster than the CPU. Such measurements depend to some degree on the CPU and GPU used for comparison -- you shouldn't assume my measurements are comparable to yours; it's better for you to run these modified codes for comparison. However, the data transfer time certainly exceeds the GPU computation time by a sizeable margin, and in my case it also exceeds the CPU computation time. This means (in my case at least -- not a particularly fast system in any respect) that even if we reduced the GPU computation time to zero, the cost to transfer the data would still exceed the CPU computation time.

Therefore, when you run into such a situation, it's impossible to win. About the only advice that can be given is "don't do that", i.e. find a more interesting and more complex problem for the GPU to solve. If we make the problem computationally very simple, such as this one, or such as vector add, and that is the only thing you want to do on the GPU, it is almost never an interesting comparison with doing it on the CPU. Hopefully you can see that making the matrix bigger doesn't help much here, because it also impacts the data transfer time/cost.
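As a rough back-of-the-envelope check (my numbers, and the transfer rate is an assumption -- a typical effective PCIe 3.0 rate of ~6 GB/s): a 10000x10000 float32 array is 10000 x 10000 x 4 bytes ≈ 400 MB, so copying the input to the device and the result back moves ~800 MB, or on the order of 130 ms of transfer time alone. That is several times the ~22 ms kernel time measured above, and in the same ballpark as the ~162 ms CPU computation time.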

If we factor out the data transfer cost (and don't make performance-crippling mistakes in our GPU code), then according to my testing the GPU is faster than the CPU. If we include the data transfer cost, for this very simple problem, it's quite possible that there is no way for the GPU to be faster than the CPU (even if the GPU computation time were reduced to zero).

There is no doubt that more could be done to slightly improve the GPU case (e.g. change the block shape, use shared memory, etc.), but I personally don't wish to spend my time polishing uninteresting things.
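That said, for completeness, here is a rough sketch (mine, not from the original answer) of what the shared-memory variant could look like: each 16x16 block stages its tile of x plus a one-element halo into on-chip shared memory, so the nine reads per output element come from shared rather than global memory.

import math
import numpy as np
from numba import cuda, float32

TPB = 16  # must match the (TPB, TPB) launch configuration below

@cuda.jit
def smooth_gpu_shared(x, out):
    tile = cuda.shared.array((TPB + 2, TPB + 2), dtype=float32)
    j, i = cuda.grid(2)            # keep the coalescing-friendly index order
    tj = cuda.threadIdx.x + 1      # position inside the tile (+1 for the halo)
    ti = cuda.threadIdx.y + 1
    m, n = x.shape

    if i < m and j < n:
        tile[ti, tj] = x[i, j]
        # threads on the block edges also load the halo cells
        if ti == 1 and i > 0:
            tile[0, tj] = x[i - 1, j]
        if ti == TPB and i < m - 1:
            tile[TPB + 1, tj] = x[i + 1, j]
        if tj == 1 and j > 0:
            tile[ti, 0] = x[i, j - 1]
        if tj == TPB and j < n - 1:
            tile[ti, TPB + 1] = x[i, j + 1]
        # corner threads load the diagonal halo cells
        if ti == 1 and tj == 1 and i > 0 and j > 0:
            tile[0, 0] = x[i - 1, j - 1]
        if ti == 1 and tj == TPB and i > 0 and j < n - 1:
            tile[0, TPB + 1] = x[i - 1, j + 1]
        if ti == TPB and tj == 1 and i < m - 1 and j > 0:
            tile[TPB + 1, 0] = x[i + 1, j - 1]
        if ti == TPB and tj == TPB and i < m - 1 and j < n - 1:
            tile[TPB + 1, TPB + 1] = x[i + 1, j + 1]

    cuda.syncthreads()             # all loads must finish before any reads

    if 1 <= i < m - 1 and 1 <= j < n - 1:
        out[i, j] = (tile[ti - 1, tj - 1] + tile[ti - 1, tj] + tile[ti - 1, tj + 1] +
                     tile[ti,     tj - 1] + tile[ti,     tj] + tile[ti,     tj + 1] +
                     tile[ti + 1, tj - 1] + tile[ti + 1, tj] + tile[ti + 1, tj + 1]) / 9

x_gpu = cuda.to_device(np.ones((10000, 10000), dtype='float32'))
out_gpu = cuda.device_array_like(x_gpu)
# grid x covers columns (j), grid y covers rows (i)
blocks = (math.ceil(x_gpu.shape[1] / TPB), math.ceil(x_gpu.shape[0] / TPB))
smooth_gpu_shared[blocks, (TPB, TPB)](x_gpu, out_gpu)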

You can get additional description of Numba GPU memory management here.

A general description of the memory efficiency issue related to index ordering is here.

I found this comparison interesting, and wanted to look into the impact of reusing compiled kernels, cuda streams, and randomized data, to ensure no fancy compiler optimizations were skewing what we saw.

I modified the code sample posted by Robert Crovella and ran the script on a modest ML rig at school:

Code

import numpy as np
from time import perf_counter
from numba import njit, prange,cuda

# cpuinfo is a third party package from here:
#   https://github.com/workhorsy/py-cpuinfo
# or you can just install it using pip with:
#   python -m pip install -U py-cpuinfo
from cpuinfo import get_cpu_info

print("Some diagnostic info for the system running this script:")
# prints information about the cuda GPU
cuda.detect()
print()
# Prints a json string describing the cpu
s = get_cpu_info()
print("Cpu info")
for k,v in s.items():
    print(f"\t{k}: {v}")
print()

cpu_s1 = "CPU execution time:"
cpu_s2 = "CPU full setup/execution time:"
gpu_s1 = "GPU kernel execution time:"
gpu_s2 = "GPU full kernel setup/execution time:"
l = len(gpu_s2) + 1
# using randomized floats to ensure there isn't some compiler optimization that
# recognizes that all values of the x array are constant 1's and does something
# goofy under the hood. Each timing scenario will then use a copy of this array.
common_x = np.random.random((10000, 10000)).astype(np.float32)

def time_njit(n_loops=2):
    start_time_full_function = perf_counter()

    @njit(parallel=True,nogil=True)
    def smooth_cpu(x, out):
        h,w = x.shape
        for i in prange(1,h-1):
            for j in range(1,w-1):
                out[i, j] =  (x[i - 1, j - 1] + x[i - 1, j] +
                                  x[i - 1, j + 1] + x[i    , j - 1] +
                                  x[i    , j]     + x[i    , j + 1] +
                                  x[i + 1, j - 1] + x[i + 1, j] +
                                  x[i + 1, j + 1]) / 9


    pre_x = np.ones((10,10),dtype=common_x.dtype)
    pre_out = np.ones((10,10),dtype=common_x.dtype)
    _x = common_x.copy()
    _out = np.zeros_like(_x)
    # run on cpu
    smooth_cpu(pre_x, pre_out) # compile before measuring time
    start_time = perf_counter()
    for _ in range(n_loops):
        # realistically, we wouldn't typically run just a single blurring pass
        smooth_cpu(_x, _out)
        smooth_cpu(_out,_x)
    end_time = perf_counter()
    end_time_full_function = perf_counter()
    print(f"{cpu_s1:<{l}} {end_time - start_time:1.6f}s running {n_loops} loops"
          f"\n{cpu_s2:<{l}} {end_time_full_function - start_time_full_function:1.6f}s")
    return _x



def time_cuda(n_loops=2):
    """There is room for optimization in how we use cuda.shared.array memory on the GPU
    -- where I'm not aware of any analogous tricks for the cpu function -- that would
    allow us to minimize the number of times each thread-block needs to access data in
    the GPU's global memory. But such an implementation would take us deeper into the
    weeds than this toy problem calls for.

    Maybe if I need to take a break from my other work later I'll come back to this
    and flesh out an example of what I mean.
    """
    start_time_full_function = perf_counter()
    @cuda.jit
    def smooth_gpu(x, out):
        """slight change to the cuda kernel. This version uses **striding** to reduce
        processor overhead spent allocating and deallocating a lot of thread blocks
        that ultimately have each thread compute a single calculation before being
        disposed of.

        This way we offset some of the overhead cost spent on block allocation by
         making each block do a bit more work.

        Note: For this to work right, we have to allocate fewer blocks with
              our `blockspergrid_j` and `blockspergrid_i` variables.
        """
        jstart, istart = cuda.grid(2)
        jstep, istep = cuda.gridsize(2)
        rows,cols = x.shape
        # note that for strided kernels, thread indices
        # are completely independent of the data size/shape
        for i in range(istart+1,rows-1,istep):
            for j in range(jstart+1,cols-1,jstep):
                # Because x and out use numpy's default row-major (C) memory
                # ordering, we want to make sure the most frequently changing
                # index (j) iterates through the last dimension of the array.
                out[i,j] = (x[i - 1, j - 1] + x[i - 1, j] + x[i - 1, j + 1] +
                            x[i    , j - 1] + x[i    , j] + x[i    , j + 1] +
                            x[i + 1, j - 1] + x[i + 1, j] + x[i + 1, j + 1]) / 9

    _x = common_x.copy()
    _out = np.zeros_like(_x)
    stream = cuda.stream()
    x_gpu = cuda.to_device(_x,stream)
    out_gpu = cuda.to_device(_out,stream)
    tpbj = 16
    tpbi = 16
    threadsperblock = tpbj,tpbi
    blockspergrid_j = (_x.shape[0]+tpbj-1) // tpbj
    blockspergrid_i = (_x.shape[1]+tpbi-1) // tpbi
    # reduce the number of blocks in each axis
    # by a quarter to give room for striding
    blockspergrid = (blockspergrid_j//4, blockspergrid_i//4)
    # run on gpu
    compiled = smooth_gpu[blockspergrid, threadsperblock, stream] # configure the launch (numba compiles the kernel on its first call)

    start_time = perf_counter()
    for _ in range(n_loops):
        # realistically, we wouldn't typically run just a single blurring pass
        compiled(x_gpu, out_gpu)
        compiled(out_gpu,x_gpu)
    x_gpu.copy_to_host(_out,stream)
    stream.synchronize()
    end_time = perf_counter()
    end_time_full_function = perf_counter()
    print(f"{gpu_s1:<{l}} {end_time-start_time:1.6f}s running {n_loops} loops"
          f"\n{gpu_s2:<{l}} {end_time_full_function-start_time_full_function:1.6f}s")
    return _out

if __name__ == '__main__':
    print(f"{'':16}Time comparisons for CPU vs GPU implementations:\n")
    # run both versions over an increasing number of blur passes and check
    # that they produce the same results each time
    for n_loops in (1, 5, 10, 20, 30, 40, 50):
        a = time_njit(n_loops)
        b = time_cuda(n_loops)
        assert np.allclose(a, b), "The two functions didn't actually compute the same results"
        print(f"{'    '*4}Outputs are equivalent")

Output:

Some diagnostic info for the system running this script:

Found 1 CUDA devices
id 0    b'GeForce RTX 2080 Ti'                              [SUPPORTED]
                      compute capability: 7.5
                           pci device id: 0
                              pci bus id: 1
Summary:
    1/1 devices are supported

Cpu info:
    python_version: 3.8.8.final.0 (64 bit)
    cpuinfo_version: [7, 0, 0]
    cpuinfo_version_string: 7.0.0
    arch: X86_64
    bits: 64
    count: 8
    arch_string_raw: AMD64
    vendor_id_raw: GenuineIntel
    brand_raw: Intel(R) Core(TM) i7-4790K CPU @ 4.00GHz
    hz_advertised_friendly: 4.0000 GHz
    hz_actual_friendly: 4.0010 GHz
    hz_advertised: [4000000000, 0]
    hz_actual: [4001000000, 0]
    l2_cache_size: 1048576
    stepping: 3
    model: 60
    family: 6
    l3_cache_size: 8388608
    flags: ['3dnow', 'abm', 'acpi', 'aes', 'apic', 'avx', 'avx2', 'bmi1', 'bmi2', 'clflush', 'cmov', 'cx16', 'cx8', 'de', 'dts', 'erms', 'est', 'f16c', 'fma', 'fpu', 'fxsr', 'ht', 'hypervisor', 'ia64', 'invpcid', 'lahf_lm', 'mca', 'mce', 'mmx', 'movbe', 'msr', 'mtrr', 'osxsave', 'pae', 'pat', 'pbe', 'pcid', 'pclmulqdq', 'pdcm', 'pge', 'pni', 'popcnt', 'pse', 'pse36', 'rdrnd', 'sep', 'serial', 'smep', 'ss', 'sse', 'sse2', 'sse4_1', 'sse4_2', 'ssse3', 'tm', 'tm2', 'tsc', 'vme', 'xsave', 'xtpr']
    l2_cache_line_size: 256
    l2_cache_associativity: 6

                Time comparisons for CPU vs GPU implementations:

CPU execution time:                    0.327143s running 1 loops
CPU full setup/execution time:         0.980959s
GPU kernel execution time:             0.088015s running 1 loops
GPU full kernel setup/execution time:  0.868085s
                Outputs are equivalent
CPU execution time:                    1.539007s running 5 loops
CPU full setup/execution time:         2.134781s
GPU kernel execution time:             0.097627s running 5 loops
GPU full kernel setup/execution time:  0.695104s
                Outputs are equivalent
CPU execution time:                    3.463488s running 10 loops
CPU full setup/execution time:         4.310506s
GPU kernel execution time:             0.122363s running 10 loops
GPU full kernel setup/execution time:  0.655500s
                Outputs are equivalent
CPU execution time:                    6.416840s running 20 loops
CPU full setup/execution time:         7.011254s
GPU kernel execution time:             0.158903s running 20 loops
GPU full kernel setup/execution time:  0.723226s
                Outputs are equivalent
CPU execution time:                    9.285086s running 30 loops
CPU full setup/execution time:         9.890282s
GPU kernel execution time:             0.209807s running 30 loops
GPU full kernel setup/execution time:  0.728618s
                Outputs are equivalent
CPU execution time:                    12.610949s running 40 loops
CPU full setup/execution time:         13.177427s
GPU kernel execution time:             0.253696s running 40 loops
GPU full kernel setup/execution time:  0.836536s
                Outputs are equivalent
CPU execution time:                    15.376767s running 50 loops
CPU full setup/execution time:         15.976361s
GPU kernel execution time:             0.289626s running 50 loops
GPU full kernel setup/execution time:  0.841918s
                Outputs are equivalent

Process finished with exit code 0

If I'm being honest, these results both agree and disagree with my expectations. I had expected at least the single-loop function calls to show the CPU implementation outperforming the GPU, but they don't. :v/ Though the seemingly linear increase in time cost for the CPU as the number of loops increased was expected.

As for the GPU performance, I really don't know why the time cost for increasing loop counts seems to grow logarithmically (I would have to plot the data points to see it more clearly). A sketch for doing that follows.
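A quick sketch (my addition) for plotting the loop counts against the times printed above; for what it's worth, a fixed-overhead-plus-linear model (roughly 0.085 s plus ~0.004 s per pass) seems to fit the GPU numbers fairly well, which would look sub-linear next to the CPU curve without actually being logarithmic:

import matplotlib.pyplot as plt

# timings transcribed from the output above
loops = [1, 5, 10, 20, 30, 40, 50]
cpu_t = [0.327, 1.539, 3.463, 6.417, 9.285, 12.611, 15.377]
gpu_t = [0.088, 0.098, 0.122, 0.159, 0.210, 0.254, 0.290]

plt.plot(loops, cpu_t, marker='o', label='CPU execution')
plt.plot(loops, gpu_t, marker='o', label='GPU kernel execution')
plt.xlabel('blur passes (n_loops)')
plt.ylabel('seconds')
plt.legend()
plt.show()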

Regardless, the results you see will vary according to your machine, but I would be curious at what cuda compute level the GPU results fall to match those of a CPU.
