
NUMBA CUDA slower than parallel CPU even for giant matrices

There are only a few examples online of using CUDA through Numba, and I find them all to be slower than the parallel CPU method. Vectorize with the CUDA target and stencils are even worse, so I tried to create a custom kernel. The one blog post you find everywhere is https://gist.github.com/mrocklin/9272bf84a8faffdbbe2cd44b4bc4ce3c . This example is a simple blur filter:

import numpy as np
import time
from numba import njit, prange, cuda
import timeit
import numba.cuda
import math  # needed for math.ceil below


@numba.cuda.jit
def smooth_gpu(x, out):
    i, j = cuda.grid(2)
    n, m = x.shape

    if 1 <= i < n - 1 and 1 <= j < m - 1:
        out[i, j] = (x[i - 1, j - 1] + x[i - 1, j] + x[i - 1, j + 1] +
                    x[i    , j - 1] + x[i    , j] + x[i    , j + 1] +
                    x[i + 1, j - 1] + x[i + 1, j] + x[i + 1, j + 1]) / 9

x_gpu = np.ones((10000, 10000), dtype='float32')
out_gpu = np.zeros((10000, 10000), dtype='float32')

threadsperblock = (16, 16)
blockspergrid_x = math.ceil(x_gpu.shape[0] / threadsperblock[0])
blockspergrid_y = math.ceil(x_gpu.shape[1] / threadsperblock[1])
blockspergrid = (blockspergrid_x, blockspergrid_y)

# run on gpu
smooth_gpu[blockspergrid, threadsperblock](x_gpu, out_gpu) # compile before measuring time
start_time = time.time()
smooth_gpu[blockspergrid, threadsperblock](x_gpu, out_gpu)
print("GPU Time: {0:1.6f}s ".format(time.time() - start_time))

and the CPU version:

x_cpu = np.ones((10000, 10000), dtype='float32')
out_cpu = np.zeros((10000, 10000), dtype='float32')


@njit(parallel=True)
def smooth_cpu(x, out_cpu):

    for i in prange(1,np.shape(x)[0]-1):
        for j in range(1,np.shape(x)[1]-1):
            out_cpu[i, j] =  (x[i - 1, j - 1] + x[i - 1, j] + x[i - 1, j + 1] + x[i    , j - 1] + x[i    , j] + x[i    , j + 1] +x[i + 1, j - 1] + x[i + 1, j] + x[i + 1, j + 1]) / 9

# run on cpu
smooth_cpu(x_cpu, out_cpu) # compile before measuring time
start_time = time.time()
smooth_cpu(x_cpu, out_cpu)
print("CPU Time: {0:1.6f}s ".format(time.time() - start_time))

I get ~500 ms for the GPU version and 50 ms for the CPU one. What is going on?

There are 2 things I would point out:

  1. You are including in your timing of the GPU version the time it takes to transfer the input array from host to device, and the results from device to host. If that is the intent of your comparison, then so be it; the conclusion is that the GPU is not suited to this task (in an interesting way). (A short sketch for timing the transfers by themselves follows this list.)

  2. The GPU code, while giving correct results, is not organized for good performance. The issue lies here:

     i, j = cuda.grid(2)

    coupled with the order those indices are being used to access data:

     out[i, j] = (x[i - 1, j - 1]...

    this results in inefficient (uncoalesced) memory access on the GPU. We can fix that by reversing one of the two orderings depicted above.
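
As a side note on point 1, the transfer cost can be measured on its own with a minimal sketch like the following, which just moves the same 10000x10000 float32 array to the device and back using numba.cuda's explicit copy API (the numbers depend entirely on your PCIe link):

import time
import numpy as np
from numba import cuda

x = np.ones((10000, 10000), dtype='float32')   # ~400 MB of float32 data

t0 = time.time()
d_x = cuda.to_device(x)          # host -> device copy of the input
cuda.synchronize()
t1 = time.time()
h_x = d_x.copy_to_host()         # device -> host copy of the same amount of data
cuda.synchronize()
t2 = time.time()
print("H2D: {0:1.6f}s  D2H: {1:1.6f}s".format(t1 - t0, t2 - t1))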

Here are your codes, tweaked slightly to take both issues above into account:

$ cat t29a.py
import numpy as np
import time
from numba import njit, prange,cuda
import timeit
import numba.cuda


x_cpu = np.ones((10000, 10000), dtype='float32')
out_cpu = np.zeros((10000, 10000), dtype='float32')


@njit(parallel=True)
def smooth_cpu(x, out_cpu):

    for i in prange(1,x.shape[0]-1):
        for j in range(1,x.shape[1]-1):
            out_cpu[i, j] =  (x[i - 1, j - 1] + x[i - 1, j] + x[i - 1, j + 1] + x[i    , j - 1] + x[i    , j] + x[i    , j + 1] +x[i + 1, j - 1] + x[i + 1, j] + x[i + 1, j + 1]) / 9

# run on cpu
smooth_cpu(x_cpu, out_cpu) # compile before measuring time
start_time = time.time()
smooth_cpu(x_cpu, out_cpu)
print("CPU Time: {0:1.6f}s ".format(time.time() - start_time))
$ python t29a.py
CPU Time: 0.161944s

$ cat t29.py
import numpy as np
import time
from numba import njit, prange,cuda
import timeit
import numba.cuda
import math

@numba.cuda.jit
def smooth_gpu(x, out):
    j, i = cuda.grid(2)
    m, n = x.shape

    if 1 <= i < n - 1 and 1 <= j < m - 1:
        out[i, j] = (x[i - 1, j - 1] + x[i - 1, j] + x[i - 1, j + 1] +
                    x[i    , j - 1] + x[i    , j] + x[i    , j + 1] +
                    x[i + 1, j - 1] + x[i + 1, j] + x[i + 1, j + 1]) / 9

x = np.ones((10000, 10000), dtype='float32')
out = np.zeros((10000, 10000), dtype='float32')
x_gpu = cuda.to_device(x)
out_gpu = cuda.device_array_like(out)
threadsperblock = (16, 16)
blockspergrid_x = math.ceil(x_gpu.shape[0] / threadsperblock[0])
blockspergrid_y = math.ceil(x_gpu.shape[1] / threadsperblock[1])
blockspergrid = (blockspergrid_x, blockspergrid_y)

# run on gpu
smooth_gpu[blockspergrid, threadsperblock](x_gpu, out_gpu) # compile before measuring time
cuda.synchronize()
start_time = time.time()
smooth_gpu[blockspergrid, threadsperblock](x_gpu, out_gpu)
cuda.synchronize()
print("GPU Time: {0:1.6f}s ".format(time.time() - start_time))
$ python t29.py
GPU Time: 0.021776s
$

So we see that if we adjust for both issues indicated, the GPU (a GTX 960 in my case) is about 8x faster than the CPU. Such measurements depend to some degree on the CPU and GPU used for comparison -- you shouldn't assume my measurements are comparable to yours -- it's better for you to run these modified codes yourself. However, the data transfer time certainly exceeds the GPU computation time by a sizeable margin, and in my case it also exceeds the CPU computation time. This means (in my case at least, not a particularly fast system in any respect) that even if we reduced the GPU computation time to zero, the cost to transfer the data would still exceed the CPU computation time.
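
To put rough numbers on that: each 10000x10000 float32 array is 400 MB, so moving the input to the device and the result back is about 0.8 GB of PCIe traffic. Assuming a few GB/s of effective bandwidth (an assumption -- measure your own link), that traffic alone costs on the order of a hundred milliseconds or more:

bytes_per_array = 10000 * 10000 * 4      # float32
total_bytes = 2 * bytes_per_array        # input to device + result back to host
for bw in (3e9, 6e9, 12e9):              # assumed effective bandwidths, bytes/s
    print("at {0:2.0f} GB/s: {1:1.6f}s just for transfers".format(bw / 1e9, total_bytes / bw))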

Therefore, when you run into such a situation, it's impossible to win. About the only advice that can be given then is "don't do that", i.e. find a more interesting and more complex problem for the GPU to solve. If we make the problem computationally very simple, such as this one or a vector add, and that is the only thing you want to do on the GPU, it is almost never an interesting comparison to doing it on the CPU. Hopefully you can see that making the matrix bigger doesn't help much here, because it also increases the data transfer time/cost.

If we factor out the data transfer cost (and don't make performance-crippling mistakes in our GPU code), then according to my testing the GPU is faster than the CPU. If we include the data transfer cost, for this very simple problem it's quite possible that there is no way the GPU can be faster than the CPU (even if the GPU computation time were reduced to zero).

There is no doubt that more could be done to slightly improve the GPU case (e.g. change the block shape, use shared memory, etc.), but I personally don't wish to spend my time polishing uninteresting things.
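
For the curious, a tiled shared-memory variant of the kernel might look something like the sketch below (untuned, and not benchmarked as part of the measurements above; TPB, TILE and smooth_gpu_shared are names introduced just for this sketch). Each block stages a (16+2)x(16+2) tile of the input, including a one-element halo, in shared memory, so each input element is fetched from global memory roughly once per block instead of up to nine times:

import math
import numpy as np
from numba import cuda, float32

TPB = 16          # threads per block in each dimension
TILE = TPB + 2    # tile width including a one-element halo on each side

@cuda.jit
def smooth_gpu_shared(x, out):
    # shared staging buffer for this block's tile plus halo
    tile = cuda.shared.array(shape=(TILE, TILE), dtype=float32)

    # as in the fixed kernel above, map the fast-moving grid x index to the
    # last array dimension (j) so that global loads are coalesced
    j, i = cuda.grid(2)
    tj = cuda.threadIdx.x
    ti = cuda.threadIdx.y
    n, m = x.shape

    # global coordinates of the tile's top-left corner (halo included)
    gi0 = cuda.blockIdx.y * TPB - 1
    gj0 = cuda.blockIdx.x * TPB - 1

    # cooperative load: the TPB*TPB threads fill the TILE*TILE buffer
    tid = ti * TPB + tj
    for idx in range(tid, TILE * TILE, TPB * TPB):
        li = idx // TILE
        lj = idx % TILE
        gi = gi0 + li
        gj = gj0 + lj
        if 0 <= gi < n and 0 <= gj < m:
            tile[li, lj] = x[gi, gj]
        else:
            tile[li, lj] = 0.0   # out-of-range halo; never read for interior points
    cuda.syncthreads()

    # x[i, j] now lives at tile[ti + 1, tj + 1]
    if 1 <= i < n - 1 and 1 <= j < m - 1:
        out[i, j] = (tile[ti    , tj] + tile[ti    , tj + 1] + tile[ti    , tj + 2] +
                     tile[ti + 1, tj] + tile[ti + 1, tj + 1] + tile[ti + 1, tj + 2] +
                     tile[ti + 2, tj] + tile[ti + 2, tj + 1] + tile[ti + 2, tj + 2]) / 9

x = np.ones((10000, 10000), dtype='float32')
out = np.zeros((10000, 10000), dtype='float32')
x_gpu = cuda.to_device(x)
out_gpu = cuda.device_array_like(out)
threadsperblock = (TPB, TPB)
# grid x covers j (columns), grid y covers i (rows)
blockspergrid = (math.ceil(x.shape[1] / TPB), math.ceil(x.shape[0] / TPB))
smooth_gpu_shared[blockspergrid, threadsperblock](x_gpu, out_gpu)
cuda.synchronize()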

You can get additional description of Numba GPU memory management here.

A general description of the memory efficiency issue related to index ordering is here.

I found this comparison interesting, and wanted to look into the impact of reusing compiled kernels, CUDA streams, and randomized data, to ensure no fancy compiler optimizations were skewing the results.

I modified the code sample posted by Robert Crovella and ran the script on a modest ML rig at school:

Code

import numpy as np
from time import perf_counter
from numba import njit, prange,cuda

# cpuinfo is a third party package from here:
#   https://github.com/workhorsy/py-cpuinfo
# or you can just install it using pip with:
#   python -m pip install -U py-cpuinfo
from cpuinfo import get_cpu_info

print("Some diagnostic info for the system running this script:")
# prints information about the cuda GPU
cuda.detect()
print()
# Prints a json string describing the cpu
s = get_cpu_info()
print("Cpu info")
for k,v in s.items():
    print(f"\t{k}: {v}")
print()

cpu_s1 = "CPU execution time:"
cpu_s2 = "CPU full setup/execution time:"
gpu_s1 = "GPU kernel execution time:"
gpu_s2 = "GPU full kernel setup/execution time:"
l = len(gpu_s2) + 1
# using randomized floats to ensure there isn't some compiler optimization that
# recognizes that all values of the x array are constant 1's and does something
# goofy under the hood. Each timing scenario will then use a copy of this array.
common_x = np.random.random((10000, 10000)).astype(np.float32)

def time_njit(n_loops=2):
    start_time_full_function = perf_counter()

    @njit(parallel=True,nogil=True)
    def smooth_cpu(x, out):
        h,w = x.shape
        for i in prange(1,h-1):
            for j in range(1,w-1):
                out[i, j] =  (x[i - 1, j - 1] + x[i - 1, j] +
                                  x[i - 1, j + 1] + x[i    , j - 1] +
                                  x[i    , j]     + x[i    , j + 1] +
                                  x[i + 1, j - 1] + x[i + 1, j] +
                                  x[i + 1, j + 1]) / 9


    pre_x = np.ones((10,10),dtype=common_x.dtype)
    pre_out = np.ones((10,10),dtype=common_x.dtype)
    _x = common_x.copy()
    _out = np.zeros_like(_x)
    # run on cpu
    smooth_cpu(pre_x, pre_out) # compile before measuring time
    start_time = perf_counter()
    for _ in range(n_loops):
        # realistically, we wouldn't typically run just a single blurring pass
        smooth_cpu(_x, _out)
        smooth_cpu(_out,_x)
    end_time = perf_counter()
    end_time_full_function = perf_counter()
    print(f"{cpu_s1:<{l}} {end_time - start_time:1.6f}s running {n_loops} loops"
          f"\n{cpu_s2:<{l}} {end_time_full_function - start_time_full_function:1.6f}s")
    return _x



def time_cuda(n_loops=2):
    """There is room for optimization in how we use cuda.shared.array memory on the GPU
    -- where I'm not aware of any analogues tricks for the cpu function -- that would
    allow us to minimize the number of times each thread-block needs to access data in
    the GPU's global memory. But such an implementation would take us deeper into the
    weeds than this toy problem calls for.

    Maybe if I need to take a break from my other work later I'll come back to this
    and flesh out an example of what I mean.
    """
    start_time_full_function = perf_counter()
    @cuda.jit
    def smooth_gpu(x, out):
        """slight change to the cuda kernel. This version uses **striding** to reduce
        processor overhead spent allocating and deallocating a lot of thread blocks
        that ultimately have each thread compute a single calculation before being
        disposed of.

        This way we offset some of the overhead cost spent on block allocation by
         making each block do a bit more work.

        Note: For this to work right, we have to allocate fewer blocks with
              our `blockspergrid_j` and `blockspergrid_i` variables.
        """
        jstart, istart = cuda.grid(2)
        jstep, istep = cuda.gridsize(2)
        rows,cols = x.shape
        # note that for strided kernels, thread indices
        # are completely independent of the data size/shape
        for i in range(istart+1,rows-1,istep):
            for j in range(jstart+1,cols-1,jstep):
                # x and out are standard NumPy arrays, i.e. row-major (C order),
                # so we want the most frequently changing index (j) to iterate
                # through the last dimension of the array (coalesced access).
                out[i,j] = (x[i - 1, j - 1] + x[i - 1, j] + x[i - 1, j + 1] +
                            x[i    , j - 1] + x[i    , j] + x[i    , j + 1] +
                            x[i + 1, j - 1] + x[i + 1, j] + x[i + 1, j + 1]) / 9

    _x = common_x.copy()
    _out = np.zeros_like(_x)
    stream = cuda.stream()
    x_gpu = cuda.to_device(_x,stream)
    out_gpu = cuda.to_device(_out,stream)
    tpbj = 16
    tpbi = 16
    threadsperblock = tpbj,tpbi
    blockspergrid_j = (_x.shape[0]+tpbj-1) // tpbj
    blockspergrid_i = (_x.shape[1]+tpbi-1) // tpbi
    # reduce the number of blocks in each axis
    # to a quarter, to give room for striding
    blockspergrid = (blockspergrid_j//4, blockspergrid_i//4)
    # run on gpu
    compiled = smooth_gpu[blockspergrid, threadsperblock, stream] # compile before measuring time

    start_time = perf_counter()
    for _ in range(n_loops):
        # realistically, we wouldn't typically run just a single blurring pass
        compiled(x_gpu, out_gpu)
        compiled(out_gpu,x_gpu)
    x_gpu.copy_to_host(_out,stream)
    stream.synchronize()
    end_time = perf_counter()
    end_time_full_function = perf_counter()
    print(f"{gpu_s1:<{l}} {end_time-start_time:1.6f}s running {n_loops} loops"
          f"\n{gpu_s2:<{l}} {end_time_full_function-start_time_full_function:1.6f}s")
    return _out

if __name__ == '__main__':
    # loop counts chosen to match the timing output shown below
    for n_loops in (1, 5, 10, 20, 30, 40, 50):
        a = time_njit(n_loops)
        b = time_cuda(n_loops)
        assert np.allclose(a, b), "The two functions didn't actually compute the same results"
        print(f"{'    '*4}Outputs are equivalent")

Output:

Some diagnostic info for the system running this script:

Found 1 CUDA devices
id 0    b'GeForce RTX 2080 Ti'                              [SUPPORTED]
                      compute capability: 7.5
                           pci device id: 0
                              pci bus id: 1
Summary:
    1/1 devices are supported

Cpu info:
    python_version: 3.8.8.final.0 (64 bit)
    cpuinfo_version: [7, 0, 0]
    cpuinfo_version_string: 7.0.0
    arch: X86_64
    bits: 64
    count: 8
    arch_string_raw: AMD64
    vendor_id_raw: GenuineIntel
    brand_raw: Intel(R) Core(TM) i7-4790K CPU @ 4.00GHz
    hz_advertised_friendly: 4.0000 GHz
    hz_actual_friendly: 4.0010 GHz
    hz_advertised: [4000000000, 0]
    hz_actual: [4001000000, 0]
    l2_cache_size: 1048576
    stepping: 3
    model: 60
    family: 6
    l3_cache_size: 8388608
    flags: ['3dnow', 'abm', 'acpi', 'aes', 'apic', 'avx', 'avx2', 'bmi1', 'bmi2', 'clflush', 'cmov', 'cx16', 'cx8', 'de', 'dts', 'erms', 'est', 'f16c', 'fma', 'fpu', 'fxsr', 'ht', 'hypervisor', 'ia64', 'invpcid', 'lahf_lm', 'mca', 'mce', 'mmx', 'movbe', 'msr', 'mtrr', 'osxsave', 'pae', 'pat', 'pbe', 'pcid', 'pclmulqdq', 'pdcm', 'pge', 'pni', 'popcnt', 'pse', 'pse36', 'rdrnd', 'sep', 'serial', 'smep', 'ss', 'sse', 'sse2', 'sse4_1', 'sse4_2', 'ssse3', 'tm', 'tm2', 'tsc', 'vme', 'xsave', 'xtpr']
    l2_cache_line_size: 256
    l2_cache_associativity: 6

                Time comparisons for CPU vs GPU implementations:

CPU execution time:                    0.327143s running 1 loops
CPU full setup/execution time:         0.980959s
GPU kernel execution time:             0.088015s running 1 loops
GPU full kernel setup/execution time:  0.868085s
                Outputs are equivalent
CPU execution time:                    1.539007s running 5 loops
CPU full setup/execution time:         2.134781s
GPU kernel execution time:             0.097627s running 5 loops
GPU full kernel setup/execution time:  0.695104s
                Outputs are equivalent
CPU execution time:                    3.463488s running 10 loops
CPU full setup/execution time:         4.310506s
GPU kernel execution time:             0.122363s running 10 loops
GPU full kernel setup/execution time:  0.655500s
                Outputs are equivalent
CPU execution time:                    6.416840s running 20 loops
CPU full setup/execution time:         7.011254s
GPU kernel execution time:             0.158903s running 20 loops
GPU full kernel setup/execution time:  0.723226s
                Outputs are equivalent
CPU execution time:                    9.285086s running 30 loops
CPU full setup/execution time:         9.890282s
GPU kernel execution time:             0.209807s running 30 loops
GPU full kernel setup/execution time:  0.728618s
                Outputs are equivalent
CPU execution time:                    12.610949s running 40 loops
CPU full setup/execution time:         13.177427s
GPU kernel execution time:             0.253696s running 40 loops
GPU full kernel setup/execution time:  0.836536s
                Outputs are equivalent
CPU execution time:                    15.376767s running 50 loops
CPU full setup/execution time:         15.976361s
GPU kernel execution time:             0.289626s running 50 loops
GPU full kernel setup/execution time:  0.841918s
                Outputs are equivalent

Process finished with exit code 0

If I'm being honest, these results both agree and disagree with my expectations. I had expected the CPU implementation to outperform the GPU at least for the single-loop calls, but it doesn't. The seemingly linear increase in time cost for the CPU as the number of loops increased was expected, though.

As for the GPU performance, I really don't know why the time cost seems to grow only logarithmically as the loop count increases (I would have to plot the data points to see it more clearly).

Regardless, the results you see will vary according to your machine, but I would be curious at what CUDA compute capability the GPU results fall to the level of the CPU.
