CuPy: matrix-vector multiplication is faster than vector-vector multiplication and L2 norm for small sizes

I am porting my CPU code to the GPU. While optimizing it, I found some counterintuitive performance behavior:

Consider the simple task of computing a vector's L2 norm. For vectors with a large number of elements the performance scales as expected, but for a small number of elements (256) it does not:

import cupy as cp
a=cp.random.rand(256)

%timeit cp.linalg.norm(a)
32.3 µs ± 159 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Now, let's compare it with matrix-vector dot product:

b=cp.random.rand(256,256)
%timeit cp.dot(a,b)
8.36 µs ± 80.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

You can see that, unexpectedly, the matrix-vector product is 4 times faster. Why is this the case?

I started digging into this toy problem. First, I created a custom reduction kernel:

l2norm = cp.ReductionKernel('T x', 'T y', 'x * x', 'a + b', 'y = sqrt(a)', '0', 'l2norm')
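
Called with just the input array, the kernel reduces over the whole vector and returns a 0-d array; a quick usage sketch:

out = l2norm(a)    # 0-d array holding sqrt(sum(x * x))
print(float(out))  # implicit device-to-host copy of the scalar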

With this kernel my execution time was ~17 microseconds, twice as fast as linalg.norm, but still twice as slow as the matrix-vector dot product. I am convinced that this kernel is well optimized, so a C++ Thrust implementation would not do much better.

I also tried calculating the norm with cp.sqrt(a.dot(a)). This turned out to be very inefficient: the vector-vector dot product a.dot(a) takes longer than the matrix-vector product a.dot(b)!

I do understand that at this small problem size the performance is bandwidth-limited, so a significant portion of the time can be spent creating arrays and copying/fetching data rather than on arithmetic. But even in this case I would expect the L2 norm to be a little faster than the matrix-vector product, as it requires only O(N) operations and fetches and its result is a single number. For the matrix-vector product I don't even preallocate the result, I do O(N^2) operations, and I fetch O(N^2) numbers from memory.
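
A back-of-the-envelope count of the memory traffic (float64, N = 256) makes the puzzle concrete: the product touches roughly 256 times more data than the norm, yet runs faster, so something other than bandwidth must dominate.

N = 256
print(N * 8)      # bytes the norm must read: 2048 (~2 KB)
print(N * N * 8)  # bytes the matrix-vector product must read: 524288 (~512 KB)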

With a large number of elements (>1000), the performance scales as expected.

Ubuntu 18.04, Anaconda distribution, Python 3.8.3, CuPy 8.2, nvcc 11.0.

First, you are only measuring CPU time. Kernels are executed asynchronously, so your %timeit numbers include only the time spent preparing and submitting the kernel launch; they do not wait for the actual kernel execution to finish.
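
To see the effect directly, you can force a device synchronization inside the timed call; a minimal sketch (the norm_sync wrapper name is mine):

def norm_sync(x):
    r = cp.linalg.norm(x)
    cp.cuda.Device().synchronize()  # block until the GPU has actually finished
    return r

%timeit norm_sync(a)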

If we change the code to take this into account by measuring with cupyx.time.repeat, which synchronizes and reports CPU and GPU times separately, we get:

import cupy as cp
import cupyx

a = cp.random.rand(256)
cp.linalg.norm(a)  # warm-up call so compilation/caching is not timed
print(cupyx.time.repeat(cp.linalg.norm, (a,)))
b = cp.random.rand(256, 256)
print(cupyx.time.repeat(cp.dot, (a, b)))
c = cp.zeros(())  # preallocated 0-d output array for the reduction kernel
l2norm = cp.ReductionKernel(
    "T x", "T y", "x * x", "a + b", "y = sqrt(a)", "0", "l2norm"
)
print(cupyx.time.repeat(l2norm, (a, c)))

And the results are

norm                :    CPU:   32.077 us   +/- 2.206 (min:   30.961 / max:   64.160) us     GPU-0:   36.275 us   +/- 2.223 (min:   34.880 / max:   68.512) us
dot                 :    CPU:    9.572 us   +/- 0.261 (min:    9.235 / max:   15.934) us     GPU-0:   13.640 us   +/- 0.347 (min:   12.896 / max:   21.440) us
l2norm              :    CPU:   10.216 us   +/- 0.578 (min:    9.847 / max:   23.790) us     GPU-0:   14.396 us   +/- 0.591 (min:   13.504 / max:   27.936) us

cupy.linalg.norm launches several kernels to calculate the norm, hence the high CPU time of 32 us and the accumulated 36 us of GPU time. Here the array is so small that this is mostly the constant launch overhead of the several kernels adding up.
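
Roughly speaking (an illustrative approximation, not the exact CuPy source), the default norm of a 1-D real array is computed as a chain of separate kernels, each carrying a fixed launch cost of a few microseconds:

s = cp.square(a)  # elementwise kernel: x * x
t = s.sum()       # reduction kernel: sum of squares
r = cp.sqrt(t)    # elementwise kernel on a 0-d array

At N = 256 these fixed costs dominate the actual work.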

dot just calls the corresponding cuBLAS function, so its CPU time is greatly reduced and the GPU time is quite low, but at this size the GPU time is still almost pure overhead.

Finally, your reduction kernel has a bit more CPU time because of the steps needed to prepare and launch the generated kernel, but its GPU execution time is roughly the same as the dot product's.
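
Note that the CUDA source generated by ReductionKernel is compiled by NVRTC only on the first invocation and cached afterwards, so the per-call CPU time above is argument handling plus the launch itself. A quick way to see the one-time compilation cost (the l2norm_fresh name is mine; on repeated runs CuPy's on-disk kernel cache can make even the first call fast):

import time

fresh = cp.ReductionKernel(
    "T x", "T y", "x * x", "a + b", "y = sqrt(a)", "0", "l2norm_fresh"
)
t0 = time.perf_counter()
fresh(a, c)                      # may include NVRTC compilation
cp.cuda.Device().synchronize()
print("first call : %.0f us" % ((time.perf_counter() - t0) * 1e6))
t0 = time.perf_counter()
fresh(a, c)                      # reuses the cached, compiled kernel
cp.cuda.Device().synchronize()
print("second call: %.0f us" % ((time.perf_counter() - t0) * 1e6))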

If we increase the array size to 4096, these are the results:

norm                :    CPU:   31.637 us   +/- 2.200 (min:   30.487 / max:   62.955) us     GPU-0:   35.741 us   +/- 2.215 (min:   34.336 / max:   67.008) us
dot                 :    CPU:    9.547 us   +/- 3.753 (min:    9.051 / max:  370.309) us     GPU-0:  244.535 us   +/- 3.791 (min:  241.952 / max:  598.624) us
l2norm              :    CPU:   10.170 us   +/- 0.542 (min:    9.845 / max:   17.006) us     GPU-0:   16.106 us   +/- 0.725 (min:   15.168 / max:   29.600) us

Note that the GPU execution time only changes for the dot product, which now does O(N^2) work; this is consistent with your observations :). For the other kernels, the size is still too small for the actual kernel execution time to be significant compared to the launch overhead.
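
To see both regimes at once, we can sweep the size (a sketch; exact numbers and crossover points depend on the GPU). The two reductions stay near the flat per-launch overhead floor, while dot grows with its O(N^2) arithmetic and memory traffic:

import cupy as cp
import cupyx

l2norm = cp.ReductionKernel(
    "T x", "T y", "x * x", "a + b", "y = sqrt(a)", "0", "l2norm"
)
for n in (256, 1024, 4096, 8192):
    a = cp.random.rand(n)
    b = cp.random.rand(n, n)  # the 8192 case allocates a ~512 MB matrix
    c = cp.zeros(())
    print("N =", n)
    print(cupyx.time.repeat(cp.linalg.norm, (a,), n_repeat=1000))
    print(cupyx.time.repeat(cp.dot, (a, b), n_repeat=1000))
    print(cupyx.time.repeat(l2norm, (a, c), n_repeat=1000))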
