CUDA GPU is slower than CPU in simple numpy operation

Question

I am using this code, based on this article, to see the GPU acceleration, but all I can see is a slowdown:

import numpy as np
from timeit import default_timer as timer
from numba import vectorize
import sys

if len(sys.argv) != 3:
    exit("Usage: " + sys.argv[0] + " [cuda|cpu] N(100000-11500000)")


@vectorize(["float32(float32, float32)"], target=sys.argv[1])
def VectorAdd(a, b):
    return a + b

def main():
    N = int(sys.argv[2])
    A = np.ones(N, dtype=np.float32)
    B = np.ones(N, dtype=np.float32)

    start = timer()
    C = VectorAdd(A, B)
    elapsed_time = timer() - start
    #print("C[:5] = " + str(C[:5]))
    #print("C[-5:] = " + str(C[-5:]))
    print("Time: {}".format(elapsed_time))

main()

The results:

$ python speed.py cpu 100000
Time: 0.0001056949986377731
$ python speed.py cuda 100000
Time: 0.11871792199963238

$ python speed.py cpu 11500000
Time: 0.013704434997634962
$ python speed.py cuda 11500000
Time: 0.47120747699955245

I cannot send a bigger vector, as that raises a numba.cuda.cudadrv.driver.CudaAPIError: Call to cuLaunchKernel results in CUDA_ERROR_INVALID_VALUE exception.

The output of nvidia-smi is:

Fri Dec  8 10:36:19 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.98                 Driver Version: 384.98                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Quadro 2000D        Off  | 00000000:01:00.0  On |                  N/A |
| 30%   36C   P12    N/A /  N/A |    184MiB /   959MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0       933      G   /usr/lib/xorg/Xorg                            94MiB |
|    0       985      G   /usr/bin/gnome-shell                          86MiB |
+-----------------------------------------------------------------------------+

Details of the CPU:

$ lscpu
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              4
On-line CPU(s) list: 0-3
Thread(s) per core:  1
Core(s) per socket:  4
Socket(s):           1
NUMA node(s):        1
Vendor ID:           GenuineIntel
CPU family:          6
Model:               58
Model name:          Intel(R) Core(TM) i5-3550 CPU @ 3.30GHz
Stepping:            9
CPU MHz:             3300.135
CPU max MHz:         3700.0000
CPU min MHz:         1600.0000
BogoMIPS:            6600.27
Virtualization:      VT-x
L1d cache:           32K
L1i cache:           32K
L2 cache:            256K
L3 cache:            6144K
NUMA node0 CPU(s):   0-3
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm cpuid_fault epb tpr_shadow vnmi flexpriority ept vpid fsgsbase smep erms xsaveopt dtherm ida arat pln pts

The GPU is an Nvidia Quadro 2000D with 192 CUDA cores and 1 GB of RAM.

A more complex operation:

import numpy as np
from timeit import default_timer as timer
from numba import vectorize
import sys

if len(sys.argv) != 3:
    exit("Usage: " + sys.argv[0] + " [cuda|cpu] N()")


@vectorize(["float32(float32, float32)"], target=sys.argv[1])
def VectorAdd(a, b):
    return a * b

def main():
    N = int(sys.argv[2])
    A = np.zeros((N, N), dtype='f')
    B = np.zeros((N, N), dtype='f')
    A[:] = np.random.randn(*A.shape)
    B[:] = np.random.randn(*B.shape)

    start = timer()
    C = VectorAdd(A, B)
    elapsed_time = timer() - start
    print("Time: {}".format(elapsed_time))

main()

Results:

$ python complex.py cpu 3000
Time: 0.010573603001830634
$ python complex.py cuda 3000
Time: 0.3956961739968392
$ python complex.py cpu 30
Time: 9.693001629784703e-06
$ python complex.py cuda 30
Time: 0.10848476299725007

Any idea why?

Answer 1

Your array is probably too small, and the operation too simple, to offset the cost of transferring the data to and from the GPU. Another way to see it is that your timing is not a fair comparison: for the GPU run it measures the memory transfer time as well as the processing time.

Try a more challenging example, perhaps first an element-wise multiplication of large matrices and then a true matrix multiplication.

In the end, the strength of the GPU is performing many operations on the same data, so that you pay the data transfer cost only once.
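As a rough illustration of paying the transfer cost only once, the sketch below copies the inputs to the GPU a single time, runs the question's addition repeatedly on the device arrays, and copies one result back. It assumes numba with a working CUDA setup; the name time_gpu_add and the repeats parameter are made up for this example, and the function simply returns None when no GPU is usable. It uses numba's cuda.to_device, cuda.device_array_like, cuda.synchronize and copy_to_host.

```python
import time

try:
    import numpy as np
    from numba import cuda, vectorize
    GPU_OK = cuda.is_available()
except ImportError:  # numpy or numba missing: the sketch then does nothing
    GPU_OK = False


def time_gpu_add(n, repeats=10):
    """Return (transfer_seconds, compute_seconds), or None without a usable GPU.

    Hypothetical helper for illustration, not part of the original code.
    """
    if not GPU_OK:
        return None

    # Same element-wise function as in the question, compiled for the GPU.
    @vectorize(["float32(float32, float32)"], target="cuda")
    def vector_add(a, b):
        return a + b

    a = np.ones(n, dtype=np.float32)
    b = np.ones(n, dtype=np.float32)

    # Warm-up call so that one-off start-up work is not counted below.
    vector_add(a, b)

    t0 = time.perf_counter()
    d_a = cuda.to_device(a)            # host -> device, done once
    d_b = cuda.to_device(b)
    d_c = cuda.device_array_like(a)    # result buffer that stays on the GPU
    t1 = time.perf_counter()

    for _ in range(repeats):
        vector_add(d_a, d_b, out=d_c)  # operates on device arrays, no copies
    cuda.synchronize()                 # wait until the GPU has finished
    t2 = time.perf_counter()

    c = d_c.copy_to_host()             # device -> host, done once
    t3 = time.perf_counter()

    assert np.all(c == 2.0)
    return (t1 - t0) + (t3 - t2), t2 - t1


if __name__ == "__main__":
    print(time_gpu_add(1_000_000))
```

With the two timings reported separately, it becomes visible how much of the wall-clock time in the original script is copying rather than computing.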

Answer 2

Even though this example appears on Nvidia's web site to show "how to use the GPU", plain matrix addition will probably be slower on the GPU than on the CPU, primarily because of the overhead of copying the data to the GPU.

Even simple math calculations might be slower. Heavier computations can already show a gain. I've put my results together in an article showing the speed improvement with GPU, cuda, and numpy.

In a nutshell, the question is which is bigger:

CPU time

or

copy to GPU + GPU time + copy from GPU
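The comparison above can be checked with back-of-the-envelope arithmetic for the largest vector in the question. The sketch below only estimates the copy term; the bandwidth of 3 GB/s is an assumed, illustrative figure and not a measurement of this machine, and transfer_seconds is a hypothetical helper name.

```python
def transfer_seconds(n_elements, bytes_per_element=4, n_arrays=3,
                     bandwidth_bytes_per_s=3e9):
    """Estimate the time to move n_arrays float32 arrays over the bus.

    n_arrays=3 counts two inputs copied to the GPU and one result copied back.
    bandwidth_bytes_per_s is an ASSUMED value chosen for illustration only.
    """
    total_bytes = n_elements * bytes_per_element * n_arrays
    return total_bytes / bandwidth_bytes_per_s


N = 11_500_000                       # largest vector size used in the question
cpu_time = 0.013704434997634962      # CPU timing reported in the question
copy_time = transfer_seconds(N)      # 138,000,000 bytes at the assumed bandwidth

print("bytes moved:", N * 4 * 3)
print("estimated copy time: {:.3f} s".format(copy_time))
print("copy alone exceeds CPU time:", copy_time > cpu_time)
```

Under that assumption the copies alone take longer than the whole CPU computation, before any GPU time is added, which is consistent with the slowdown reported in the question.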


Answer 1: 6, 2017-12-08 09:11:19

Answer 2: 2, accepted, 2017-12-10 08:51:17
