CUDA GPU is slower than CPU in simple numpy operation

Question

I am using code based on this article to see GPU speedups, but all I can see is a slowdown:

import numpy as np
from timeit import default_timer as timer
from numba import vectorize
import sys

if len(sys.argv) != 3:
    exit("Usage: " + sys.argv[0] + " [cuda|cpu] N(100000-11500000)")


@vectorize(["float32(float32, float32)"], target=sys.argv[1])
def VectorAdd(a, b):
    return a + b

def main():
    N = int(sys.argv[2])
    A = np.ones(N, dtype=np.float32)
    B = np.ones(N, dtype=np.float32)

    start = timer()
    C = VectorAdd(A, B)
    elapsed_time = timer() - start
    #print("C[:5] = " + str(C[:5]))
    #print("C[-5:] = " + str(C[-5:]))
    print("Time: {}".format(elapsed_time))

main()

Results:

$ python speed.py cpu 100000
Time: 0.0001056949986377731
$ python speed.py cuda 100000
Time: 0.11871792199963238

$ python speed.py cpu 11500000
Time: 0.013704434997634962
$ python speed.py cuda 11500000
Time: 0.47120747699955245

I cannot send larger vectors, because that raises a numba.cuda.cudadrv.driver.CudaAPIError: Call to cuLaunchKernel results in CUDA_ERROR_INVALID_VALUE exception.

The output of nvidia-smi is

Fri Dec  8 10:36:19 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.98                 Driver Version: 384.98                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Quadro 2000D        Off  | 00000000:01:00.0  On |                  N/A |
| 30%   36C   P12    N/A /  N/A |    184MiB /   959MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0       933      G   /usr/lib/xorg/Xorg                            94MiB |
|    0       985      G   /usr/bin/gnome-shell                          86MiB |
+-----------------------------------------------------------------------------+

CPU details

$ lscpu
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              4
On-line CPU(s) list: 0-3
Thread(s) per core:  1
Core(s) per socket:  4
Socket(s):           1
NUMA node(s):        1
Vendor ID:           GenuineIntel
CPU family:          6
Model:               58
Model name:          Intel(R) Core(TM) i5-3550 CPU @ 3.30GHz
Stepping:            9
CPU MHz:             3300.135
CPU max MHz:         3700.0000
CPU min MHz:         1600.0000
BogoMIPS:            6600.27
Virtualization:      VT-x
L1d cache:           32K
L1i cache:           32K
L2 cache:            256K
L3 cache:            6144K
NUMA node0 CPU(s):   0-3
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm cpuid_fault epb tpr_shadow vnmi flexpriority ept vpid fsgsbase smep erms xsaveopt dtherm ida arat pln pts

The GPU is an Nvidia Quadro 2000D with 192 CUDA cores and 1 GB of RAM.

A more complex operation:

import numpy as np
from timeit import default_timer as timer
from numba import vectorize
import sys

if len(sys.argv) != 3:
    exit("Usage: " + sys.argv[0] + " [cuda|cpu] N()")


@vectorize(["float32(float32, float32)"], target=sys.argv[1])
def VectorAdd(a, b):
    return a * b

def main():
    N = int(sys.argv[2])
    A = np.zeros((N, N), dtype='f')
    B = np.zeros((N, N), dtype='f')
    A[:] = np.random.randn(*A.shape)
    B[:] = np.random.randn(*B.shape)

    start = timer()
    C = VectorAdd(A, B)
    elapsed_time = timer() - start
    print("Time: {}".format(elapsed_time))

main()

Results:

$ python complex.py cpu 3000
Time: 0.010573603001830634
$ python complex.py cuda 3000
Time: 0.3956961739968392
$ python complex.py cpu 30
Time: 9.693001629784703e-06
$ python complex.py cuda 30
Time: 0.10848476299725007

Any idea why?

Answer 1

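Your arrays are probably too small and the operation too simple to offset the cost of transferring data to and from the GPU. Another way to look at it is that your timing is not a fair comparison: for the GPU it also includes the memory transfer time, not just the processing time.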
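A rough back-of-the-envelope check of the transfer cost can be sketched as follows. The array size and the CPU timing come from the question; the 6 GB/s host-device bandwidth is an assumed, illustrative figure (not a measurement of the Quadro 2000D), and `transfer_seconds` is a hypothetical helper written for this example.

```python
import numpy as np

# ASSUMPTION: illustrative effective host<->device bandwidth, not measured.
ASSUMED_BANDWIDTH_BYTES_PER_S = 6e9


def transfer_seconds(n_elements, n_arrays=3, dtype=np.float32,
                     bandwidth=ASSUMED_BANDWIDTH_BYTES_PER_S):
    """Estimate the time to move n_arrays arrays across the bus (hypothetical helper)."""
    n_bytes = n_arrays * n_elements * np.dtype(dtype).itemsize
    return n_bytes / bandwidth


N = 11_500_000                    # largest size used in the question
cpu_time = 0.013704434997634962   # CPU timing reported in the question

# A and B go to the device and C comes back, so three arrays cross the bus.
estimate = transfer_seconds(N)
print("estimated transfer time: {:.4f} s".format(estimate))
print("CPU time for the whole add: {:.4f} s".format(cpu_time))
print("transfer alone exceeds CPU time: {}".format(estimate > cpu_time))
```

Under that assumed bandwidth, moving the data already takes longer than the CPU needs for the entire addition, so no kernel speed could make up the difference for this workload.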

Try some more demanding examples, perhaps first an element-wise multiplication of large matrices, and then a matrix multiplication.
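To see why those two suggestions differ so much, one can compare how much arithmetic each performs per byte that has to cross the bus. This is only a counting sketch: both function names are hypothetical, the matrix product is counted as the naive algorithm, and float32 (4 bytes) is assumed as in the question.

```python
def elementwise_flops_per_byte(n, itemsize=4):
    """Element-wise product of two n x n matrices: one multiply per output element."""
    flops = n * n
    bytes_moved = 3 * n * n * itemsize   # send A and B, fetch C
    return flops / bytes_moved


def matmul_flops_per_byte(n, itemsize=4):
    """Naive n x n matrix product: n multiplies and n - 1 adds per output element."""
    flops = n * n * (2 * n - 1)
    bytes_moved = 3 * n * n * itemsize   # the same three matrices cross the bus
    return flops / bytes_moved


for n in (30, 3000):                     # the sizes used in the question
    print(n, elementwise_flops_per_byte(n), matmul_flops_per_byte(n))
```

The element-wise ratio stays constant no matter how large the matrices get, while the matrix product's ratio grows with n, which is why the latter has a much better chance of repaying the transfer.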

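Finally, the strength of a GPU lies in performing many operations on the same data, so that you end up paying the data transfer cost only once.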
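That amortization argument can be written as a small cost model. It is a sketch only: the function names are hypothetical and the timings are made-up numbers for illustration, not measurements from the question. `transfer` stands for the combined cost of copying to and from the device.

```python
import math


def gpu_total(transfer, gpu_op, n_ops):
    """Pay the transfer once, then run n_ops operations on data kept on the device."""
    return transfer + n_ops * gpu_op


def break_even_ops(transfer, gpu_op, cpu_op):
    """Smallest number of operations at which the GPU total drops below the CPU total.

    Returns None when a single GPU operation is not faster than the CPU one,
    because then no amount of reuse recovers the transfer cost.
    """
    if gpu_op >= cpu_op:
        return None
    # Need transfer + k * gpu_op < k * cpu_op, i.e. k > transfer / (cpu_op - gpu_op).
    return math.floor(transfer / (cpu_op - gpu_op)) + 1


# ASSUMPTION: made-up timings in seconds, for illustration only.
transfer, gpu_op, cpu_op = 0.021, 0.001, 0.005
k = break_even_ops(transfer, gpu_op, cpu_op)
print("GPU wins once at least {} operations reuse the same data".format(k))
```

With a single operation per transfer, as in the question's benchmark, the model never reaches the break-even point.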

Answer 2

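Although the example is used on Nvidia's website to show "how to use the GPU", running a plain matrix operation on the GPU is likely to be slower, mainly because of the overhead of copying the data to the GPU.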

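Even simple math calculations can be slower. Heavier computations can already show a gain. I put the results together in an article that shows the speedup with GPU, CUDA and numpy.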

In short, the question is which is bigger:

CPU time

or

copy to GPU + GPU time + copy from GPU

2 solutions

Solution 1
6 2017-12-08 09:11:19

Solution 2
2 accepted 2017-12-10 08:51:17
