NUMBA CUDA slower than parallel CPU even for giant matrices
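There are only a few examples online of using CUDA with Numba, and I find all of them slower than the parallel CPU approach. Vectorize with a CUDA target and stencils are even worse, so I tried writing a custom kernel. The one blog post you find everywhere is https://gist.github.com/mrocklin/9272bf84a8faffdbbe2cd44b4bc4ce3c . This example is a simple blur filter: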
import math
import time
import timeit

import numpy as np
import numba.cuda
from numba import njit, prange, cuda

@numba.cuda.jit
def smooth_gpu(x, out):
    i, j = cuda.grid(2)
    n, m = x.shape
    if 1 <= i < n - 1 and 1 <= j < m - 1:
        out[i, j] = (x[i - 1, j - 1] + x[i - 1, j] + x[i - 1, j + 1] +
                     x[i    , j - 1] + x[i    , j] + x[i    , j + 1] +
                     x[i + 1, j - 1] + x[i + 1, j] + x[i + 1, j + 1]) / 9

x_gpu = np.ones((10000, 10000), dtype='float32')
out_gpu = np.zeros((10000, 10000), dtype='float32')
threadsperblock = (16, 16)
blockspergrid_x = math.ceil(x_gpu.shape[0] / threadsperblock[0])
blockspergrid_y = math.ceil(x_gpu.shape[1] / threadsperblock[1])
blockspergrid = (blockspergrid_x, blockspergrid_y)

# run on gpu
smooth_gpu[blockspergrid, threadsperblock](x_gpu, out_gpu)  # compile before measuring time
start_time = time.time()
smooth_gpu[blockspergrid, threadsperblock](x_gpu, out_gpu)
print("GPU Time: {0:1.6f}s ".format(time.time() - start_time))
And the CPU version:
x_cpu = np.ones((10000, 10000), dtype='float32')
out_cpu = np.zeros((10000, 10000), dtype='float32')

@njit(nopython=True, parallel=True)
def smooth_cpu(x, out_cpu):
    for i in prange(1, np.shape(x)[0] - 1):
        for j in range(1, np.shape(x)[1] - 1):
            out_cpu[i, j] = (x[i - 1, j - 1] + x[i - 1, j] + x[i - 1, j + 1] +
                             x[i    , j - 1] + x[i    , j] + x[i    , j + 1] +
                             x[i + 1, j - 1] + x[i + 1, j] + x[i + 1, j + 1]) / 9

# run on cpu
smooth_cpu(x_cpu, out_cpu)  # compile before measuring time
start_time = time.time()
smooth_cpu(x_cpu, out_cpu)
print("CPU Time: {0:1.6f}s ".format(time.time() - start_time))
The GPU version takes about 500 ms and the CPU version about 50 ms. What is going on?
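I would point out two things:
1. You are including in the GPU timing the time needed to transfer the input array from host to device and the result from device to host. If that is the intent of your comparison, so be it; the conclusion is that the GPU is not suited to this task (not in any interesting way).
2. The GPU code, while it gives the correct result, is not organized for good performance. The problem lies here:
    i, j = cuda.grid(2)
coupled with the order in which these indices are used to access the data:
    out[i, j] = (x[i - 1, j - 1]...
This results in inefficient memory access on the GPU: the first index returned by cuda.grid(2) follows threadIdx.x, which varies fastest between adjacent threads, yet here it selects the row, so neighboring threads touch addresses a full row apart instead of adjacent elements. We can fix this by reversing either one of the two orderings shown above.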
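To make the access-pattern argument concrete, here is a minimal pure-Python model (no GPU required) that counts how many memory segments the first warp of a 16x16 block touches under each index mapping. The 128-byte segment size and the helper name segments_touched are assumptions chosen for illustration, not something taken from the question or from Numba.

```python
# Pure-Python model of which memory segments one warp touches.
# Assumptions (not measurements): 32-thread warps, 128-byte segments,
# a row-major float32 array with 10000 columns, and a 16x16 thread block.
WARP_SIZE = 32
SEGMENT_BYTES = 128
ITEM_BYTES = 4            # float32
BLOCK_X = 16
N_COLS = 10000


def segments_touched(first_index_is_row):
    """Count distinct segments read by the first warp of block (0, 0)."""
    segments = set()
    for lane in range(WARP_SIZE):
        # Threads in a block are linearised with threadIdx.x varying fastest.
        tx, ty = lane % BLOCK_X, lane // BLOCK_X
        # cuda.grid(2) returns (x, y); the kernel decides which one is the row.
        row, col = (tx, ty) if first_index_is_row else (ty, tx)
        byte_offset = (row * N_COLS + col) * ITEM_BYTES
        segments.add(byte_offset // SEGMENT_BYTES)
    return len(segments)


original = segments_touched(first_index_is_row=True)    # i, j = cuda.grid(2)
swapped = segments_touched(first_index_is_row=False)    # j, i = cuda.grid(2)
print("segments per warp, original mapping:", original)
print("segments per warp, swapped mapping: ", swapped)
```

Under these assumptions the swapped mapping touches far fewer segments per warp, which is the effect the index reversal is after. Real hardware adds caching and other details that this sketch ignores.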
With both issues taken into account, here is your code, slightly adjusted:
$ cat t29a.py
import numpy as np
import time
from numba import njit, prange, cuda
import timeit
import numba.cuda

x_cpu = np.ones((10000, 10000), dtype='float32')
out_cpu = np.zeros((10000, 10000), dtype='float32')

@njit(parallel=True)
def smooth_cpu(x, out_cpu):
    for i in prange(1, x.shape[0] - 1):
        for j in range(1, x.shape[1] - 1):
            out_cpu[i, j] = (x[i - 1, j - 1] + x[i - 1, j] + x[i - 1, j + 1] +
                             x[i    , j - 1] + x[i    , j] + x[i    , j + 1] +
                             x[i + 1, j - 1] + x[i + 1, j] + x[i + 1, j + 1]) / 9

# run on cpu
smooth_cpu(x_cpu, out_cpu)  # compile before measuring time
start_time = time.time()
smooth_cpu(x_cpu, out_cpu)
print("CPU Time: {0:1.6f}s ".format(time.time() - start_time))
$ python t29a.py
CPU Time: 0.161944s
$ cat t29.py
import numpy as np
import time
from numba import njit, prange, cuda
import timeit
import numba.cuda
import math

@numba.cuda.jit
def smooth_gpu(x, out):
    # j follows threadIdx.x (fastest-varying) and indexes the contiguous last axis
    j, i = cuda.grid(2)
    n, m = x.shape  # n rows (indexed by i), m columns (indexed by j)
    if 1 <= i < n - 1 and 1 <= j < m - 1:
        out[i, j] = (x[i - 1, j - 1] + x[i - 1, j] + x[i - 1, j + 1] +
                     x[i    , j - 1] + x[i    , j] + x[i    , j + 1] +
                     x[i + 1, j - 1] + x[i + 1, j] + x[i + 1, j + 1]) / 9

x = np.ones((10000, 10000), dtype='float32')
out = np.zeros((10000, 10000), dtype='float32')
# move the data to the device up front so transfers are not part of the timing
x_gpu = cuda.to_device(x)
out_gpu = cuda.device_array_like(out)
threadsperblock = (16, 16)
# grid x dimension covers the columns (j), grid y dimension covers the rows (i)
blockspergrid_x = math.ceil(x_gpu.shape[1] / threadsperblock[0])
blockspergrid_y = math.ceil(x_gpu.shape[0] / threadsperblock[1])
blockspergrid = (blockspergrid_x, blockspergrid_y)

# run on gpu
smooth_gpu[blockspergrid, threadsperblock](x_gpu, out_gpu)  # compile before measuring time
cuda.synchronize()
start_time = time.time()
smooth_gpu[blockspergrid, threadsperblock](x_gpu, out_gpu)
cuda.synchronize()  # kernel launches are asynchronous; wait before stopping the clock
print("GPU Time: {0:1.6f}s ".format(time.time() - start_time))
$ python t29.py
GPU Time: 0.021776s
$
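So we see that if we adjust for both issues, the GPU (a GTX 960 in my case) is roughly 7 to 8 times faster than the CPU (0.022 s versus 0.162 s). Such measurements depend to some extent on the CPU and GPU being compared; you should not assume my measurements are comparable to yours, and it is best to run these modified codes yourself. However, the data transfer time is certainly much larger than the GPU computation time, and in my case it also exceeds the CPU computation time. This means (at least in my case, on a system that is not particularly fast in any respect) that even if we reduced the GPU computation time to zero, the cost of transferring the data would still exceed the CPU computation cost.
So when you run into this situation, there is no way to win. The only advice that can be given then is "don't do that", i.e. find a more interesting, more complex problem for the GPU to solve. If we make the problem computationally very simple, like this one or like vector addition, and that is the only thing you want to do on the GPU, the comparison with doing it on the CPU is almost never an interesting one. Hopefully you can see that making the matrix bigger does not help much here, because it also increases the data transfer time/cost.
If we factor out the data transfer cost (and do not make crippling performance mistakes in our GPU code), then according to my testing the GPU is faster than the CPU. If we include the data transfer cost, for this very simple problem it is quite possible that the GPU cannot be faster than the CPU at all (even if the GPU computation time were reduced to zero).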
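As a back-of-the-envelope illustration of the transfer argument, the sketch below computes the effective host-device bandwidth at which one input copy plus one result copy plus the kernel would just match the CPU time. It uses the two timings printed above; the function names are hypothetical, and the model deliberately ignores pinned memory, transfer/compute overlap, and launch overhead.

```python
import math


def transfer_bytes(rows, cols, itemsize=4):
    """Bytes moved for one input copy to the device plus one result copy back."""
    return 2 * rows * cols * itemsize


def break_even_bandwidth(cpu_seconds, gpu_kernel_seconds, rows, cols):
    """Effective bytes/s needed so that transfers + kernel match the CPU time."""
    budget = cpu_seconds - gpu_kernel_seconds
    if budget <= 0:
        return math.inf  # the kernel alone is already slower than the CPU
    return transfer_bytes(rows, cols) / budget


CPU_S = 0.161944   # CPU time printed by t29a.py above
GPU_S = 0.021776   # GPU kernel time printed by t29.py above

needed = break_even_bandwidth(CPU_S, GPU_S, 10000, 10000)
print(f"break-even bandwidth: {needed / 1e9:.2f} GB/s")

# Assumption: both timings grow in proportion to the element count.
# A 4x larger matrix then needs the same bandwidth, so size alone does not help.
scaled = break_even_bandwidth(4 * CPU_S, 4 * GPU_S, 20000, 20000)
print(f"break-even bandwidth at 4x the elements: {scaled / 1e9:.2f} GB/s")
```

With these numbers the threshold comes out at roughly 5.7 GB/s of sustained, two-way effective bandwidth. Any system whose real transfers are slower than that loses to the CPU on this problem no matter how fast the kernel is.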
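There is no doubt that more could be done to improve the GPU kernel slightly (e.g. changing the thread block shape, using shared memory, etc.), but I personally do not wish to spend my time polishing uninteresting things.
You can find more description of Numba GPU memory management here.
A general description of the memory-efficiency issue related to index ordering is here.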
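I found this comparison interesting and wanted to look at the effect of reusing compiled kernels, CUDA streams, and randomized data, to make sure no fancy compiler optimization was skewing what we see.
I modified the code sample posted by Robert Crovella and ran the script on a modest ML rig at school: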
import numpy as np
from time import perf_counter
from numba import njit, prange, cuda
# cpuinfo is a third party package from here:
# https://github.com/workhorsy/py-cpuinfo
# or you can just install it using pip with:
# python -m pip install -U py-cpuinfo
from cpuinfo import get_cpu_info

print("Some diagnostic info for the system running this script:")
# prints information about the cuda GPU
cuda.detect()
print()
# get_cpu_info() returns a dict describing the cpu
s = get_cpu_info()
print("Cpu info")
for k, v in s.items():
    print(f"\t{k}: {v}")
print()

cpu_s1 = "CPU execution time:"
cpu_s2 = "CPU full setup/execution time:"
gpu_s1 = "GPU kernel execution time:"
gpu_s2 = "GPU full kernel setup/execution time:"
l = len(gpu_s2) + 1

# using randomized floats to ensure there isn't some compiler optimization that
# recognizes that all values of the x array are constant 1's and does something
# goofy under the hood. Each timing scenario will then use a copy of this array.
common_x = np.random.random((10000, 10000)).astype(np.float32)
def time_njit(n_loops=2):
    start_time_full_function = perf_counter()

    @njit(parallel=True, nogil=True)
    def smooth_cpu(x, out):
        h, w = x.shape
        for i in prange(1, h - 1):
            for j in range(1, w - 1):
                out[i, j] = (x[i - 1, j - 1] + x[i - 1, j] +
                             x[i - 1, j + 1] + x[i    , j - 1] +
                             x[i    , j]     + x[i    , j + 1] +
                             x[i + 1, j - 1] + x[i + 1, j] +
                             x[i + 1, j + 1]) / 9

    pre_x = np.ones((10, 10), dtype=common_x.dtype)
    pre_out = np.ones((10, 10), dtype=common_x.dtype)
    _x = common_x.copy()
    _out = np.zeros_like(_x)
    # run on cpu
    smooth_cpu(pre_x, pre_out)  # compile before measuring time
    start_time = perf_counter()
    for _ in range(n_loops):
        # realistically, we wouldn't typically run just a single blurring pass
        smooth_cpu(_x, _out)
        smooth_cpu(_out, _x)
    end_time = perf_counter()
    end_time_full_function = perf_counter()
    print(f"{cpu_s1:<{l}} {end_time - start_time:1.6f}s running {n_loops} loops"
          f"\n{cpu_s2:<{l}} {end_time_full_function - start_time_full_function:1.6f}s")
    return _x
def time_cuda(n_loops=2):
    """There is room for optimization in how we use cuda.shared.array memory on the GPU
    -- where I'm not aware of any analogous tricks for the cpu function -- that would
    allow us to minimize the number of times each thread-block needs to access data in
    the GPU's global memory. But such an implementation would take us deeper into the
    weeds than this toy problem calls for.
    Maybe if I need to take a break from my other work later I'll come back to this
    and flesh out an example of what I mean.
    """
    start_time_full_function = perf_counter()

    @cuda.jit
    def smooth_gpu(x, out):
        """slight change to the cuda kernel. This version uses **striding** to reduce
        processor overhead spent allocating and deallocating a lot of thread blocks
        that ultimately have each thread compute a single calculation before being
        disposed of.
        This way we offset some of the overhead cost spent on block allocation by
        making each block do a bit more work.
        Note: For this to work right, we have to allocate fewer blocks with
        our `blockspergrid_j` and `blockspergrid_i` variables.
        """
        jstart, istart = cuda.grid(2)
        jstep, istep = cuda.gridsize(2)
        rows, cols = x.shape
        # note that for strided kernels, thread indices
        # are completely independent of the data size/shape
        for i in range(istart + 1, rows - 1, istep):
            for j in range(jstart + 1, cols - 1, jstep):
                # Because x and out use row-major (C order, NumPy's default) memory
                # ordering, we want to make sure the most frequently changing index (j)
                # is iterating through the last dimension of the array.
                out[i, j] = (x[i - 1, j - 1] + x[i - 1, j] + x[i - 1, j + 1] +
                             x[i    , j - 1] + x[i    , j] + x[i    , j + 1] +
                             x[i + 1, j - 1] + x[i + 1, j] + x[i + 1, j + 1]) / 9

    _x = common_x.copy()
    _out = np.zeros_like(_x)
    stream = cuda.stream()
    x_gpu = cuda.to_device(_x, stream)
    out_gpu = cuda.to_device(_out, stream)
    tpbj = 16
    tpbi = 16
    threadsperblock = tpbj, tpbi
    blockspergrid_j = (_x.shape[0] + tpbj - 1) // tpbj
    blockspergrid_i = (_x.shape[1] + tpbi - 1) // tpbi
    # reduce the number of blocks in each axis
    # by a quarter to give room for striding
    blockspergrid = (blockspergrid_j // 4, blockspergrid_i // 4)
    # run on gpu
    # Reuse one launch configuration. Note: indexing only configures the launch;
    # Numba compiles a signature-less @cuda.jit kernel lazily on its first call.
    compiled = smooth_gpu[blockspergrid, threadsperblock, stream]
    start_time = perf_counter()
    for _ in range(n_loops):
        # realistically, we wouldn't typically run just a single blurring pass
        compiled(x_gpu, out_gpu)
        compiled(out_gpu, x_gpu)
    x_gpu.copy_to_host(_out, stream)
    stream.synchronize()
    end_time = perf_counter()
    end_time_full_function = perf_counter()
    print(f"{gpu_s1:<{l}} {end_time-start_time:1.6f}s running {n_loops} loops"
          f"\n{gpu_s2:<{l}} {end_time_full_function-start_time_full_function:1.6f}s")
    return _out
if __name__ == '__main__':
    a = time_njit(1)
    b = time_cuda(1)
    assert np.allclose(a, b), "The two functions didn't actually compute the same results"
    print(f"{' '*4}Outputs are equivalent")
    a = time_njit(5)
    b = time_cuda(5)
    assert np.allclose(a, b), "The two functions didn't actually compute the same results"
    print(f"{' '*4}Results are equivalent")
    a = time_njit(10)
    b = time_cuda(10)
    assert np.allclose(a, b), "The two functions didn't actually compute the same results"
    print(f"{' '*4}Results are equivalent")
    a = time_njit(20)
    b = time_cuda(20)
    assert np.allclose(a, b), "The two functions didn't actually compute the same results"
    print(f"{' '*4}Results are equivalent")
Some diagnostic info for the system running this script:
Found 1 CUDA devices
id 0 b'GeForce RTX 2080 Ti' [SUPPORTED]
compute capability: 7.5
pci device id: 0
pci bus id: 1
Summary:
1/1 devices are supported
Cpu info:
python_version: 3.8.8.final.0 (64 bit)
cpuinfo_version: [7, 0, 0]
cpuinfo_version_string: 7.0.0
arch: X86_64
bits: 64
count: 8
arch_string_raw: AMD64
vendor_id_raw: GenuineIntel
brand_raw: Intel(R) Core(TM) i7-4790K CPU @ 4.00GHz
hz_advertised_friendly: 4.0000 GHz
hz_actual_friendly: 4.0010 GHz
hz_advertised: [4000000000, 0]
hz_actual: [4001000000, 0]
l2_cache_size: 1048576
stepping: 3
model: 60
family: 6
l3_cache_size: 8388608
flags: ['3dnow', 'abm', 'acpi', 'aes', 'apic', 'avx', 'avx2', 'bmi1', 'bmi2', 'clflush', 'cmov', 'cx16', 'cx8', 'de', 'dts', 'erms', 'est', 'f16c', 'fma', 'fpu', 'fxsr', 'ht', 'hypervisor', 'ia64', 'invpcid', 'lahf_lm', 'mca', 'mce', 'mmx', 'movbe', 'msr', 'mtrr', 'osxsave', 'pae', 'pat', 'pbe', 'pcid', 'pclmulqdq', 'pdcm', 'pge', 'pni', 'popcnt', 'pse', 'pse36', 'rdrnd', 'sep', 'serial', 'smep', 'ss', 'sse', 'sse2', 'sse4_1', 'sse4_2', 'ssse3', 'tm', 'tm2', 'tsc', 'vme', 'xsave', 'xtpr']
l2_cache_line_size: 256
l2_cache_associativity: 6
Time comparisons for CPU vs GPU implementations:
CPU execution time: 0.327143s running 1 loops
CPU full setup/execution time: 0.980959s
GPU kernel execution time: 0.088015s running 1 loops
GPU full kernel setup/execution time: 0.868085s
Outputs are equivalent
CPU execution time: 1.539007s running 5 loops
CPU full setup/execution time: 2.134781s
GPU kernel execution time: 0.097627s running 5 loops
GPU full kernel setup/execution time: 0.695104s
Outputs are equivalent
CPU execution time: 3.463488s running 10 loops
CPU full setup/execution time: 4.310506s
GPU kernel execution time: 0.122363s running 10 loops
GPU full kernel setup/execution time: 0.655500s
Outputs are equivalent
CPU execution time: 6.416840s running 20 loops
CPU full setup/execution time: 7.011254s
GPU kernel execution time: 0.158903s running 20 loops
GPU full kernel setup/execution time: 0.723226s
Outputs are equivalent
CPU execution time: 9.285086s running 30 loops
CPU full setup/execution time: 9.890282s
GPU kernel execution time: 0.209807s running 30 loops
GPU full kernel setup/execution time: 0.728618s
Outputs are equivalent
CPU execution time: 12.610949s running 40 loops
CPU full setup/execution time: 13.177427s
GPU kernel execution time: 0.253696s running 40 loops
GPU full kernel setup/execution time: 0.836536s
Outputs are equivalent
CPU execution time: 15.376767s running 50 loops
CPU full setup/execution time: 15.976361s
GPU kernel execution time: 0.289626s running 50 loops
GPU full kernel setup/execution time: 0.841918s
Outputs are equivalent
Process finished with exit code 0
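Honestly, these results partly match and partly contradict my expectations. I had expected that at least the single-loop function call would see the CPU implementation beat the GPU implementation, but that was not the case. :v/ The seemingly linear increase in CPU time cost as the loop count grows, however, is as expected.
As for the GPU's performance, I really don't know why the time cost of increasing the loop count appears to grow logarithmically (I would have to plot the data points to see it more clearly).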
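One way to check that impression without plotting is a quick least-squares fit of the GPU kernel timings printed above, once against the loop count and once against its logarithm. This is only a sketch over the seven reported numbers; the helper fit_line is a hypothetical name, and a fit over so few points says nothing about why the timings behave as they do.

```python
import math

# GPU kernel execution times copied from the output above.
loops = [1, 5, 10, 20, 30, 40, 50]
gpu_seconds = [0.088015, 0.097627, 0.122363, 0.158903,
               0.209807, 0.253696, 0.289626]


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x, plus R^2."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return slope, intercept, 1.0 - ss_res / ss_tot


# Model 1: time grows linearly with the number of loops.
slope, intercept, r2_linear = fit_line(loops, gpu_seconds)
# Model 2: time grows with the logarithm of the number of loops.
_, _, r2_log = fit_line([math.log(n) for n in loops], gpu_seconds)

print(f"linear fit: {slope * 1e3:.2f} ms per loop, "
      f"{intercept * 1e3:.1f} ms fixed, R^2 = {r2_linear:.4f}")
print(f"log fit:    R^2 = {r2_log:.4f}")
```

On these seven points the straight line fits more closely than the logarithmic curve, with roughly 80 ms of fixed cost and roughly 4 ms per loop. That pattern is consistent with a one-off overhead in the timed region (for example the final copy back to the host) plus a constant per-loop kernel cost, though only profiling could confirm that reading.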
Anyway, the results you see will vary with your machine, but I am curious at what CUDA compute level the GPU results would drop to match those of the CPU.
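For readers unfamiliar with the striding used in smooth_gpu, the following pure-Python simulation mimics the kernel's two strided loops for every simulated thread and counts the writes per cell. The function name strided_visits and the small array and grid sizes are illustrative assumptions; the point is only that a grid smaller than the data still covers each interior cell exactly once.

```python
def strided_visits(rows, cols, grid_x, grid_y):
    """Count writes per cell when grid_x * grid_y simulated threads
    each run the same strided loops as the kernel."""
    visits = [[0] * cols for _ in range(rows)]
    for jstart in range(grid_x):        # stands in for cuda.grid(2)[0]
        for istart in range(grid_y):    # stands in for cuda.grid(2)[1]
            # The step equals the total thread count along each axis,
            # which is what cuda.gridsize(2) reports in the kernel.
            for i in range(istart + 1, rows - 1, grid_y):
                for j in range(jstart + 1, cols - 1, grid_x):
                    visits[i][j] += 1
    return visits


ROWS, COLS = 37, 53                      # deliberately not multiples of the grid
visits = strided_visits(ROWS, COLS, grid_x=8, grid_y=4)

interior_ok = all(visits[i][j] == 1
                  for i in range(1, ROWS - 1)
                  for j in range(1, COLS - 1))
border_ok = all(visits[i][j] == 0
                for i in range(ROWS)
                for j in range(COLS)
                if i in (0, ROWS - 1) or j in (0, COLS - 1))
print("every interior cell written exactly once:", interior_ok)
print("border cells left untouched:", border_ok)
```

Because coverage does not depend on the grid size, the number of blocks can be tuned freely (the script above divides it by four per axis) without touching the kernel body.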
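Notice: the technical posts on this site are distributed under the CC BY-SA 4.0 license. If you reproduce them, please cite this site's URL or the original source. For any questions, contact: yoyou2525@163.com.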