Why is multiplication on the GPU slower than on the CPU?

Question

Here is my code (simulating a feed-forward neural network):

import torch
import time

print(torch.cuda.is_available())    # True
device = torch.device('cuda:0')

a = torch.tensor([1,2,3,4,5,6]).float().reshape(-1,1)
w1 = torch.rand(120,6)
w2 = torch.rand(1,120)
b1 = torch.rand(120,1)
b2 = torch.rand(1,1).reshape(1,1)

start = time.time()
for _ in range(100000):
    ans = torch.mm(w2, torch.mm(w1,a)+b1)+b2
end = time.time()
print(end-start)                    # 1.2725720405578613 seconds

a = a.to(device)
w1 = w1.to(device)
w2 = w2.to(device)
b1 = b1.to(device)
b2 = b2.to(device)

start = time.time()
for _ in range(100000):
    ans = torch.mm(w2, torch.mm(w1,a)+b1)+b2
end = time.time()
print(end-start)                    # 5.6569812297821045 seconds

I don't know whether I am doing something wrong. How can I change my code to show that the GPU is faster than the CPU at matrix multiplication?

Answer 1

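There could be many reasons:

Your model is very simple.
For GPU computation, there is the cost of transferring data to and from GPU memory.
Your computation runs on a tiny batch of data. With a larger data sample you should see better performance on the GPU than on the CPU.
Don't forget caching: you compute the same operation over and over again, so it may be better to generate a random a tensor for each run.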
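
The batch-size point above can be checked with a small sketch. It assumes the same two-layer shape as in the question; `time_forward`, `hidden`, `n_iters`, and the batch sizes are illustrative choices, not part of the original post. Each measurement uses a fresh random input and calls `torch.cuda.synchronize` before reading the clock, because CUDA kernels are launched asynchronously. No timings are claimed here, since they depend on the hardware.

```python
import time
import torch


def time_forward(device, batch_size, hidden=120, n_iters=100):
    """Return average seconds per forward pass of the two-layer model on `device`."""
    # One column per sample, so batch_size=1 reproduces the shapes in the question.
    a = torch.rand(6, batch_size, device=device)
    w1 = torch.rand(hidden, 6, device=device)
    w2 = torch.rand(1, hidden, device=device)
    b1 = torch.rand(hidden, 1, device=device)
    b2 = torch.rand(1, 1, device=device)

    def sync():
        # GPU work is queued asynchronously; wait for it before reading the clock.
        if device.type == "cuda":
            torch.cuda.synchronize(device)

    # Warm-up runs so one-time initialisation is not counted.
    for _ in range(3):
        torch.mm(w2, torch.mm(w1, a) + b1) + b2
    sync()

    start = time.perf_counter()
    for _ in range(n_iters):
        torch.mm(w2, torch.mm(w1, a) + b1) + b2
    sync()
    return (time.perf_counter() - start) / n_iters


if __name__ == "__main__":
    devices = [torch.device("cpu")]
    if torch.cuda.is_available():
        devices.append(torch.device("cuda:0"))
    for batch_size in (1, 1024, 16384):  # illustrative sizes
        for dev in devices:
            seconds = time_forward(dev, batch_size, n_iters=50)
            print(f"batch={batch_size:6d} device={dev}: {seconds * 1e6:.1f} us per pass")
```

With `batch_size=1` the per-call launch overhead tends to dominate on the GPU; the larger sizes are where the GPU has a chance to pull ahead.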

Here is a thread on the PyTorch forum: https://discuss.pytorch.org/t/cpu-faster-than-gpu/25343

You should also use a better way of timing your code, as explained in this thread: https://discuss.pytorch.org/t/how-to-measure-time-in-pytorch/26964
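
As a hedged illustration of more careful timing, the sketch below uses CUDA events (`torch.cuda.Event` with `enable_timing=True`) on the GPU and `time.perf_counter` on the CPU. The helper name `elapsed_ms` and its parameters are made up for this example. Plain `time.time()` around GPU calls mostly measures how long it takes to queue the kernels, not how long they take to run.

```python
import time
import torch


def elapsed_ms(fn, device, n_iters=100):
    """Return total milliseconds for n_iters calls of fn on `device`."""
    if device.type == "cuda":
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        torch.cuda.synchronize(device)  # drain work queued earlier
        start.record()
        for _ in range(n_iters):
            fn()
        end.record()
        torch.cuda.synchronize(device)  # wait until the `end` event is reached
        return start.elapsed_time(end)  # elapsed_time reports milliseconds
    start = time.perf_counter()
    for _ in range(n_iters):
        fn()
    return (time.perf_counter() - start) * 1000.0


if __name__ == "__main__":
    device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
    w1 = torch.rand(120, 6, device=device)
    a = torch.rand(6, 1, device=device)
    print(f"{device}: {elapsed_ms(lambda: torch.mm(w1, a), device):.3f} ms for 100 calls")
```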

Answer 2

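CPU-to-GPU transfer comes with overhead. You can also observe that the first layer of the model takes a large amount of time compared with the subsequent ones.

This is because the tensors are first transferred from host memory to GPU memory. Only then do the CUDA kernels operate on the tensors in CUDA memory.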
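
To see these costs separately, one can time the host-to-device copy, the first multiplication, and a later multiplication. The following is a rough sketch under stated assumptions: the helper `timed` is hypothetical, the shapes are taken from the question, and the script falls back to the CPU (where `.to(device)` is a no-op) when CUDA is unavailable.

```python
import time
import torch


def timed(fn, device):
    """Run fn once and return (result, seconds), waiting for queued GPU work."""
    if device.type == "cuda":
        torch.cuda.synchronize(device)
    start = time.perf_counter()
    out = fn()
    if device.type == "cuda":
        torch.cuda.synchronize(device)
    return out, time.perf_counter() - start


device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
w1 = torch.rand(120, 6)  # created in host memory
a = torch.rand(6, 1)

# Host-to-device copy of both tensors.
(w1_d, a_d), t_copy = timed(lambda: (w1.to(device), a.to(device)), device)
# The first call may include one-time setup inside the library.
_, t_first = timed(lambda: torch.mm(w1_d, a_d), device)
# A later call with everything already on the device.
_, t_later = timed(lambda: torch.mm(w1_d, a_d), device)

print(f"copy: {t_copy * 1e3:.3f} ms")
print(f"first mm: {t_first * 1e3:.3f} ms")
print(f"later mm: {t_later * 1e3:.3f} ms")
```

Moving the tensors once and reusing them on the device, as the question's code already does, keeps the copy cost out of the loop.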

Answer 1: 6, 2020-10-27 14:45:41
Answer 2: 0, 2020-10-30 20:06:26
