Why is my GPU slower than CPU in matrix operations?

Question

CPU: i7-9750 @2.6GHz (with 16G DDR4 RAM); GPU: Nvidia GeForce GTX 1600 TI (6G); OS: Windows 10 64-bit

I wanted to see how fast the GPU is at basic matrix operations compared with the CPU, and I basically followed this article: https://towardsdatascience.com/heres-how-to-use-cupy-to-make-numpy-700x-faster-4b920dda1f56 . The following is my super simple code:

import numpy as np
import cupy as cp
import time

### Numpy and CPU
s = time.time()
A = np.random.random([10000,10000]); B = np.random.random([10000,10000])
CPU = np.matmul(A,B); CPU *= 5
e = time.time()
print(f'CPU time: {e - s: .2f}')

### CuPy and GPU
s = time.time()
C= cp.random.random([10000,10000]); D = cp.random.random([10000,10000])
GPU = cp.matmul(C,D); GPU *= 5
cp.cuda.Stream.null.synchronize()  
# to let the code finish executing on the GPU before calculating the time
e = time.time()
print(f'GPU time: {e - s: .2f}')

Ironically, it shows CPU time: 11.74, GPU time: 12.56.

This really confuses me. How could the GPU be even slower than the CPU on large matrix operations? Note that I have not even applied parallel computing (I am a beginner and I am not sure whether the system enables it for me or not). I did check similar questions such as "Why is my CPU doing matrix operations faster than GPU instead?", but here I am using cupy rather than mxnet (cupy is newer and designed for GPU computing).

Can someone help? I would really appreciate it!

Answer 1

Both np.random.random and cp.random.random generate 64-bit floats (doubles) by default, and consumer GeForce GPUs are typically far slower at double-precision arithmetic than at single-precision. To let the GPU work in single precision (32-bit floats), change the GPU random number generation like this (for a strictly apples to apples comparison, convert the numpy arrays to float32 as well):

C= cp.random.random([10000,10000], dtype=cp.float32)
D = cp.random.random([10000,10000], dtype=cp.float32)
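
It is worth checking the dtypes directly rather than assuming them. The following is a minimal sketch (NumPy only, so it runs without a GPU) that prints the default dtype of np.random.random and shows two ways to get float32 arrays on the CPU side. The variable names and the tiny 4x4 size are illustrative and not from the original post.

```python
import numpy as np

# Default output of the legacy random API is double precision.
a64 = np.random.random((4, 4))
print(a64.dtype)  # float64

# Option 1: convert an existing array to single precision.
a32 = a64.astype(np.float32)

# Option 2: generate float32 directly with the newer Generator API.
rng = np.random.default_rng(0)
b32 = rng.random((4, 4), dtype=np.float32)
print(a32.dtype, b32.dtype)  # float32 float32

# The product of two float32 arrays stays float32.
product = np.matmul(a32, b32)
print(product.dtype)  # float32
```

Using the same dtype on both sides keeps the CPU and GPU timings comparable.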

I have different hardware (both CPU and GPU) than you, but once this change is made the GPU version is about 12x faster than the CPU version. Generating both random arrays, plus the matrix multiplication and the scalar multiplication, takes less than one second in total using cupy.
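
To compare the two libraries more carefully, one possible approach is to time only the arithmetic, with a warm-up run and a fixed dtype. The sketch below is an assumption-laden illustration, not the method used for the numbers above: time_matmul, its parameters, and the small matrix size are hypothetical names and values chosen for this example. It is written against the NumPy API, so it runs on the CPU as shown.

```python
import time
import numpy as np

def time_matmul(xp, n, dtype, sync=None, repeats=3):
    """Return the best wall-clock time in seconds for an n x n matmul plus a scalar multiply.

    xp   : array module such as numpy (hypothetical parameter name)
    sync : optional callable that blocks until pending device work is done
    """
    # Create the inputs outside the timed region so only the arithmetic is measured.
    a = xp.random.random((n, n)).astype(dtype)
    b = xp.random.random((n, n)).astype(dtype)

    # Warm-up run to absorb one-time start-up costs.
    xp.matmul(a, b)
    if sync is not None:
        sync()

    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        c = xp.matmul(a, b)
        c *= 5
        if sync is not None:
            sync()  # wait for asynchronous work before stopping the clock
        best = min(best, time.perf_counter() - start)
    return best

n = 512  # small so the sketch runs quickly; the question used 10000
for dtype in (np.float32, np.float64):
    print(f"NumPy {np.dtype(dtype).name}: {time_matmul(np, n, dtype):.4f} s")
```

Assuming CuPy is installed and imported as cp, the same function should work on the GPU by calling time_matmul(cp, n, np.float32, sync=cp.cuda.Stream.null.synchronize), since cupy mirrors the numpy calls used here.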

Answer 1: score 5, accepted, 2020-10-18 07:06:34
