
CUDA memory allocation performance

I'm working with image filters on CUDA. Image processing is much faster than it is on the CPU. But the problem is that allocating the image is really slow.

This is how I allocate memory and set the image:

hr = cudaMalloc(&m_device.originalImage, size);
hr = cudaMalloc(&m_device.modifiedImage, size);
hr = cudaMalloc(&m_device.tempImage, size);
hr = cudaMemset(m_device.modifiedImage, 0, size);
hr = cudaMemcpy(m_device.originalImage, host.originalImage, size, cudaMemcpyHostToDevice);

And here is the result of executing the program:

C:\cpu_gpu_filters(GPU)\x64\Release>cpu_gpu_filters test-case.txt
C:\Users\Max\Desktop\test_set\cheshire_cat_1280x720.jpg
Init time: 519 ms
Time spent: 2.35542 ms
C:\Users\Max\Desktop\test_set\cheshire_cat_1366x768.jpg
Init time: 31 ms
Time spent: 2.68595 ms
C:\Users\Max\Desktop\test_set\cheshire_cat_1600x900.jpg
Init time: 44 ms
Time spent: 3.54835 ms
C:\Users\Max\Desktop\test_set\cheshire_cat_1920x1080.jpg
Init time: 61 ms
Time spent: 4.98131 ms
C:\Users\Max\Desktop\test_set\cheshire_cat_2560x1440.jpg
Init time: 107 ms
Time spent: 9.0727 ms
C:\Users\Max\Desktop\test_set\cheshire_cat_3840x2160.jpg
Init time: 355 ms
Time spent: 20.1453 ms
C:\Users\Max\Desktop\test_set\cheshire_cat_5120x2880.jpg
Init time: 449 ms
Time spent: 35.815 ms
C:\Users\Max\Desktop\test_set\cheshire_cat_7680x4320.jpg
Init time: 908 ms
Time spent: 75.4647 ms

UPD: Code with the time measurement:

start = high_resolution_clock::now();
Initialize();
stop = high_resolution_clock::now();
long long ms = duration_cast<milliseconds>(stop - start).count();
long long us = duration_cast<microseconds>(stop - start).count();
cout << "Init time: " << ms << " ms" << endl;


start = high_resolution_clock::now();
GpuTimer gpuTimer;
gpuTimer.Start();
RunGaussianBlurKernel(
    m_device.modifiedImage,
    m_device.tempImage,
    m_device.originalImage, 
    m_device.filter,
    m_filter.width,
    m_host.originalImage.rows, 
    m_host.originalImage.cols
    );
gpuTimer.Stop();

The first image is the smallest, but initialization takes 519 ms. Maybe that is because the driver has to be loaded, or something similar. Then, as the image size increases, the initialization time increases as well. That looks logical, but I'm still not sure the initialization process should be that slow. Am I doing something wrong?

In your initialization code, you have a cudaMemset whose execution time depends on size. There is also the cudaMemcpy, whose execution time is approximately the copy size in bytes divided by the PCI-Express bandwidth. It is very likely that this part is responsible for the increase in init time. Running it through Nsight will give you more precise figures on execution time. However, without an MCVE, it is hard to answer for sure.


 