Why is cuFFT so slow?
I'm hoping to accelerate a computer vision application that computes many FFTs using FFTW and OpenMP on an Intel CPU. However, for a variety of FFT problem sizes, I've found that cuFFT is slower than FFTW with OpenMP.

In the experiments and discussion below, I find that cuFFT is slower than FFTW for batched 2D FFTs. Why is cuFFT so slow, and is there anything I can do to make cuFFT run faster?
Our computer vision application requires a forward FFT on a bunch of small planes of size 256x256. I'm running the FFTs on HOG features with a depth of 32, so I use the batch mode to do 32 FFTs per function call. Typically, I do about 8 FFT function calls of size 256x256 with a batch size of 32.
FFTW + OpenMP

The following code executes in 16.0ms on an Intel i7-2600 8-core CPU.
int depth = 32; int nRows = 256; int nCols = 256; int nIter = 8;
int n[2] = {nRows, nCols};
//if nCols is even, cols_padded = (nCols+2). if nCols is odd, cols_padded = (nCols+1)
int cols_padded = 2*(nCols/2 + 1); //allocate this width, but tell FFTW that it's nCols width
int inembed[2] = {nRows, 2*(nCols/2 + 1)};
int onembed[2] = {nRows, (nCols/2 + 1)}; //default -- equivalent to onembed=NULL
float* h_in = (float*)malloc(sizeof(float)*nRows*cols_padded*depth);
memset(h_in, 0, sizeof(float)*nRows*cols_padded*depth);
fftwf_complex* h_freq = reinterpret_cast<fftwf_complex*>(h_in); //in-place version
fftwf_plan forwardPlan = fftwf_plan_many_dft_r2c(2, //rank
n, //dims -- this doesn't include zero-padding
depth, //howmany
h_in, //in
inembed, //inembed
depth, //istride
1, //idist
h_freq, //out
onembed, //onembed
depth, //ostride
1, //odist
FFTW_PATIENT /*flags*/);
double start = read_timer();
#pragma omp parallel for
for(int i=0; i<nIter; i++){
fftwf_execute_dft_r2c(forwardPlan, h_in, h_freq);
}
double responseTime = read_timer() - start;
printf("did %d FFT calls in %f ms \n", nIter, responseTime);
cuFFT
The following code executes in 21.7ms on a top-of-the-line NVIDIA K20 GPU. Note that, even if I use streams, cuFFT does not run multiple FFTs concurrently (a hypothetical sketch of such a streamed variant is included after the code below).
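The CHECK_CUFFT and CHECK_CUDART wrappers used in this code are also not defined in the question; a minimal sketch, assuming they simply print and abort on any non-success status, might be:

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>
#include <cufft.h>

// Hypothetical error-checking wrappers (the originals are not shown in the question).
#define CHECK_CUDART(call) do { \
    cudaError_t err = (call); \
    if (err != cudaSuccess) { \
        fprintf(stderr, "CUDA error %d at %s:%d\n", (int)err, __FILE__, __LINE__); \
        exit(EXIT_FAILURE); \
    } \
} while (0)

#define CHECK_CUFFT(call) do { \
    cufftResult err = (call); \
    if (err != CUFFT_SUCCESS) { \
        fprintf(stderr, "cuFFT error %d at %s:%d\n", (int)err, __FILE__, __LINE__); \
        exit(EXIT_FAILURE); \
    } \
} while (0)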
int depth = 32; int nRows = 256; int nCols = 256; int nIter = 8;
int n[2] = {nRows, nCols};
int cols_padded = 2*(nCols/2 + 1); //allocate this width, but tell cuFFT that it's nCols width
int inembed[2] = {nRows, 2*(nCols/2 + 1)};
int onembed[2] = {nRows, (nCols/2 + 1)}; //default -- equivalent to onembed=NULL in FFTW
cufftHandle forwardPlan;
float* d_in; cufftComplex* d_freq;
CHECK_CUFFT(cufftPlanMany(&forwardPlan,
2, //rank
n, //dimensions = {nRows, nCols}
inembed, //inembed
depth, //istride
1, //idist
onembed, //onembed
depth, //ostride
1, //odist
CUFFT_R2C, //cufftType
depth /*batch*/));
CHECK_CUDART(cudaMalloc(&d_in, sizeof(float)*nRows*cols_padded*depth));
d_freq = reinterpret_cast<cufftComplex*>(d_in);
double start = read_timer();
for(int i=0; i<nIter; i++){
CHECK_CUFFT(cufftExecR2C(forwardPlan, d_in, d_freq));
}
CHECK_CUDART(cudaDeviceSynchronize());
double responseTime = read_timer() - start;
printf("did %d FFT calls in %f ms \n", nIter, responseTime);
Other notes
cudaMemcpys between the CPU and GPU are not included in my computation time.

According to the nvvp profiler, some sizes like 1024x1024 are able to fully saturate the GPU. But, for all of these sizes, the CPU FFTW+OpenMP is faster than cuFFT.

Answer

Question might be outdated, though here is a possible explanation (for the slowness of cuFFT).
When structuring your data for cufftPlanMany, the data arrangement is not very GPU-friendly. Indeed, using an istride and ostride of 32 means no data read is coalesced. See the cuFFT documentation on the advanced data layout for details on the read pattern:

input[b * idist + (x * inembed[1] + y) * istride]
output[b * odist + (x * onembed[1] + y) * ostride]

in which case, if istride/ostride is 32, the accesses are very unlikely to be coalesced/optimal (b is the batch number here). With istride = 32, two elements that are adjacent within a row of one plane sit 32 floats (128 bytes) apart in memory, so consecutive elements fall into different 128-byte memory segments and the reads cannot be combined into wide transactions. Here are the changes I applied:
CHECK_CUFFT(cufftPlanMany(&forwardPlan,
2, //rank
n, //dimensions = {nRows, nCols}
inembed, //inembed
1, // WAS: depth, //istride
nRows*cols_padded, // WAS: 1, //idist
onembed, //onembed
1, // WAS: depth, //ostride
nRows*cols_padded, // WAS:1, //odist
CUFFT_R2C, //cufftType
depth /*batch*/));
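Note that with istride = 1 and idist = nRows*cols_padded, cuFFT now expects each of the 32 planes to be stored contiguously, one plane after another, rather than depth-interleaved as in the original plan, so for the real application the data would presumably have to be repacked accordingly. A hypothetical sketch of the two input indexings (not from the original answer):

#include <cstddef> // for size_t

// Original plan (istride = depth, idist = 1): element (row, col) of plane d is depth-interleaved.
size_t interleaved_idx(int row, int col, int d, int cols_padded, int depth) {
    return ((size_t)row * cols_padded + col) * depth + d;
}
// Modified plan (istride = 1, idist = nRows*cols_padded): planes are stored back-to-back.
size_t planar_idx(int row, int col, int d, int nRows, int cols_padded) {
    return (size_t)d * nRows * cols_padded + (size_t)row * cols_padded + col;
}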
Running this, I encountered an unspecified launch failure because of an illegal memory access. You might want to change the memory allocation (cufftComplex is two floats, so you need an x2 in your allocation size - looks like a typo):
// WAS : CHECK_CUDART(cudaMalloc(&d_in, sizeof(float)*nRows*cols_padded*depth));
CHECK_CUDART(cudaMalloc(&d_in, sizeof(float)*nRows*cols_padded*depth*2));
When running it this way, I got an x8 performance improvement on my card.