Why is cuFFT so slow?
I'm hoping to accelerate a computer vision application that computes many FFTs using FFTW and OpenMP on an Intel CPU. However, for a variety of FFT problem sizes, I've found that cuFFT is slower than FFTW with OpenMP.

In the experiments and discussion below, I find that cuFFT is slower than FFTW for batched 2D FFTs. Why is cuFFT so slow, and is there anything I can do to make cuFFT run faster?
Our computer vision application requires a forward FFT on a bunch of small planes of size 256x256. I'm running the FFTs on HOG features with a depth of 32, so I use the batch mode to do 32 FFTs per function call. Typically, I do about 8 FFT function calls of size 256x256 with a batch size of 32.
FFTW + OpenMP

The following code executes in 16.0ms on an Intel i7-2600 8-core CPU.
int depth = 32; int nRows = 256; int nCols = 256; int nIter = 8;
int n[2] = {nRows, nCols};
//if nCols is even, cols_padded = (nCols+2). if nCols is odd, cols_padded = (nCols+1)
int cols_padded = 2*(nCols/2 + 1); //allocate this width, but tell FFTW that it's nCols width
int inembed[2] = {nRows, 2*(nCols/2 + 1)};
int onembed[2] = {nRows, (nCols/2 + 1)}; //default -- equivalent to onembed=NULL
float* h_in = (float*)malloc(sizeof(float)*nRows*cols_padded*depth);
memset(h_in, 0, sizeof(float)*nRows*cols_padded*depth);
fftwf_complex* h_freq = reinterpret_cast<fftwf_complex*>(h_in); //in-place version
fftwf_plan forwardPlan = fftwf_plan_many_dft_r2c(2, //rank
n, //dims -- this doesn't include zero-padding
depth, //howmany
h_in, //in
inembed, //inembed
depth, //istride
1, //idist
h_freq, //out
onembed, //onembed
depth, //ostride
1, //odist
FFTW_PATIENT /*flags*/);
double start = read_timer();
#pragma omp parallel for
for(int i=0; i<nIter; i++){
fftwf_execute_dft_r2c(forwardPlan, h_in, h_freq);
}
double responseTime = read_timer() - start;
printf("did %d FFT calls in %f ms \n", nIter, responseTime);
cuFFT
The following code executes in 21.7ms on a top-of-the-line NVIDIA K20 GPU. Note that, even if I use streams, cuFFT does not run multiple FFTs concurrently (a hypothetical sketch of such a streamed variant is included after the code below).
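The CHECK_CUFFT and CHECK_CUDART wrappers used in this code are also not defined in the question; a minimal sketch, assuming they simply print and abort on any non-success status, might be:

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>
#include <cufft.h>

// Hypothetical error-checking wrappers (the originals are not shown in the question).
#define CHECK_CUDART(call) do { \
    cudaError_t err = (call); \
    if (err != cudaSuccess) { \
        fprintf(stderr, "CUDA error %d at %s:%d\n", (int)err, __FILE__, __LINE__); \
        exit(EXIT_FAILURE); \
    } \
} while (0)

#define CHECK_CUFFT(call) do { \
    cufftResult err = (call); \
    if (err != CUFFT_SUCCESS) { \
        fprintf(stderr, "cuFFT error %d at %s:%d\n", (int)err, __FILE__, __LINE__); \
        exit(EXIT_FAILURE); \
    } \
} while (0)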
int depth = 32; int nRows = 256; int nCols = 256; int nIter = 8;
int n[2] = {nRows, nCols};
int cols_padded = 2*(nCols/2 + 1); //allocate this width, but tell cuFFT that it's nCols width
int inembed[2] = {nRows, 2*(nCols/2 + 1)};
int onembed[2] = {nRows, (nCols/2 + 1)}; //default -- equivalent to onembed=NULL in FFTW
cufftHandle forwardPlan;
float* d_in; cufftComplex* d_freq;
CHECK_CUFFT(cufftPlanMany(&forwardPlan,
2, //rank
n, //dimensions = {nRows, nCols}
inembed, //inembed
depth, //istride
1, //idist
onembed, //onembed
depth, //ostride
1, //odist
CUFFT_R2C, //cufftType
depth /*batch*/));
CHECK_CUDART(cudaMalloc(&d_in, sizeof(float)*nRows*cols_padded*depth));
d_freq = reinterpret_cast<cufftComplex*>(d_in);
double start = read_timer();
for(int i=0; i<nIter; i++){
CHECK_CUFFT(cufftExecR2C(forwardPlan, d_in, d_freq));
}
CHECK_CUDART(cudaDeviceSynchronize());
double responseTime = read_timer() - start;
printf("did %d FFT calls in %f ms \n", nIter, responseTime);
Other notes
cudaMemcpys between the CPU and GPU are not included in my computation time.

According to the nvvp profiler, some sizes like 1024x1024 are able to fully saturate the GPU. But, for all of these sizes, the CPU FFTW+OpenMP is faster than cuFFT.

Answer

Question might be outdated, though here is a possible explanation (for the slowness of cuFFT).
When structuring your data for cufftPlanMany, the data arrangement is not very GPU-friendly. Indeed, using an istride and ostride of 32 means no data read is coalesced. See the cuFFT documentation on the advanced data layout for details on the read pattern:

input[b * idist + (x * inembed[1] + y) * istride]
output[b * odist + (x * onembed[1] + y) * ostride]

in which case, if istride/ostride is 32, the accesses are very unlikely to be coalesced/optimal (b is the batch number here). With istride = 32, two elements that are adjacent within a row of one plane sit 32 floats (128 bytes) apart in memory, so consecutive elements fall into different 128-byte memory segments and the reads cannot be combined into wide transactions. Here are the changes I applied:
CHECK_CUFFT(cufftPlanMany(&forwardPlan,
2, //rank
n, //dimensions = {nRows, nCols}
inembed, //inembed
1, // WAS: depth, //istride
nRows*cols_padded, // WAS: 1, //idist
onembed, //onembed
1, // WAS: depth, //ostride
nRows*cols_padded, // WAS:1, //odist
CUFFT_R2C, //cufftType
depth /*batch*/));
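Note that with istride = 1 and idist = nRows*cols_padded, cuFFT now expects each of the 32 planes to be stored contiguously, one plane after another, rather than depth-interleaved as in the original plan, so for the real application the data would presumably have to be repacked accordingly. A hypothetical sketch of the two input indexings (not from the original answer):

#include <cstddef> // for size_t

// Original plan (istride = depth, idist = 1): element (row, col) of plane d is depth-interleaved.
size_t interleaved_idx(int row, int col, int d, int cols_padded, int depth) {
    return ((size_t)row * cols_padded + col) * depth + d;
}
// Modified plan (istride = 1, idist = nRows*cols_padded): planes are stored back-to-back.
size_t planar_idx(int row, int col, int d, int nRows, int cols_padded) {
    return (size_t)d * nRows * cols_padded + (size_t)row * cols_padded + col;
}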
Running this, I encountered an unspecified launch failure because of an illegal memory access. You might want to change the memory allocation (cufftComplex is two floats, so you need an x2 in your allocation size - looks like a typo):
// WAS : CHECK_CUDART(cudaMalloc(&d_in, sizeof(float)*nRows*cols_padded*depth));
CHECK_CUDART(cudaMalloc(&d_in, sizeof(float)*nRows*cols_padded*depth*2));
When running it this way, I got an x8 performance improvement on my card.