CUFFT wrong result for batch 1D ifft

I am new to CUDA and cuFFT. When I tried to recover the original data from the result of cufftExecR2C(...) by applying the corresponding cufftExecC2R(...), it went wrong: the recovered data and the original data are not identical.

Here is the code. I am using CUDA 9.0.

#include "device_launch_parameters.h"
#include "cuda_runtime.h"
#include "cuda.h"
#include "cufft.h"

#include <iostream>
#include <sys/time.h>
#include <cstdio>
#include <cmath>

using namespace std;

// cuda error check
#define gpuErrchk(ans) {gpuAssrt((ans), __FILE__, __LINE__);}
inline void gpuAssrt(cudaError_t code, const char* file, int line, bool abort=true) {
    if (code != cudaSuccess) {
        fprintf(stderr, "GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
        if (abort) {
            exit(code);
        }
    }
}

// ifft scale for cufft
__global__ void IFFTScale(int scale_, cufftReal* real) {
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    real[idx] *= 1.0 / scale_;
}

void batch_1d_irfft2_test() {
    const int BATCH = 3;
    const int DATASIZE = 4;

    /// RFFT
    // --- Host side input data allocation and initialization
    cufftReal *hostInputData = (cufftReal*)malloc(DATASIZE*BATCH*sizeof(cufftReal));
    for (int i = 0; i < BATCH; ++ i) {
        for (int j = 0; j < DATASIZE; ++ j) {
            hostInputData[i * DATASIZE + j] = (cufftReal)(i * DATASIZE  + j + 1);
        }
    }

    // DEBUG:print host input data
    cout << "print host input data" << endl;
    for (int i = 0; i < BATCH; ++ i) {
        for (int j = 0; j < DATASIZE; ++ j) {
            cout << hostInputData[i * DATASIZE + j] << ", ";
        }
        cout << endl;
    }
    cout << "=====================================================" << endl;

    // --- Device side input data allocation and initialization
    cufftReal *deviceInputData; 
    gpuErrchk(cudaMalloc((void**)&deviceInputData, DATASIZE * BATCH * sizeof(cufftReal)));

    // --- Device side output data allocation
    cufftComplex *deviceOutputData; 
    gpuErrchk(cudaMalloc((void**)&deviceOutputData, 
                (DATASIZE / 2 + 1) * BATCH * sizeof(cufftComplex)));

    // Host side input data copied to Device side 
    gpuErrchk(cudaMemcpy(deviceInputData, hostInputData, 
            DATASIZE * BATCH * sizeof(cufftReal), 
            cudaMemcpyHostToDevice));

    // --- Batched 1D FFTs
    cufftHandle handle;
    int rank = 1;                           // --- 1D FFTs
    int n[] = {DATASIZE};                 // --- Size of the Fourier transform
    int istride = 1, ostride = 1;           // --- Distance between two successive input/output elements
    int idist = DATASIZE, odist = DATASIZE / 2 + 1; // --- Distance between batches
    int inembed[] = { 0 };                  // --- Input size with pitch (ignored for 1D transforms)
    int onembed[] = { 0 };                  // --- Output size with pitch (ignored for 1D transforms)
    int batch = BATCH;                      // --- Number of batched executions
    cufftPlanMany(&handle, rank, n, 
            inembed, istride, idist, 
            onembed, ostride, odist, 
            CUFFT_R2C, batch);
    cufftExecR2C(handle, deviceInputData, deviceOutputData);

    // **************************************************************************
    /// IRFFT
    cufftReal *deviceOutputDataIFFT; 
    gpuErrchk(cudaMalloc((void**)&deviceOutputDataIFFT, DATASIZE * BATCH * sizeof(cufftReal)));

    // --- Batched 1D IFFTs
    cufftHandle handleIFFT;
    int n_ifft[] = {DATASIZE / 2 + 1};                 // --- Size of the Fourier transform
    idist = DATASIZE / 2 + 1; odist = DATASIZE; // --- Distance between batches
    cufftPlanMany(&handleIFFT, rank, n_ifft, 
            inembed, istride, idist, 
            onembed, ostride, odist, 
            CUFFT_C2R, batch);
    cufftExecC2R(handleIFFT, deviceOutputData, deviceOutputDataIFFT);

    /* scale
    // dim3 dimGrid(512);
    // dim3 dimBlock(max((BATCH * DATASIZE + 512  - 1) / 512, 1));
    // IFFTScale<<<dimGrid, dimBlock>>>((DATASIZE - 1) * 2, deviceOutputData);
    */

    // host output data for ifft
    cufftReal *hostOutputDataIFFT = (cufftReal*)malloc(DATASIZE*BATCH*sizeof(cufftReal));
    gpuErrchk(cudaMemcpy(hostOutputDataIFFT, deviceOutputDataIFFT, 
            DATASIZE * BATCH * sizeof(cufftReal), 
            cudaMemcpyDeviceToHost));

    // print IFFT recovered host output data
    cout << "print host output IFFT data" << endl;
    for (int i=0; i<BATCH; i++) {
        for (int j=0; j<DATASIZE; j++) {
            cout << hostOutputDataIFFT[i * DATASIZE + j] << ", ";
        }
        cout << endl;
    }
}

int main() {
    batch_1d_irfft2_test();
    return 0;
}

I compile the 'rfft_test.cu' file with nvcc -o rfft_test rfft_test.cu -lcufft. The result is as below:

print host input data
1, 2, 3, 4, 
5, 6, 7, 8, 
9, 10, 11, 12, 
print IFFT recovered host output data
6, 8.5359, 15.4641, 0, 
22, 24.5359, 31.4641, 0, 
38, 40.5359, 47.4641, 0, 

Specifically, I checked the scaling issue for cufftExecC2R(...) and commented out the IFFTScale() kernel. I therefore expected the recovered output to be DATASIZE * input_batched_1d_data, but even so, the result is not as expected.

I have checked the cuFFT manual and my code several times, and I also searched the Nvidia forums and StackOverflow answers, but I didn't find a solution. Any help is greatly appreciated. Thanks in advance.

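The size of your inverse transform is incorrect: it should be DATASIZE, not DATASIZE/2+1 (that is, int n_ifft[] = {DATASIZE};). The transform size passed to the plan is the length of the real signal, not the number of complex elements.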

The following section of the cuFFT docs should help:

"In C2R mode an input array (x_1, x_2, ..., x_{⌊N/2⌋+1}) of only non-redundant complex elements is required." Here N is the transform size you pass to the plan function.
