Incomplete output from printf() called on the device

Question

To test printf() calls on the device, I wrote a simple program that copies a moderately sized array to the device and prints the values of the device array to the screen. Although the array is copied to the device correctly, printf() does not behave as expected: the first several hundred numbers are missing from the output. The array size in the code is 4096. Is this a bug, or am I not using the function properly? Thanks in advance.

EDIT: My GPU is a GeForce GTX 550 Ti, with compute capability 2.1.

My code:

#include<stdio.h>
#include<stdlib.h>
#define N 4096

__global__ void Printcell(float *d_Array , int n){
    int k = 0;

    printf("\n=========== data of d_Array on device==============\n");
    for( k = 0; k < n; k++ ){
        printf("%f  ", d_Array[k]);
        if((k+1)%6 == 0) printf("\n");
    }
    printf("\n\nTotally %d elements has been printed", k);
}

int main(){

    int i =0;

    float Array[N] = {0}, rArray[N] = {0};
    float *d_Array;
    for(i=0;i<N;i++)
        Array[i] = i;


    cudaMalloc((void**)&d_Array, N*sizeof(float));
    cudaMemcpy(d_Array, Array, N*sizeof(float), cudaMemcpyHostToDevice);
    cudaDeviceSynchronize();
    Printcell<<<1,1>>>(d_Array, N);    //Print the device array by a kernel
    cudaDeviceSynchronize();

    /* Copy the device array back to host to see if it was correctly copied */   
    cudaMemcpy(rArray, d_Array, N*sizeof(float), cudaMemcpyDeviceToHost);

    printf("\n\n");

    for(i=0;i<N;i++){
        printf("%f  ", rArray[i]);
        if((i+1)%6 == 0) printf("\n");
    }
}

Answer 1

printf from the device has a limited queue. It's intended for small-scale, debug-style output, not large-scale output.

Referring to the programming guide:

The output buffer for printf() is set to a fixed size before kernel launch (see Associated Host-Side API). It is circular and if more output is produced during kernel execution than can fit in the buffer, older output is overwritten.

Your in-kernel printf output overran the buffer, so the first printed elements were lost (overwritten) before the buffer was dumped into the standard I/O queue.
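To see how much room the device-side printf buffer currently has, the limit can be queried from the host. This is a minimal sketch, assuming a CUDA runtime that exposes cudaDeviceGetLimit with cudaLimitPrintfFifoSize; the value printed depends on the installation, so none is shown here.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    // Ask the runtime for the current size of the device printf FIFO, in bytes.
    size_t fifoBytes = 0;
    cudaError_t err = cudaDeviceGetLimit(&fifoBytes, cudaLimitPrintfFifoSize);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaDeviceGetLimit failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("printf FIFO size: %zu bytes\n", fifoBytes);
    return 0;
}
```

If the kernel's total output is larger than this value, the oldest records are the ones that get overwritten, which matches the missing leading elements in the question.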

The linked documentation also indicates that the buffer size can be increased.
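A sketch of how the buffer might be enlarged before launching the kernel from the question, using cudaDeviceSetLimit with cudaLimitPrintfFifoSize. The 8 MB figure is an assumption chosen for illustration, not a value taken from the documentation or verified as sufficient for 4096 elements; it may need tuning.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define N 4096

__global__ void Printcell(const float *d_Array, int n) {
    for (int k = 0; k < n; k++) {
        printf("%f  ", d_Array[k]);
        if ((k + 1) % 6 == 0) printf("\n");
    }
    printf("\n\nTotal of %d elements printed\n", n);
}

int main() {
    // Assumed size (8 MB); must be set before the kernel that calls printf is launched.
    const size_t fifoBytes = 8 * 1024 * 1024;
    cudaError_t err = cudaDeviceSetLimit(cudaLimitPrintfFifoSize, fifoBytes);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaDeviceSetLimit failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // Fill a host array with 0..N-1 and copy it to the device.
    static float Array[N];
    for (int i = 0; i < N; i++) Array[i] = (float)i;

    float *d_Array = NULL;
    cudaMalloc((void **)&d_Array, N * sizeof(float));
    cudaMemcpy(d_Array, Array, N * sizeof(float), cudaMemcpyHostToDevice);

    Printcell<<<1, 1>>>(d_Array, N);

    // The printf buffer is flushed to stdout at synchronization points such as this one.
    err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        fprintf(stderr, "kernel failed: %s\n", cudaGetErrorString(err));
    }

    cudaFree(d_Array);
    return err == cudaSuccess ? 0 : 1;
}
```

For output of this volume, copying the data back with cudaMemcpy and printing on the host, as the question's code already does, is usually the more dependable route; device printf is better kept for small debug messages.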

Answer 1 (12, accepted) 2013-03-14 23:12:01
