
Is there any way to reduce sum 100M float elements of an array in CUDA?

I'm new to CUDA, so please bear with me if these questions have trivial solutions.

I am trying to find the sum of 100M float elements of an array. From the following code you can see that I've used a reduction kernel and also thrust. I suppose the kernel stores the sum in g_odata[0]. As all the elements are the same in g_idata, the result should be n*g_idata[1]. But you can clearly see the results are incorrect for both of them.

  1. What am I getting wrong? How could I achieve my target?
  2. Every reduction kernel I found is for an integer datatype, e.g. the highly recommended "Optimizing Parallel Reduction in CUDA". Is there any specific reason for that?

Here is my code:

    #include <iostream>
    #include <math.h>
    #include <stdlib.h>
    #include <iomanip>
    #include <thrust/reduce.h>
    #include <thrust/execution_policy.h>


    using namespace std;


    __global__ void reduce(float *g_idata, float *g_odata) {

    __shared__ float sdata[256];


    int i = blockIdx.x*blockDim.x + threadIdx.x;

    sdata[threadIdx.x] = g_idata[i];

    __syncthreads();

    for (int s=1; s < blockDim.x; s *=2)
    {
        int index = 2 * s * threadIdx.x;

        if (index < blockDim.x)
        {
            sdata[index] += sdata[index + s];
        }
        __syncthreads();
    }


    if (threadIdx.x == 0)
        atomicAdd(g_odata,sdata[0]);
    }




    int main(void){

    unsigned int n=pow(10,8);
    float *g_idata, *g_odata;

    cudaMallocManaged(&g_idata, n*sizeof(float));
    cudaMallocManaged(&g_odata, n*sizeof(float));

    int blockSize = 32;
    int numBlocks = (n + blockSize - 1) / blockSize;

    for(int i=0;i<n;i++){g_idata[i]=6.1;g_odata[i]=0;}


    reduce<<<numBlocks, blockSize>>>(g_idata, g_odata);
    cudaDeviceSynchronize();


    cout << g_odata[0] << "\t" << (float)n*g_idata[1] << "\t"<< (float)n*g_idata[1]-g_odata[0]<<endl;

    g_odata[0]=thrust::reduce(thrust::device, g_idata, g_idata+n);

    cout << g_odata[0] << "\t" << (float)n*g_idata[1] << "\t"<< (float)n*g_idata[1]-g_odata[0]<<endl;



    cudaFree(g_idata);
    cudaFree(g_odata);

    }

Result:

6.0129e+08  6.1e+08 8.7097e+06
6.09986e+08 6.1e+08 13824

I am using CUDA 10. nvcc --version:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2018 NVIDIA Corporation
Built on Sat_Aug_25_21:08:01_CDT_2018
Cuda compilation tools, release 10.0, V10.0.130

Details of my GPU (deviceQuery):

./deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "GeForce GTX 750"
  CUDA Driver Version / Runtime Version          10.0 / 10.0
  CUDA Capability Major/Minor version number:    5.0
  Total amount of global memory:                 1999 MBytes (2096168960 bytes)
  ( 4) Multiprocessors, (128) CUDA Cores/MP:     512 CUDA Cores
  GPU Max Clock rate:                            1110 MHz (1.11 GHz)
  Memory Clock rate:                             2505 Mhz
  Memory Bus Width:                              128-bit
  L2 Cache Size:                                 2097152 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
  Maximum Layered 1D Texture Size, (num) layers  1D=(16384), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(16384, 16384), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 1 copy engine(s)
  Run time limit on kernels:                     Yes
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device supports Compute Preemption:            No
  Supports Cooperative Kernel Launch:            No
  Supports MultiDevice Co-op Kernel Launch:      No
  Device PCI Domain ID / Bus ID / location ID:   0 / 1 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 10.0, CUDA Runtime Version = 10.0, NumDevs = 1
Result = PASS

Thanks in advance.

I think the reason you are confused about the results here is a lack of understanding of floating point arithmetic. This whitepaper covers the topic pretty well. As a simple concept to grasp: if I have numbers represented as float quantities and I attempt to do this:

100000000 + 1

the result will be 100000000 (write some code and try it yourself).

This isn't unique to GPUs; CPU code will behave the same way (try it).
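
For example, here is a minimal CPU-only check (an added illustration, nothing CUDA-specific; any C++ compiler will do):

    #include <iostream>

    int main(){
      float big = 100000000.0f;  // 1e8 is exactly representable in float, but its ULP there is 8
      std::cout << (big + 1.0f == big) << std::endl;  // prints 1: the +1 is lost in rounding
      return 0;
    }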

So for very large reductions, we often get to the point where we are adding very large numbers to much, much smaller numbers, and the results aren't accurate from a "pure math" point of view.
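
The same effect shows up at the scale of this question. Here is a CPU-only sketch (for illustration only, not the code under discussion) that also previews the double workaround described next:

    #include <iostream>

    int main(){
      float  fsum = 0.0f;
      double dsum = 0.0;
      for (int i = 0; i < 100000000; i++){
        fsum += 6.1f;  // once fsum grows large enough, adding 6.1f rounds away to nothing
        dsum += 6.1f;  // double still has resolution to spare at this magnitude
      }
      // fsum stalls far below the true value; dsum prints approximately 6.1e+08
      std::cout << fsum << "\t" << dsum << std::endl;
      return 0;
    }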

That is fundamentally the problem here. In your CPU code, when you decide that the correct result should be 6.1*n, that kind of multiplication problem is not subject to the limits of adding large numbers to small ones that I just described, so you get an "accurate" result from it.

One of the ways to prove this, or work around it, is to use double representation instead of float. This doesn't completely eliminate the problem, but it pushes the resolution to the point where it can do a much better job of representing the range of numbers here.

The following code primarily has that change. You can change the typedef to compare the behavior between float and double.

There are a few other changes in the code as well. None of them are the cause of the discrepancy you witnessed.

$ cat t18.cu
    #include <iostream>
    #include <math.h>
    #include <stdlib.h>
    #include <iomanip>
    #include <thrust/reduce.h>
    #include <thrust/execution_policy.h>

    #define BLOCK_SIZE 32
    typedef double ft;
    using namespace std;

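    // atomicAdd(double*, double) requires compute capability 6.0 or higher; the
    // GTX 750 here is cc 5.0, so emulate it with an atomicCAS loop (the standard
    // workaround from the CUDA C Programming Guide)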
    __device__ double my_atomicAdd(double* address, double val)
    {
      unsigned long long int* address_as_ull =
                              (unsigned long long int*)address;
      unsigned long long int old = *address_as_ull, assumed;

      do {
        assumed = old;
        old = atomicCAS(address_as_ull, assumed,
                        __double_as_longlong(val +
                               __longlong_as_double(assumed)));

      // Note: uses integer comparison to avoid hang in case of NaN (since NaN != NaN)
      } while (assumed != old);

      return __longlong_as_double(old);
    }
    __device__ float my_atomicAdd(float* addr, float val){
        return atomicAdd(addr, val);
    }

    __global__ void reduce(ft *g_idata, ft *g_odata, int n) {

    __shared__ ft sdata[BLOCK_SIZE];

    int i = blockIdx.x*blockDim.x + threadIdx.x;

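    // zero-fill threads past the end of the input so a partial last block reduces correctly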
    sdata[threadIdx.x] = (i < n)?g_idata[i]:0;

    __syncthreads();

    for (int s=1; s < blockDim.x; s *=2)
    {
        int index = 2 * s * threadIdx.x;

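        // the bound now includes +s, so sdata[index + s] never reaches past the block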
        if ((index + s) < blockDim.x)
        {
            sdata[index] += sdata[index + s];
        }
        __syncthreads();
    }


    if (threadIdx.x == 0)
        my_atomicAdd(g_odata,sdata[0]);
    }




    int main(void){

    unsigned int n=pow(10,8);

    ft *g_idata, *g_odata;

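    // g_odata is now a single accumulator, not an n-element array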
    cudaMallocManaged(&g_idata, n*sizeof(ft));
    cudaMallocManaged(&g_odata, sizeof(ft));
    cout << "n = " << n << endl;
    int blockSize = BLOCK_SIZE;
    int numBlocks = (n + blockSize - 1) / blockSize;
    g_odata[0] = 0;
    for(int i=0;i<n;i++){g_idata[i]=6.1;}


    reduce<<<numBlocks, blockSize>>>(g_idata, g_odata, n);
    cudaDeviceSynchronize();


    cout << g_odata[0] << "\t" << (float)n*g_idata[1] << "\t"<< (float)n*g_idata[1]-g_odata[0]<<endl;

    g_odata[0]=thrust::reduce(thrust::device, g_idata, g_idata+n);

    cout << g_odata[0] << "\t" << (float)n*g_idata[1] << "\t"<< (float)n*g_idata[1]-g_odata[0]<<endl;



    cudaFree(g_idata);
    cudaFree(g_odata);

    }
$ nvcc -o t18 t18.cu
$ cuda-memcheck ./t18
========= CUDA-MEMCHECK
n = 100000000
6.1e+08 6.1e+08 0.00527966
6.1e+08 6.1e+08 5.13792e-05
========= ERROR SUMMARY: 0 errors
$
