CUDA Reduction on Shared Memory with Multiple Arrays

Question

I am currently using the following reduction function to sum all of the elements in an array with CUDA:

__global__ void reduceSum(int *input, int *input2, int *input3, int *outdata, int size){
    extern __shared__ int sdata[];

    unsigned int tID = threadIdx.x;
    unsigned int i = tID + blockIdx.x * (blockDim.x * 2);
    sdata[tID] = input[i] + input[i + blockDim.x];
    __syncthreads();

    for (unsigned int stride = blockDim.x / 2; stride > 32; stride >>= 1)
    {
        if (tID < stride)
        {
            sdata[tID] += sdata[tID + stride];
        }
        __syncthreads();
    }

    if (tID < 32){ warpReduce(sdata, tID); }

    if (tID == 0)
    {
        outdata[blockIdx.x] = sdata[0];
    }
}

However, as you can see from the function parameters, I would like to be able to sum three separate arrays inside the one reduction function. An obvious way to do this would be to launch the kernel three times and pass a different array each time, and that would work fine. This is only a test kernel for now, though: the real kernel will end up taking an array of structs, and I will need to perform an addition over all the X, Y and Z values of each struct, which is why I need to sum them all in one kernel.

I have initialised and allocated memory for all three arrays:

    int test[1000];
    std::fill_n(test, 1000, 1);
    int *d_test;

    int test2[1000];
    std::fill_n(test2, 1000, 2);
    int *d_test2;

    int test3[1000];
    std::fill_n(test3, 1000, 3);
    int *d_test3;

    cudaMalloc((void**)&d_test, 1000 * sizeof(int));
    cudaMalloc((void**)&d_test2, 1000 * sizeof(int));
    cudaMalloc((void**)&d_test3, 1000 * sizeof(int));

I am unsure what grid and block dimensions I should use for this kind of kernel, and I am not entirely sure how to modify the reduction loop to place the data as I want it, i.e. in the output array:

Block 1 Result|Block 2 Result|Block 3 Result|Block 4 Result|Block 5 Result|Block 6 Result|

      Test Array 1 Sums              Test Array 2 Sums            Test Array 3 Sums

I hope that makes sense. Or is there a better way to have only one reduction function but be able to return the summation of Struct.X, Struct.Y or struct.Z?

Here's the struct:

template <typename T>
struct planet {
    T x, y, z;
    T vx, vy, vz;
    T mass;
};

I need to add up all the VX and store it, all the VY and store it, and all the VZ and store it.

Answer 1

Or is there a better way to have only one reduction function but be able to return the summation of Struct.X, Struct.Y or struct.Z?

Usually a principal focus of accelerated computing is speed. Speed (performance) of GPU codes often depends heavily on data storage and access patterns. Therefore, although as you point out in your question we could realize a solution in a number of ways, let's focus on something that should be relatively fast.

Reductions like this don't have much arithmetic intensity, so our focus for performance will mostly revolve around data storage for efficient access. When accessing global memory, GPUs will typically do so in large chunks, of 32 bytes or 128 bytes. To make efficient use of the memory subsystem, we'll want to use all 32 or 128 of those bytes that are requested, on each request.

But the implied data storage pattern of your structure:

template <typename T>
struct planet {
    T x, y, z;
    T vx, vy, vz;
    T mass;
};

pretty much rules this out. For this problem you care about vx, vy, and vz. Those 3 items should be contiguous within a given structure (element), but in an array of those structures, they will be separated by the necessary storage for the other structure items, at least:

planet0:       T x
               T y
               T z               ---------------
               T vx      <--           ^
               T vy      <--           |
               T vz      <--       32-byte read
               T mass                  |
planet1:       T x                     |
               T y                     v
               T z               ---------------
               T vx      <--
               T vy      <--
               T vz      <--
               T mass
planet2:       T x
               T y
               T z
               T vx      <--
               T vy      <--
               T vz      <--
               T mass

(for the sake of example, assuming T is float)
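To put a rough number on the diagram, the sketch below counts how many 32-byte sectors are touched when a field range is read from every record of an array. It is a simplified model on the host, not a description of what any particular GPU does, and the function names `sectorsTouched` and `efficiency` are made up for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Count the distinct sectors touched when reading `len` bytes at byte offset
// `offset` inside each of `n` records spaced `stride` bytes apart.
// Simplified model: every sector that contains a requested byte is fetched once.
std::size_t sectorsTouched(std::size_t n, std::size_t stride,
                           std::size_t offset, std::size_t len,
                           std::size_t sector = 32) {
    assert(sector > 0);
    if (n == 0 || len == 0) return 0;
    const std::size_t totalBytes = (n - 1) * stride + offset + len;
    std::vector<bool> touched((totalBytes + sector - 1) / sector, false);
    for (std::size_t k = 0; k < n; ++k) {
        const std::size_t first = (k * stride + offset) / sector;
        const std::size_t last = (k * stride + offset + len - 1) / sector;
        for (std::size_t s = first; s <= last; ++s) touched[s] = true;
    }
    std::size_t count = 0;
    for (bool t : touched) count += t ? 1 : 0;
    return count;
}

// Fraction of the fetched bytes that the code actually uses.
double efficiency(std::size_t n, std::size_t stride, std::size_t offset,
                  std::size_t len, std::size_t sector = 32) {
    const std::size_t fetched = sectorsTouched(n, stride, offset, len, sector) * sector;
    if (fetched == 0) return 0.0;
    return static_cast<double>(n * len) / static_cast<double>(fetched);
}
```

With T as float, one planet occupies 28 bytes and vx, vy, vz sit at byte offsets 12 to 23. Under this model, `efficiency(32, 28, 12, 12)` comes out at 12/28 (about 43%) because every sector of the array is touched, while 32 consecutive floats in a dedicated array, `efficiency(32, 4, 0, 4)`, use every fetched byte.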

This points out a key drawback of Array of Structures (AoS) storage formats on a GPU. Accessing the same element from consecutive structures is inefficient, due to the access granularity (32 bytes) of the GPU. The usual suggestion for performance in such cases is to convert the AoS storage to SoA (Structure of Arrays):

template <typename T>
struct planets {
    T x[N], y[N], z[N];
    T vx[N], vy[N], vz[N];
    T mass[N];
};

The above is just one possible example, probably not what you would actually use, as the structure would serve little purpose, since we would only have one structure for N planets. The point is, now when I access vx for consecutive planets, the individual vx elements are all adjacent in memory, so a 32-byte read gives me 32 bytes' worth of vx data, with no wasted or unused elements.

With such a transformation, the reduction problem becomes relatively simple again, from the standpoint of code organization. You can use essentially the same code as your single-array reduction, either called 3 times in a row or else with a straightforward extension to the kernel code to handle all 3 arrays independently. A "3-in-1" kernel might look something like this:
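As a minimal host-side sketch of that conversion, assuming the planets start out in a `std::vector` of the question's `planet<T>`: the holder type `VelocitySoA` and the function `toVelocitySoA` are hypothetical names, and the zero padding is an assumption added so that a kernel reading `2 * blockDim.x` elements per block without bounds checks stays inside the arrays.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Array-of-structures element, as in the question.
template <typename T>
struct planet {
    T x, y, z;
    T vx, vy, vz;
    T mass;
};

// Hypothetical structure-of-arrays holder for the velocity components only.
template <typename T>
struct VelocitySoA {
    std::vector<T> vx, vy, vz;
};

// Copy vx, vy, vz out of the AoS input into three contiguous arrays.
// Entries from aos.size() up to paddedSize are zero, which does not change a sum.
template <typename T>
VelocitySoA<T> toVelocitySoA(const std::vector<planet<T>>& aos,
                             std::size_t paddedSize) {
    assert(paddedSize >= aos.size());
    VelocitySoA<T> soa;
    soa.vx.assign(paddedSize, T(0));
    soa.vy.assign(paddedSize, T(0));
    soa.vz.assign(paddedSize, T(0));
    for (std::size_t i = 0; i < aos.size(); ++i) {
        soa.vx[i] = aos[i].vx;
        soa.vy[i] = aos[i].vy;
        soa.vz[i] = aos[i].vz;
    }
    return soa;
}
```

Each of the three vectors can then be copied to the device with its own `cudaMemcpy`, giving the reduction three plain contiguous arrays to work on.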

template <typename T>
__global__ void reduceSum(T *input_vx, T *input_vy, T *input_vz, T *outdata_vx, T *outdata_vy, T *outdata_vz, int size){
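    // Dynamic shared memory is declared as raw bytes and cast to T: an
    // extern __shared__ array of type T would be redeclared with a different
    // type for every instantiation of this template.
    extern __shared__ __align__(sizeof(T)) unsigned char smem[];
    T *sdata = reinterpret_cast<T *>(smem);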

    const int VX = 0;
    const int VY = blockDim.x;
    const int VZ = 2*blockDim.x;

    unsigned int tID = threadIdx.x;
    unsigned int i = tID + blockIdx.x * (blockDim.x * 2);
    sdata[tID+VX] = input_vx[i] + input_vx[i + blockDim.x];
    sdata[tID+VY] = input_vy[i] + input_vy[i + blockDim.x];
    sdata[tID+VZ] = input_vz[i] + input_vz[i + blockDim.x];
    __syncthreads();

    for (unsigned int stride = blockDim.x / 2; stride > 32; stride >>= 1)
    {
        if (tID < stride)
        {
            sdata[tID+VX] += sdata[tID+VX + stride];
            sdata[tID+VY] += sdata[tID+VY + stride];
            sdata[tID+VZ] += sdata[tID+VZ + stride];
        }
        __syncthreads();
    }

    if (tID < 32){ warpReduce(sdata+VX, tID); }
    if (tID < 32){ warpReduce(sdata+VY, tID); }
    if (tID < 32){ warpReduce(sdata+VZ, tID); }

    if (tID == 0)
    {
        outdata_vx[blockIdx.x] = sdata[VX];
        outdata_vy[blockIdx.x] = sdata[VY];
        outdata_vz[blockIdx.x] = sdata[VZ];
    }
}

(coded in browser - not tested - merely an extension of what you have shown as a "reference kernel")

The above AoS -> SoA data transformation will likely have performance benefits elsewhere in your code as well. Since the proposed kernel will handle 3 arrays at once, the grid and block dimensions should be exactly the same as what you would use for your reference kernel in the single-array case. The shared memory allocation per block will need to triple.
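As a sketch of what those launch parameters work out to, the helper below computes them on the host. `LaunchConfig` and `makeReduceConfig` are hypothetical names. The sketch assumes each block consumes `2 * threads` elements, as in the kernel above, and that `warpReduce` (not shown in the question) is the usual unrolled last-warp reduction, which needs a power-of-two block size of at least 64.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical bundle of the values needed for the <<<...>>> launch.
struct LaunchConfig {
    unsigned int blocks;      // grid dimension
    unsigned int threads;     // block dimension
    std::size_t sharedBytes;  // dynamic shared memory per block
    std::size_t paddedSize;   // elements each input array must hold
};

template <typename T>
LaunchConfig makeReduceConfig(std::size_t size, unsigned int threads,
                              unsigned int arrays = 3) {
    // Assumed requirement of the reference kernel: power of two, at least 64.
    assert(threads >= 64 && (threads & (threads - 1)) == 0);
    const std::size_t perBlock = 2u * static_cast<std::size_t>(threads);
    const std::size_t blocks = (size + perBlock - 1) / perBlock;  // round up
    LaunchConfig cfg;
    cfg.blocks = static_cast<unsigned int>(blocks);
    cfg.threads = threads;
    // One T per thread for each array reduced at once (tripled for vx, vy, vz).
    cfg.sharedBytes = static_cast<std::size_t>(arrays) * threads * sizeof(T);
    // The kernel has no bounds check, so inputs are padded (with zeros) to this length.
    cfg.paddedSize = blocks * perBlock;
    return cfg;
}
```

For the 1000-element test arrays and 128 threads per block this gives 4 blocks, a padded length of 1024 and `3 * 128 * sizeof(int)` bytes of shared memory, and the launch would look like `reduceSum<int><<<cfg.blocks, cfg.threads, cfg.sharedBytes>>>(...)`. Each output array then holds one partial sum per block, which still has to be added up in a second pass or on the host.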

Answer 2

Robert Crovella gave an excellent answer that highlights the importance of the AoS -> SoA layout transformation, which often improves performance on the GPU. I'd just like to propose a middle ground that might be more convenient. The CUDA language provides a few vector types for just the purpose you describe (see the built-in vector types section of the CUDA programming guide).

For example, CUDA defines int3, a datatype that stores 3 integers.

 struct int3
 {
    int x; int y; int z;
 };

Similar types exist for floats, chars, doubles, etc. What's nice about these datatypes is that they can be loaded with a single instruction, which may give you a small performance boost. Note that this applies to types whose size is 1, 2, 4, 8 or 16 bytes and which are naturally aligned (float2, float4, int4 and so on); 12-byte types such as int3 and float3 are only 4-byte aligned and are typically loaded one component at a time. See this NVIDIA blog post for a discussion of this. It's also a more "natural" datatype for this case, and it might make other parts of your code easier to work with. You could define, for example:

struct planets {
    float3 position[N];
    float3 velocity[N];
    int mass[N];
};
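On the single-instruction point, the CUDA programming guide states that a global memory access compiles to one instruction only when the type's size is 1, 2, 4, 8 or 16 bytes and the data is naturally aligned. The sketch below checks that rule against two plain C++ stand-ins whose layout mirrors what CUDA documents for float3 and float4. `Float3Like`, `Float4Like` and `fitsSingleAlignedAccess` are hypothetical names, and real CUDA code would use the built-in types from `<vector_types.h>`.

```cpp
#include <cassert>
#include <cstddef>

// Stand-in for float3: three floats, 4-byte alignment, 12 bytes in total.
struct Float3Like { float x, y, z; };

// Stand-in for float4: four floats, declared with 16-byte alignment.
struct alignas(16) Float4Like { float x, y, z, w; };

// True when V satisfies the size and natural-alignment rule quoted above.
template <typename V>
constexpr bool fitsSingleAlignedAccess() {
    return (sizeof(V) == 1 || sizeof(V) == 2 || sizeof(V) == 4 ||
            sizeof(V) == 8 || sizeof(V) == 16) &&
           alignof(V) % sizeof(V) == 0;
}

static_assert(sizeof(Float3Like) == 12, "three packed floats");
static_assert(sizeof(Float4Like) == 16, "four packed floats");
```

Under that rule the 3-component layout does not qualify while the 4-component one does, so if the single-instruction load matters, one option is to store velocities as float4 and leave the fourth component unused, at the cost of a third more memory.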

A reduction kernel that uses this datatype might look something like this (adapted from Robert's).

__inline__ __device__ void SumInt3(int3 const & input1, int3 const & input2, int3 & result)
{
    result.x = input1.x + input2.x;
    result.y = input1.y + input2.y;
    result.z = input1.z + input2.z;
}

__inline__ __device__ void WarpReduceInt3(int3 const & input, int3 & output, unsigned int const tID)
{
    output.x = WarpReduce(input.x, tID);
    output.y = WarpReduce(input.y, tID);
    output.z = WarpReduce(input.z, tID);    
}

__global__ void reduceSum(int3 * inputData, int3 * output, int size){
    extern __shared__ int3 sdata[];

    int3 temp;

    unsigned int tID = threadIdx.x;
    unsigned int i = tID + blockIdx.x * (blockDim.x * 2);

    // Load and sum two integer triplets, store the answer in temp.
    SumInt3(inputData[i], inputData[i + blockDim.x], temp);

    // Write the temporary answer to shared memory.
    sdata[tID] = temp;

    __syncthreads();

    // Reduce to 32 partial sums in sdata[0..31]. The loop runs down to
    // stride == 32 because WarpReduce (not shown) is assumed to return the sum
    // of the values held by the 32 threads of one warp, so sdata[32..63] must
    // already be folded in before it is called.
    for (unsigned int stride = blockDim.x / 2; stride >= 32; stride >>= 1)
    {
        if (tID < stride)
        {
            SumInt3(sdata[tID], sdata[tID + stride], temp);
            sdata[tID] = temp;
        }
        __syncthreads();
    }

    // Sum the intermediate results across a warp.
    // No need to write the answer to shared memory,
    // as only the contribution from tID == 0 will matter.
    if (tID < 32)
    {
        WarpReduceInt3(sdata[tID], temp, tID);
    }

    if (tID == 0)
    {
        output[blockIdx.x] = temp;
    }
}
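Since the kernel writes one int3 per block, a final step has to add those partial results together, and it is worth comparing them against a CPU reference. The sketch below does both on the host. `Int3` is a plain C++ stand-in for CUDA's int3 so that it compiles without the CUDA headers, and `blockPartials` and `sumPartials` are hypothetical helper names.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Plain stand-in for CUDA's int3.
struct Int3 { int x, y, z; };

// CPU reference for what each block should produce: block b sums the
// elements [b * 2 * threads, (b + 1) * 2 * threads), missing ones counting as zero.
std::vector<Int3> blockPartials(const std::vector<Int3>& input,
                                std::size_t threads) {
    assert(threads > 0);
    const std::size_t perBlock = 2 * threads;
    const std::size_t blocks = (input.size() + perBlock - 1) / perBlock;
    std::vector<Int3> partials(blocks, Int3{0, 0, 0});
    for (std::size_t i = 0; i < input.size(); ++i) {
        Int3& p = partials[i / perBlock];
        p.x += input[i].x;
        p.y += input[i].y;
        p.z += input[i].z;
    }
    return partials;
}

// Final step: add up the per-block results copied back from the device.
Int3 sumPartials(const std::vector<Int3>& partials) {
    Int3 total = {0, 0, 0};
    for (const Int3& p : partials) {
        total.x += p.x;
        total.y += p.y;
        total.z += p.z;
    }
    return total;
}
```

For 1000 elements of {1, 2, 3} and 128 threads per block, the reference yields four partial results, and their total is {1000, 2000, 3000}, matching the three test arrays in the question.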
