
Parallel dimension reduction (3D to 2D with sum) using CUDA

In a CUDA application, I have an N x N x D matrix that I want to reduce to N x D by summing over the entire first (or second) axis. How do I do this most efficiently?

Typically, N is greater than 10000 and D is 2 or 3.

A quick and naive solution using atomicAdd would be the following:

namespace kernel {
    __global__ void sumNND(float* devPtrIn, float* devPtrOut, const int N, const int D) {
        int index = blockIdx.x * blockDim.x + threadIdx.x;
        int stride = blockDim.x * gridDim.x;

        for (int id = index; id < N * N * D; id += stride) {
            const unsigned int d = id % D;
            const unsigned int i = (id - d) / D;
            const unsigned int n = i / N;
            const unsigned int m = i % N;

            atomicAdd(&devPtrOut[d + D * n], devPtrIn[d + D * n + N * m]);
        }
    }
}

void sumNND(const int numBlocks, const int blockSize, float* devPtrIn, float* devPtrOut, const int N, const int D) {
    HANDLE_ERROR(cudaMemset(devPtrOut, 0, N * D * sizeof(float)));
    kernel::sumNND<<<numBlocks, blockSize>>>(devPtrIn, devPtrOut, N, D);
    HANDLE_ERROR(cudaDeviceSynchronize());
}

where sumNND is being called with loopSize = N * N * D, blockSize = 768 and numBlocks = (loopSize + blockSize - 1) / blockSize.
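In host code that amounts to the following (a sketch; devPtrIn and devPtrOut are assumed to be existing device allocations of N * N * D and N * D floats):

const int loopSize = N * N * D;
const int blockSize = 768;
const int numBlocks = (loopSize + blockSize - 1) / blockSize;
sumNND(numBlocks, blockSize, devPtrIn, devPtrOut, N, D);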

This is (not surprisingly) a bottleneck in my timeline, but I can't figure out how to effectively parallelize the dimension reduction. Any pointers?

The first two optimization priorities for any CUDA programmer are:

  1. Use lots of threads
  2. Use memory efficiently

For your problem, you'll have no trouble with the first one - it readily decomposes into a set of independent problems that can be assigned to a lot of parallel threads. The second priority, then, is where you want to focus. With respect to global memory, this means we should strive for coalesced access whenever possible. We should pay particular attention to the reads.

I'll need to make some assumptions. I'll assume that your organization of dimensions is ROW, COLUMN, DEPTH and that your data is stored in the usual C-style, i.e. row-major storage. With those assumptions, the request (summing over the entire first (or second) axis) is effectively summing over an entire row or summing over an entire column. If you do a bit of searching on the cuda tag, you'll find worked examples of both (here is one such example). Although they don't necessarily all cover the 3D case, they should provide a pretty good roadmap. What you'll discover is that these two cases should be handled differently, with an eye towards coalesced global memory access, i.e. the optimization priority already mentioned. The row direction is also the coalescing direction, so if we need to sum rows, then we'll need to use a classical parallel reduction technique, so that we can read rows in a coalesced fashion and sum the elements together. If we need to sum columns, the efficient kernel is much easier to write: each thread can be responsible for a column, and can just keep a running sum in a for-loop.

In your case, you appear to be summing columns (but see note below). What follows is a worked example, comparing your approach to the faster running-column-sum method with coalesced access (adjacent threads reading adjacent elements in memory):

$ cat t1263.cu
#include <stdlib.h>
#include <stdio.h>
#include <math.h>

const int my_N = 10000;
const int my_D = 3;
const int my_blockSize = 768;
const int my_loopSize = my_N*my_N*my_D;
const int my_numBlocks = (my_loopSize + my_blockSize -1)/my_blockSize;
const int bsize = 512;
const float TOL = 0.1f;

#define HANDLE_ERROR(x) x

#define cudaCheckErrors(msg) \
    do { \
        cudaError_t __err = cudaGetLastError(); \
        if (__err != cudaSuccess) { \
            fprintf(stderr, "Fatal error: %s (%s at %s:%d)\n", \
                msg, cudaGetErrorString(__err), \
                __FILE__, __LINE__); \
            fprintf(stderr, "*** FAILED - ABORTING\n"); \
            exit(1); \
        } \
    } while (0)

#include <time.h>
#include <sys/time.h>
#define USECPSEC 1000000ULL

long long dtime_usec(unsigned long long start){

  timeval tv;
  gettimeofday(&tv, 0);
  return ((tv.tv_sec*USECPSEC)+tv.tv_usec)-start;
}

namespace kernel {
    __global__ void sumNND(float* devPtrIn, float* devPtrOut, const int N, const int D) {
        int index = blockIdx.x * blockDim.x + threadIdx.x;
        int stride = blockDim.x * gridDim.x;

        for (int id = index; id < N * N * D; id += stride) {
            const unsigned int d = id % D;
            const unsigned int i = (id - d) / D;
            const unsigned int n = i / N;
            const unsigned int m = i % N;

            atomicAdd(&devPtrOut[d + D * n], devPtrIn[d + D * n + N * m]);
        }
    }
}

void sumNND(const int numBlocks, const int blockSize, float* devPtrIn, float* devPtrOut, const int N, const int D) {
    HANDLE_ERROR(cudaMemset(devPtrOut, 0, N * D * sizeof(float)));
    kernel::sumNND<<<numBlocks, blockSize>>>(devPtrIn, devPtrOut, N, D);
    HANDLE_ERROR(cudaDeviceSynchronize());
}

// kernel assumes 1 block assigned per row, use block-striding methodology
// assumes block size is a power of 2
__global__ void sum_rows_NND(const float * __restrict__  devPtrIn, float * __restrict__  devPtrOut, const int N, const int D) {
  __shared__ float sdata[bsize];
  sdata[threadIdx.x] = 0;
  for (int i = threadIdx.x; i < N; i += blockDim.x) // block-stride
    sdata[threadIdx.x] += devPtrIn[(blockIdx.x * N) + i];
  __syncthreads();
  for (int i = blockDim.x>>1; i > 0; i>>=1){
    if (threadIdx.x < i) sdata[threadIdx.x] += sdata[threadIdx.x+i];
    __syncthreads();}
  if (!threadIdx.x) devPtrOut[blockIdx.x] = sdata[0];
}



// kernel assumes one thread assigned per column sum
// launch N threads
 __global__ void sum_cols_NND(const float * __restrict__  devPtrIn, float * __restrict__  devPtrOut, const int N, const int D) {
  int idx = threadIdx.x+blockDim.x*blockIdx.x;
  int ido = idx;
  if (idx < N){
    for (int j = 0; j < D; j++){
      float temp = 0;
      for (int i = 0; i < N; i++) temp += devPtrIn[idx + (i*N)];
      devPtrOut[ido] = temp;
      ido += N;
      idx += N*N;}}
}

int main(){

  float *h_data, *d_data, *h_res1, *h_res2, *d_res;

  h_data = new float[my_loopSize];
  cudaMalloc(&d_data, my_loopSize*sizeof(d_data[0]));
  h_res1 = new float[my_N*my_D];
  h_res2 = new float[my_N*my_D];
  cudaMalloc(&d_res, my_N*my_D*sizeof(d_res[0]));
  for (int i = 0; i < my_loopSize; i++) h_data[i] = rand()/(float)RAND_MAX;
  cudaCheckErrors("CUDA failure");
  cudaMemcpy(d_data, h_data, my_loopSize*sizeof(d_data[0]), cudaMemcpyHostToDevice);
  // test original approach
  cudaMemset(d_res, 0, my_N*my_D*sizeof(d_res[0]));
  unsigned long long dt1 = dtime_usec(0);
  kernel::sumNND<<<my_numBlocks, my_blockSize>>>(d_data, d_res, my_N, my_D);
  cudaDeviceSynchronize();
  dt1 = dtime_usec(dt1);
  cudaMemcpy(h_res1, d_res, my_N*my_D*sizeof(d_res[0]), cudaMemcpyDeviceToHost);

  //test columnwise reduction
  unsigned long long dt2 = dtime_usec(0);
  //sum_rows_NND<<<my_N*my_D, bsize>>>(d_data, d_res, my_N, my_D);
  sum_cols_NND<<<(my_N + bsize -1)/bsize, bsize>>>(d_data, d_res, my_N, my_D);
  cudaDeviceSynchronize();
  dt2 = dtime_usec(dt2);
  cudaMemcpy(h_res2, d_res, my_N*my_D*sizeof(d_res[0]), cudaMemcpyDeviceToHost);

  // validate results
  for (int i = 0; i < my_N; i++)
    if (fabsf(h_res1[i] - h_res2[i]) > TOL) {printf("mismatch at %d, was %f, should be %f\n", i, h_res2[i], h_res1[i]); return -1;}
  cudaCheckErrors("program error");

  printf("results match,  kernel 1 time: %fs, kernel 2 time: %fs\n", dt1/(float)USECPSEC, dt2/(float)USECPSEC);
  // time row reduction kernel
  unsigned long long dt3 = dtime_usec(0);
  sum_rows_NND<<<my_N*my_D, bsize>>>(d_data, d_res, my_N, my_D);
  cudaDeviceSynchronize();
  dt3 = dtime_usec(dt3);
  printf("row reduction kernel time: %fs\n", dt3/(float)USECPSEC);
  cudaCheckErrors("program error");
}
$ nvcc -arch=sm_52 -o t1263 t1263.cu
$ ./t1263
results match,  kernel 1 time: 0.459971s, kernel 2 time: 0.013678s
row reduction kernel time: 0.013724s
$

Notes:

  1. The optimized kernel is around 30x faster than your naive atomics kernel. I suspect that a big chunk of this is not actually the use of atomics, but the uncoalesced access. Global atomics on newer GPUs can be pretty fast.

  2. The first "page" (NxN) of elements column sum match between my kernel and your kernel (ie the first N results match). 我的内核和您的内核之间元素列总和的第一个“页面”(NxN)匹配(即,前N个结果匹配)。 After the first page (first N results), our results differ. 第一页之后(前N个结果),我们的结果有所不同。 I'm pretty sure my indexing is correct, but after spending a while trying to unravel your indexing, I gave up. 我很确定我的索引编制是正确的,但是花了一段时间尝试弄清您的索引编制之后,我放弃了。 I suspect you have a bug in your kernel indexing, if you are trying to sum columns, and all the aforementioned assumptions are true. 如果您尝试对列求和,我怀疑您的内核索引中有一个错误,并且所有上述假设都是正确的。

  3. I also included a timing measurement of the row-summing kernel, which looks quite different, but produces almost the same timing. This is to be expected, since optimal kernels for these types of problems will be limited by memory bandwidth, which is the same in both cases. Optimal kernels will load all the data, once, in a coalesced fashion. After that, the row-sum vs. column-sum mechanics have relatively little effect on the kernel time. (A back-of-the-envelope check of the bandwidth claim follows this list.)

  4. With a small modification to the initialization of the data, I think it's fairly easy to prove that your kernel is not creating the correct indexing and therefore not producing the correct column sums after the first "page" (i.e. after the first N results). After a little more study of your indexing, I have some idea of what is going wrong. One example problem is that for N not divisible by D, your kernel's d variable will not reset to zero after the first "page", but this is not the only issue.
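As a rough check on note 3 (back-of-the-envelope arithmetic using only the numbers above): the input is 10000 x 10000 x 3 floats, about 1.2 GB. Reading it once in ~0.0137 s works out to roughly 88 GB/s, which is in the neighborhood of the deliverable memory bandwidth of an sm_52-class GPU. That is consistent with the claim that a kernel which loads each element exactly once, coalesced, is about as fast as this reduction can go.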

Pursuant to item 4, here's a version of the code with modified data initialization, and a full test of all N * D results. The data initialization is such that the first column of the first page will be all zero, the next column all 1, the next column all 2, etc. On the second page, we increment everything by 1, so the first column will be all 1, the second column all 2, etc. Therefore it should be easy to agree on what the column sums should be. For the first page, the column sums should be 0, 10000, 20000, etc. For the second page they should be 10000, 20000, 30000, etc. On the first column of the second page, my code produces 10000, your code produces 1. With your changed indexing in the comments, I produce 0 for the first column of the first page, and your code produces 9999. 1 and 9999 cannot possibly be valid column sums given the data initialization I described:

$ cat t1263.cu
#include <stdlib.h>
#include <stdio.h>
#include <math.h>

const int my_N = 10000;
const int my_D = 3;
const int my_blockSize = 768;
const int my_loopSize = my_N*my_N*my_D;
const int my_numBlocks = (my_loopSize + my_blockSize -1)/my_blockSize;
const int bsize = 512;
const float TOL = 0.1f;

#define HANDLE_ERROR(x) x

#define cudaCheckErrors(msg) \
    do { \
        cudaError_t __err = cudaGetLastError(); \
        if (__err != cudaSuccess) { \
            fprintf(stderr, "Fatal error: %s (%s at %s:%d)\n", \
                msg, cudaGetErrorString(__err), \
                __FILE__, __LINE__); \
            fprintf(stderr, "*** FAILED - ABORTING\n"); \
            exit(1); \
        } \
    } while (0)

#include <time.h>
#include <sys/time.h>
#define USECPSEC 1000000ULL

long long dtime_usec(unsigned long long start){

  timeval tv;
  gettimeofday(&tv, 0);
  return ((tv.tv_sec*USECPSEC)+tv.tv_usec)-start;
}

namespace kernel {
    __global__ void sumNND(float* devPtrIn, float* devPtrOut, const int N, const int D) {
        int index = blockIdx.x * blockDim.x + threadIdx.x;
        int stride = blockDim.x * gridDim.x;

        for (int id = index; id < N * N * D; id += stride) {
            const unsigned int d = id % D;       // 0 1 2 0 1 2 0 1 2
            const unsigned int i = (id - d) / D; // 0 0 0 1 1 1 2 2 2
            const unsigned int n = i / N;        // 0 0 0 0 0 0 0 0 0
            const unsigned int m = i % N;        // 0 0 0 1 1 1 2 2 2

            atomicAdd(&devPtrOut[d + D * n],    //  0 1 2 0 1 2 0 1 2
              devPtrIn[d + D * n + N * m]);     //  0 1 2 0+N 1+N 2+N 0+2N 1+2N 2+2N
        }
    }
}

void sumNND(const int numBlocks, const int blockSize, float* devPtrIn, float* devPtrOut, const int N, const int D) {
    HANDLE_ERROR(cudaMemset(devPtrOut, 0, N * D * sizeof(float)));
    kernel::sumNND<<<numBlocks, blockSize>>>(devPtrIn, devPtrOut, N, D);
    HANDLE_ERROR(cudaDeviceSynchronize());
}

// kernel assumes 1 block assigned per row, use block-striding methodology
// assumes block size is a power of 2
__global__ void sum_rows_NND(const float * __restrict__  devPtrIn, float * __restrict__  devPtrOut, const int N, const int D) {
  __shared__ float sdata[bsize];
  sdata[threadIdx.x] = 0;
  for (int i = threadIdx.x; i < N; i += blockDim.x) // block-stride
    sdata[threadIdx.x] += devPtrIn[(blockIdx.x * N) + i];
  __syncthreads();
  for (int i = blockDim.x>>1; i > 0; i>>=1){
    if (threadIdx.x < i) sdata[threadIdx.x] += sdata[threadIdx.x+i];
    __syncthreads();}
  if (!threadIdx.x) devPtrOut[blockIdx.x] = sdata[0];
}



// kernel assumes one thread assigned per column sum
// launch N threads
 __global__ void sum_cols_NND(const float * __restrict__  devPtrIn, float * __restrict__  devPtrOut, const int N, const int D) {
  int idx = threadIdx.x+blockDim.x*blockIdx.x;
  int ido = idx;
  if (idx < N){
    for (int j = 0; j < D; j++){
      float temp = 0;
      for (int i = 0; i < N; i++) temp += devPtrIn[idx + (i*N)];
      devPtrOut[ido] = temp;
      ido += N;
      idx += N*N;}}
}

int main(){

  float *h_data, *d_data, *h_res1, *h_res2, *d_res;

  h_data = new float[my_loopSize];
  cudaMalloc(&d_data, my_loopSize*sizeof(d_data[0]));
  h_res1 = new float[my_N*my_D];
  h_res2 = new float[my_N*my_D];
  cudaMalloc(&d_res, my_N*my_D*sizeof(d_res[0]));
  for (int i = 0; i < my_loopSize; i++) h_data[i] = i%my_N + i/(my_N*my_N); //rand()/(float)RAND_MAX;
  cudaCheckErrors("CUDA failure");
  cudaMemcpy(d_data, h_data, my_loopSize*sizeof(d_data[0]), cudaMemcpyHostToDevice);
  // test original approach
  cudaMemset(d_res, 0, my_N*my_D*sizeof(d_res[0]));
  unsigned long long dt1 = dtime_usec(0);
  kernel::sumNND<<<my_numBlocks, my_blockSize>>>(d_data, d_res, my_N, my_D);
  cudaDeviceSynchronize();
  dt1 = dtime_usec(dt1);
  cudaMemcpy(h_res1, d_res, my_N*my_D*sizeof(d_res[0]), cudaMemcpyDeviceToHost);

  //test columnwise reduction
  unsigned long long dt2 = dtime_usec(0);
  //sum_rows_NND<<<my_N*my_D, bsize>>>(d_data, d_res, my_N, my_D);
  sum_cols_NND<<<(my_N + bsize -1)/bsize, bsize>>>(d_data, d_res, my_N, my_D);
  cudaDeviceSynchronize();
  dt2 = dtime_usec(dt2);
  cudaMemcpy(h_res2, d_res, my_N*my_D*sizeof(d_res[0]), cudaMemcpyDeviceToHost);

  // validate results
  for (int i = 0; i < my_N*my_D; i++)
    if (fabsf(h_res1[i] - h_res2[i]) > TOL) {printf("mismatch at %d, was %f, should be %f\n", i, h_res2[i], h_res1[i]); return -1;}
  cudaCheckErrors("program error");

  printf("results match,  kernel 1 time: %fs, kernel 2 time: %fs\n", dt1/(float)USECPSEC, dt2/(float)USECPSEC);
  // time row reduction kernel
  unsigned long long dt3 = dtime_usec(0);
  sum_rows_NND<<<my_N*my_D, bsize>>>(d_data, d_res, my_N, my_D);
  cudaDeviceSynchronize();
  dt3 = dtime_usec(dt3);
  printf("row reduction kernel time: %fs\n", dt3/(float)USECPSEC);
  cudaCheckErrors("program error");
}
$ nvcc -arch=sm_52 -o t1263 t1263.cu
$ ./t1263
mismatch at 10000, was 10000.000000, should be 1.000000
$

This depends on which order your matrix is stored in and which dimension you want to reduce along.

For the moment, I'll ignore the D dimension, as the operation can be thought of as reducing a matrix of N x N entries where every entry contains multiple floats.

If your matrix is stored in row-major order and you want to reduce each row to its sum (or, equivalently, column-major storage and column reduction), the answer is simple (sketched below as a self-contained kernel; the kernel name is illustrative):

__global__ void sum_rows(const float* devPtrIn, float* devPtrOut, const int N) {
    const int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < N) { // necessary if N is not divisible by the thread block size
        float sum = 0; // stores the partial sum in a register
        for (int col = 0; col < N; ++col) {
            sum += devPtrIn[col + N * row];
        }
        devPtrOut[row] = sum; // each thread writes its own row; no atomic necessary
    }
}
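A hypothetical launch for this sketch (the block size is illustrative, not prescribed):

const int blockSize = 256;
sum_rows<<<(N + blockSize - 1) / blockSize, blockSize>>>(devPtrIn, devPtrOut, N);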

This way, every thread reads a long contiguous stretch of memory (see NVIDIA's Parallel Forall blog for a discussion of global memory access patterns) and needs no shared memory and no global writes except for the final result.

If you want to reduce along a minor dimension - let's say column reduction on a row-major matrix - the answer becomes a bit more difficult: because of the large stride, memory access would more or less behave like random access if we used only one entry of the column at a time.

Thus it makes sense for every thread to reduce a small number of adjacent columns in parallel and keep the partial results in shared memory. A sketch follows (the kernel wrapper, the numCols value, and the block size are illustrative; the original left them open):

constexpr int numCols = 4;     // illustrative value; tune as discussed below
constexpr int blockSize = 256; // must match the launch configuration; the shared
                               // array size has to be a compile-time constant

__global__ void sum_cols(const float* devPtrIn, float* devPtrOut, const int N) {
    __shared__ float partial[numCols * blockSize];
    const int threadId = blockIdx.x * blockDim.x + threadIdx.x;
    const int begin_col = threadId * numCols;
    const int end_col = min(N, (threadId + 1) * numCols);
    // initialize this thread's slice of partial to 0
    for (int c = 0; c < numCols; ++c)
        partial[threadIdx.x * numCols + c] = 0.0f;
    for (int row = 0; row < N; ++row) {
        for (int col = begin_col; col < end_col; ++col) {
            partial[threadIdx.x * numCols + (col - begin_col)] += devPtrIn[col + N * row];
        }
    }
    // store this thread's column sums to global memory
    for (int col = begin_col; col < end_col; ++col)
        devPtrOut[col] = partial[threadIdx.x * numCols + (col - begin_col)];
}

Depending on the number of registers per thread your GPU has, it might also be possible to keep the partial sums in registers, by unrolling the inner loop and using local variables instead of an array, since arrays are usually not stored in registers.
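As an illustration, here is a minimal sketch of that register-based variant (everything here is hypothetical: the kernel name, the choice of 3 columns per thread, and the simplification that leftover columns when N is not a multiple of 3 are ignored):

__global__ void sum_cols_registers(const float* __restrict__ devPtrIn,
                                   float* __restrict__ devPtrOut, const int N) {
    // each thread owns 3 adjacent columns, with one register accumulator per column
    const int col0 = (blockIdx.x * blockDim.x + threadIdx.x) * 3;
    if (col0 + 2 < N) {
        float sum0 = 0.0f, sum1 = 0.0f, sum2 = 0.0f;
        for (int row = 0; row < N; ++row) {
            const float* p = devPtrIn + col0 + N * row; // one contiguous 12-byte read
            sum0 += p[0];
            sum1 += p[1];
            sum2 += p[2];
        }
        devPtrOut[col0]     = sum0;
        devPtrOut[col0 + 1] = sum1;
        devPtrOut[col0 + 2] = sum2;
    }
}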

Either way, we always read contiguous blocks of numCols floats from memory, which gives a much larger bandwidth than access with large strides.

You'll probably have to experiment to find an optimal value for numCols: it should be large enough that at least the full memory bus width of the GPU is used when loading such a block, and at the same time small enough that all the shared memory for a single thread block fits onto the GPU (again, see Parallel Forall for details).
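For a rough sense of the shared-memory constraint (arithmetic only, assuming the common 48 KB of shared memory per block): with a block size of 256 threads, the partial array occupies numCols * 256 * 4 bytes, so numCols = 4 costs just 4 KB per block, and even numCols = 48 would still fit. In practice, a value large enough to produce full-width (e.g. 128-byte) loads per warp is usually sufficient.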
