Computing the reduction sum of a device array with Thrust

Question

I know we can compute the sum of a CPU (host) array with Thrust like this:

int data[6] = {1, 0, 2, 2, 1, 3};
int result = thrust::reduce(data, data + 6, 0);

Can we find the sum of a GPU array without first copying it to a CPU array with cudaMemcpy?
Suppose I have a device array created with cudaMalloc like this:

cudaMalloc(&gpuspeed, n* sizeof(int));

Some kernels then modify gpuspeed. Can I now find its sum with Thrust? If so, what changes do I have to make?

Answer 1

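Yes, you can do this with Thrust.

You can pass the raw device pointer to Thrust, and if you explicitly specify the device execution path with a Thrust execution policy (thrust::device), Thrust will do the right thing.

Alternatively, you can refer to your data through a thrust::device_ptr, and Thrust will also do the right thing, even without an explicitly specified device execution path.

This answer covers both approaches, although it uses inclusive_scan.

Here is an example: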

$ cat t137.cu
#include <thrust/reduce.h>
#include <thrust/device_ptr.h>
#include <thrust/execution_policy.h>
#include <iostream>

__global__ void k(int *d, int n){
  int idx = threadIdx.x+blockDim.x*blockIdx.x;
  if (idx < n)
    d[idx] = idx;
}
const int ds = 10;
const int nTPB = 256;
int main(){

  int *d, r1, r2;
  cudaMalloc(&d, ds*sizeof(d[0]));
  k<<<(ds+nTPB-1)/nTPB,nTPB>>>(d, ds);
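  // Method 1: wrap the raw pointer so thrust knows it refers to device memory
  thrust::device_ptr<int> tdp = thrust::device_pointer_cast(d);
  r1 = thrust::reduce(tdp, tdp+ds);
  // Method 2: pass the raw pointer with an explicit device execution policy
  r2 = thrust::reduce(thrust::device, d, d+ds);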
  std::cout << "r1: "  << r1 << " r2: " << r2 << std::endl;
}
$ nvcc -std=c++14 -o t137 t137.cu
$ ./t137
r1: 45 r2: 45
$
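
Applied to the gpuspeed array from the question, the execution-policy approach might look like the sketch below. The kernel fill, the size n = 1000, and the launch configuration are assumptions made up for illustration; the question does not show them. Only the final int is returned to the host, so no explicit cudaMemcpy of the array is needed.

```cuda
#include <thrust/reduce.h>
#include <thrust/execution_policy.h>
#include <cuda_runtime.h>
#include <iostream>

// Hypothetical kernel standing in for whatever modifies gpuspeed.
__global__ void fill(int *gpuspeed, int n){
  int idx = threadIdx.x + blockDim.x*blockIdx.x;
  if (idx < n)
    gpuspeed[idx] = 1;
}

int main(){
  const int n = 1000;     // assumed array length
  const int nTPB = 256;   // assumed threads per block
  int *gpuspeed;
  cudaMalloc(&gpuspeed, n*sizeof(int));
  fill<<<(n+nTPB-1)/nTPB, nTPB>>>(gpuspeed, n);
  // thrust::device marks the raw pointers as device memory;
  // the trailing 0 is the initial value of the sum.
  int sum = thrust::reduce(thrust::device, gpuspeed, gpuspeed + n, 0);
  std::cout << "sum: " << sum << std::endl;
  cudaFree(gpuspeed);
  return 0;
}
```

This can be compiled with nvcc in the same way as t137.cu above. With every element set to 1, the program should print sum: 1000.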
