
CUDA Thrust Functor with Flexibility to Run in CPU or GPU

This might be a stupid question, but I can't seem to find any resources specifically about it, so your opinion is appreciated.

Let's say I have some functor

struct AddOne {
    thrust::device_ptr<int> numbers;

    __device__
    void operator()(int i) {
        numbers[i] = numbers[i] + 1;
    }
};

that I can call like this:

AddOne addOneFunctor;
thrust::device_vector<int> idx(100), numbers(100);
addOneFunctor.numbers = numbers.data();
thrust::sequence(idx.begin(), idx.end(), 0);
thrust::for_each(thrust::device, idx.begin(), idx.end(), addOneFunctor);

Is it possible to write the above so that the execution policy can be changed at compile time or, ideally, at run time?

E.g. change the struct to

struct AddOne {
    thrust::pointer<int> numbers;

    __host__ __device__
    void operator()(int i) {
        numbers[i] = numbers[i] + 1;
    }
};

so it can be run with something like:

AddOne addOneFunctor;
std::vector<int> idx(100), numbers(100);
addOneFunctor.numbers = numbers.data();
thrust::sequence(idx.begin(), idx.end(), 0);
thrust::for_each(thrust::cpp::par, idx.begin(), idx.end(), addOneFunctor);

The bottom line is: I would like a single code base where I can decide to use either thrust::device_vector or some sort of host vector (such as std::vector) and run these on the GPU (using the thrust::device execution policy) or the CPU (using thrust::cpp::par or a similar policy), respectively.

PS: I would like to avoid PGI for now.

Yes, it's possible, pretty much exactly as you describe.

Here's a fully-worked example:

$ cat t1205.cu
#include <thrust/execution_policy.h>
#include <thrust/for_each.h>
#include <thrust/device_vector.h>
#include <thrust/sequence.h>
#include <iostream>
#include <vector>

struct AddOne {
    int *numbers;        // raw pointer: may refer to host or device memory
    template <typename T>
    __host__ __device__  // callable from both host and device code
    void operator()(T &i) {
        numbers[i] = numbers[i] + 1;
    }
};

int main(){

  AddOne addOneFunctor;

  // Host path: std::vector data with the sequential C++ execution policy.
  std::vector<int> idx(100), numbers(100);
  addOneFunctor.numbers = thrust::raw_pointer_cast(numbers.data());
  thrust::sequence(idx.begin(), idx.end(), 0);
  thrust::for_each(thrust::cpp::par, idx.begin(), idx.end(), addOneFunctor);
  for (int i = 0; i < 5; i++)
    std::cout << numbers[i] << ",";
  std::cout << std::endl;

  // Device path: the same functor, repointed at device memory and run
  // with the device execution policy.
  thrust::device_vector<int> didx(100), dnumbers(100);
  addOneFunctor.numbers = thrust::raw_pointer_cast(dnumbers.data());
  thrust::sequence(didx.begin(), didx.end(), 0);
  thrust::for_each(thrust::device, didx.begin(), didx.end(), addOneFunctor);
  for (int i = 0; i < 5; i++)
    std::cout << dnumbers[i] << ",";
  std::cout << std::endl;
}
$ nvcc -o t1205 t1205.cu
$ ./t1205
1,1,1,1,1,
1,1,1,1,1,
$

Note that the algorithm used is thrust::sequence, not thrust::seq.
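To make the distinction concrete: thrust::sequence is an algorithm (it fills a range with ascending values), whereas thrust::seq is an execution policy that requests sequential execution in the calling thread. A minimal sketch, reusing the AddOne functor and host vectors from the example above:

std::vector<int> idx(100), numbers(100);
AddOne f;
f.numbers = thrust::raw_pointer_cast(numbers.data());
thrust::sequence(idx.begin(), idx.end(), 0);              // algorithm: idx = 0,1,2,...
thrust::for_each(thrust::seq, idx.begin(), idx.end(), f); // policy: run sequentially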

Tested with CUDA 8 RC.

As @ms points out, the explicit use of execution policies on the algorithms in the code above is not necessary; you could remove them and it would work the same way. However, the formal usage of execution policies allows the above example to be extended to the case where you are not using containers but ordinary host and device data, so it may still have some value, depending on your overall goals.
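For instance, here is a minimal sketch (the variable names are my own, not from the original example) of the same functor driving raw host and raw device allocations, with the execution policy doing the dispatch:

#include <thrust/execution_policy.h>
#include <thrust/for_each.h>
#include <thrust/sequence.h>
#include <cuda_runtime.h>
#include <iostream>

struct AddOne {
    int *numbers;
    template <typename T>
    __host__ __device__
    void operator()(T &i) { numbers[i] = numbers[i] + 1; }
};

int main(){
  const int n = 100;
  AddOne f;

  // Raw host data, processed on the CPU:
  int *h_idx = new int[n];
  int *h_numbers = new int[n]();   // zero-initialized
  thrust::sequence(thrust::cpp::par, h_idx, h_idx + n, 0);
  f.numbers = h_numbers;
  thrust::for_each(thrust::cpp::par, h_idx, h_idx + n, f);
  std::cout << h_numbers[0] << std::endl;   // 1

  // Raw device data, processed on the GPU:
  int *d_idx, *d_numbers;
  cudaMalloc(&d_idx, n * sizeof(int));
  cudaMalloc(&d_numbers, n * sizeof(int));
  cudaMemset(d_numbers, 0, n * sizeof(int));
  thrust::sequence(thrust::device, d_idx, d_idx + n, 0);
  f.numbers = d_numbers;
  thrust::for_each(thrust::device, d_idx, d_idx + n, f);
  int result;
  cudaMemcpy(&result, d_numbers, sizeof(int), cudaMemcpyDeviceToHost);
  std::cout << result << std::endl;         // 1

  delete[] h_idx; delete[] h_numbers;
  cudaFree(d_idx); cudaFree(d_numbers);
}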

Would this fit your requirement?

  1. Always use thrust::device_vector to run on device;

  2. Define different macros at compile time to select whether that device is the GPU or the CPU (OpenMP/TBB/CPP); see the sketch after the link below.

More info here:

https://github.com/thrust/thrust/wiki/Device-Backends
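A minimal sketch of this approach (the file name backend_switch.cu is hypothetical; the compile lines follow the macro convention documented in that wiki page):

// The source always uses thrust::device_vector; the THRUST_DEVICE_SYSTEM
// macro decides at compile time where "device" actually runs:
//   GPU (default CUDA backend):
//     nvcc backend_switch.cu -o backend_cuda
//   CPU, sequential C++ backend:
//     nvcc backend_switch.cu -o backend_cpp -DTHRUST_DEVICE_SYSTEM=THRUST_DEVICE_SYSTEM_CPP
//   CPU, OpenMP backend:
//     nvcc backend_switch.cu -o backend_omp -DTHRUST_DEVICE_SYSTEM=THRUST_DEVICE_SYSTEM_OMP -Xcompiler -fopenmp
#include <thrust/device_vector.h>
#include <thrust/for_each.h>
#include <thrust/sequence.h>
#include <iostream>

struct AddOne {
    int *numbers;
    template <typename T>
    __host__ __device__
    void operator()(T &i) { numbers[i] = numbers[i] + 1; }
};

int main(){
  // "device" memory and execution are whatever backend the macro selected.
  thrust::device_vector<int> idx(100), numbers(100);
  AddOne f;
  f.numbers = thrust::raw_pointer_cast(numbers.data());
  thrust::sequence(idx.begin(), idx.end(), 0);
  thrust::for_each(idx.begin(), idx.end(), f);
  std::cout << numbers[0] << std::endl; // prints 1 on every backend
}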
