
How to use CUDA Thrust execution policy to override Thrust's low-level device memory allocator

I want to override the low-level CUDA device memory allocator (implemented as thrust::system::cuda::detail::malloc()) so that it uses a custom allocator instead of calling cudaMalloc() directly when invoked on a host (CPU) thread.

Is this possible? If so, is it possible to use the Thrust "execution policy" mechanism to do it? I have tried a model like this:

struct eptCGA : thrust::system::cuda::detail::execution_policy<eptCGA>
{
};

/// overload the Thrust malloc() function for this execution policy
/// (note: no template parameter here -- a template parameter named eptCGA
/// would shadow the policy type and match any argument)
__host__ __device__ void* malloc( eptCGA, size_t n )
{
#ifndef __CUDA_ARCH__
    return MyMalloc( n );   /* (called from a host thread) */
#else
    return NULL;            /* (called from a device GPU thread) */
#endif
}


/* called as follows, for example */
eptCGA epCGA;
thrust::remove_if( epCGA, ... );

This works. But there are other components of Thrust that call down to the low-level malloc implementation, seemingly without using the "execution policy" mechanism. For example,

    thrust::device_vector<UINT64> MyDeviceVector( ... );

does not expose an overload with an "execution policy" parameter. Instead, malloc() gets invoked at the bottom of 15 nested function calls, using an execution policy that is seemingly hardwired into one of the Thrust functions somewhere in the middle of that call stack.

Can someone please clarify where the approach I am taking goes wrong, and explain what a workable implementation should do?

Here's something that worked for me. You can create both a custom execution policy and an allocator that uses your custom malloc, all in one go:

#include <thrust/system/cuda/execution_policy.h>
#include <thrust/system/cuda/memory.h>
#include <thrust/system/cuda/vector.h>
#include <thrust/remove.h>

// create a custom execution policy by deriving from the existing cuda::execution_policy
struct my_policy : thrust::cuda::execution_policy<my_policy> {};

// provide an overload of malloc() for my_policy
__host__ __device__ void* malloc(my_policy, size_t n )
{
  printf("hello, world from my special malloc!\n");

  return thrust::raw_pointer_cast(thrust::cuda::malloc(n));
}

// create a custom allocator which will use our malloc
// we can inherit from cuda::allocator to reuse its existing functionality
template<class T>
struct my_allocator : thrust::cuda::allocator<T>
{
  using super_t = thrust::cuda::allocator<T>;
  using pointer = typename super_t::pointer;

  pointer allocate(size_t n)
  {
    T* raw_ptr = reinterpret_cast<T*>(malloc(my_policy{}, sizeof(T) * n));

    // wrap the raw pointer in the special pointer wrapper for cuda pointers
    return pointer(raw_ptr);
  }
};

template<class T>
using my_vector = thrust::cuda::vector<T, my_allocator<T>>;

int main()
{
  my_vector<int> vec(10, 13);
  vec.push_back(7);

  assert(thrust::count(vec.begin(), vec.end(), 13) == 10);

  // because we're superstitious
  my_policy policy;
  auto new_end = thrust::remove(policy, vec.begin(), vec.end(), 13);
  vec.erase(new_end, vec.end());
  assert(vec.size() == 1);

  return 0;
}

Here's the output on my system:

$ nvcc -std=c++11 -I. test.cu -run
hello, world from my special malloc!
hello, world from my special malloc!
hello, world from my special malloc!
hello, world from my special malloc!

You could get even fancier and use the thrust::pointer<T,Tag> wrapper to incorporate my_policy into a custom pointer type. This would have the effect of tagging my_vector's iterators with my_policy instead of the CUDA execution policy. That way, you wouldn't have to provide an explicit execution policy with each algorithm invocation (as the example does with the invocation of thrust::remove). Instead, Thrust would know to use your custom execution policy just by looking at the types of my_vector's iterators.
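As a rough, untested sketch of that tagged-pointer idea (using thrust::cuda::malloc here as a stand-in for the custom allocation, and reusing the my_policy type from the answer above): wrapping raw storage in thrust::pointer<T, my_policy> lets algorithms dispatch on the tag carried by the iterators themselves, with no explicit policy argument.

```cuda
#include <thrust/system/cuda/execution_policy.h>
#include <thrust/system/cuda/memory.h>
#include <thrust/memory.h>
#include <thrust/fill.h>

// the same custom policy as in the answer above
struct my_policy : thrust::cuda::execution_policy<my_policy> {};

int main()
{
  const size_t n = 10;

  // stand-in for a custom allocation (e.g. MyMalloc from the question)
  int* raw = thrust::raw_pointer_cast(thrust::cuda::malloc<int>(n));

  // wrap the raw storage in pointers tagged with my_policy
  thrust::pointer<int, my_policy> first(raw);
  thrust::pointer<int, my_policy> last = first + n;

  // no explicit policy argument: Thrust dispatches on the pointers' tag,
  // so any malloc()/for_each()/etc. overloads for my_policy get selected
  thrust::fill(first, last, 13);

  thrust::cuda::free(thrust::cuda::pointer<void>(raw));
  return 0;
}
```

The same tagging could be folded into my_allocator's pointer typedef so that my_vector's iterators carry the tag automatically, which is what the paragraph above describes.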
