How to override Thrust's low-level device memory allocator using a CUDA Thrust execution policy

Question

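I want to override the low-level CUDA device memory allocator (implemented as thrust::system::cuda::detail::malloc()) so that, when it is called from a host (CPU) thread, it uses a custom allocator instead of calling cudaMalloc() directly.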

Is this possible? If so, can it be done using the Thrust "execution policy" mechanism? I tried a model like this:

struct eptCGA : thrust::system::cuda::detail::execution_policy<eptCGA>
{
};

/// overload the Thrust malloc() template function implementation
template<typename eptCGA> __host__ __device__ void* malloc( eptCGA, size_t n )
{
#ifndef __CUDA_ARCH__
    return MyMalloc( n );   /* (called from a host thread) */
#else
    return NULL;            /* (called from a device GPU thread) */
#endif
}


/* called as follows, for example */
eptCGA epCGA;
thrust::remove_if( epCGA, ... );

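This works. But there are other Thrust components that call the low-level malloc implementation and do not seem to use the "execution policy" mechanism. For example,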

    thrust::device_vector<UINT64> MyDeviceVector( ... );

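does not expose an overload that takes an "execution policy" argument. Instead, malloc() is called at the bottom of 15 nested function calls, using an execution policy that appears to be hard-wired into one of the Thrust functions somewhere in the middle of the call stack.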
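The situation described above can be modeled in plain C++ without Thrust. This is a simplified sketch, not Thrust's real code: every name in it (`lib`, `user`, `default_policy`, `my_policy`, `algorithm_with_scratch`, `container`, the counters) is hypothetical. It assumes that an algorithm receiving a policy argument calls `malloc(policy, n)` unqualified, so argument-dependent lookup can find a user overload declared next to the user's policy type, while a container allocates through a policy fixed inside the library.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

// Hypothetical model; none of these names are Thrust's API.
namespace lib {
struct default_policy {};  // stands in for the built-in CUDA policy
int default_mallocs = 0;   // counts allocations on the built-in path

// Built-in low-level allocator (plays the role of the cudaMalloc path).
void* malloc(default_policy, std::size_t n) {
  ++default_mallocs;
  return std::malloc(n);
}

// Algorithm path: malloc(policy, n) is called unqualified, so
// argument-dependent lookup can find an overload declared beside Policy.
template <class Policy>
void algorithm_with_scratch(Policy policy, std::size_t n) {
  void* scratch = malloc(policy, n);
  std::free(scratch);
}

// Container path: the policy is fixed inside the library, so no
// caller-supplied policy ever reaches this allocation.
class container {
 public:
  explicit container(std::size_t n) : data_(malloc(default_policy{}, n)) {}
  container(const container&) = delete;
  container& operator=(const container&) = delete;
  ~container() { std::free(data_); }

 private:
  void* data_;
};
}  // namespace lib

// User code: a derived policy plus a malloc overload for it (like eptCGA).
namespace user {
struct my_policy : lib::default_policy {};
int my_mallocs = 0;

void* malloc(my_policy, std::size_t n) {
  ++my_mallocs;
  return std::malloc(n);
}
}  // namespace user
```

With these definitions, `lib::algorithm_with_scratch(user::my_policy{}, 16)` allocates through `user::malloc`, but constructing `lib::container c(16)` still allocates through `lib::malloc`. That mirrors why the custom policy takes effect for `thrust::remove_if` but not for `thrust::device_vector`.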

Can someone clarify where the approach I have taken goes wrong, and explain what a workable implementation should do?

Answer 1

Here's something that worked for me. You can create a custom execution policy and a custom allocator that both use your custom malloc:

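#include <cassert>   // assert
#include <cstddef>   // size_t
#include <cstdio>    // printf

#include <thrust/count.h>
#include <thrust/remove.h>
#include <thrust/system/cuda/execution_policy.h>
#include <thrust/system/cuda/memory.h>
#include <thrust/system/cuda/vector.h>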

// create a custom execution policy by deriving from the existing cuda::execution_policy
struct my_policy : thrust::cuda::execution_policy<my_policy> {};

// provide an overload of malloc() for my_policy
__host__ __device__ void* malloc(my_policy, size_t n )
{
  printf("hello, world from my special malloc!\n");

  return thrust::raw_pointer_cast(thrust::cuda::malloc(n));
}

// create a custom allocator which will use our malloc
// we can inherit from cuda::allocator to reuse its existing functionality
template<class T>
struct my_allocator : thrust::cuda::allocator<T>
{
  using super_t = thrust::cuda::allocator<T>;
  using pointer = typename super_t::pointer;

  pointer allocate(size_t n)
  {
    T* raw_ptr = reinterpret_cast<T*>(malloc(my_policy{}, sizeof(T) * n));

    // wrap the raw pointer in the special pointer wrapper for cuda pointers
    return pointer(raw_ptr);
  }
};

template<class T>
using my_vector = thrust::cuda::vector<T, my_allocator<T>>;

int main()
{
  my_vector<int> vec(10, 13);
  vec.push_back(7);

  assert(thrust::count(vec.begin(), vec.end(), 13) == 10);

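  // pass my_policy explicitly: vec's iterators are tagged with the plain CUDA
  // policy, so without it any temporary storage thrust::remove needs would
  // bypass our malloc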
  my_policy policy;
  auto new_end = thrust::remove(policy, vec.begin(), vec.end(), 13);
  vec.erase(new_end, vec.end());
  assert(vec.size() == 1);

  return 0;
}
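The answer's `my_vector` picks up the custom `malloc` because `thrust::cuda::vector` (like `thrust::device_vector`) takes an allocator as its second template parameter, so every allocation the container makes goes through that allocator's `allocate`. The Thrust-free C++ sketch below models only that routing. The names `lib`, `user`, `default_allocator`, `my_allocator`, `my_vector` and the counters are hypothetical, and the toy container handles trivially copyable element types only.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <type_traits>

// Hypothetical model; none of these names are Thrust's API.
namespace lib {
struct default_policy {};
int default_mallocs = 0;

void* malloc(default_policy, std::size_t n) {
  ++default_mallocs;
  return std::malloc(n);
}

// Default allocator: hard-wires the library's own policy.
template <class T>
struct default_allocator {
  T* allocate(std::size_t n) {
    return static_cast<T*>(malloc(default_policy{}, sizeof(T) * n));
  }
  void deallocate(T* p, std::size_t) { std::free(p); }
};

// Container parameterized on its allocator, like vector<T, Alloc>:
// every allocation it makes goes through Alloc::allocate.
template <class T, class Alloc = default_allocator<T>>
class vector {
  static_assert(std::is_trivially_copyable<T>::value,
                "this sketch only handles trivially copyable T");

 public:
  vector(std::size_t n, T value) : data_(alloc_.allocate(n)), size_(n) {
    std::fill(data_, data_ + n, value);
  }
  vector(const vector&) = delete;
  vector& operator=(const vector&) = delete;
  ~vector() { alloc_.deallocate(data_, size_); }

  // Always reallocates (no spare capacity) to keep the sketch short.
  void push_back(T value) {
    T* bigger = alloc_.allocate(size_ + 1);
    std::copy(data_, data_ + size_, bigger);
    bigger[size_] = value;
    alloc_.deallocate(data_, size_);
    data_ = bigger;
    ++size_;
  }

  std::size_t size() const { return size_; }
  const T& operator[](std::size_t i) const { return data_[i]; }

 private:
  Alloc alloc_;
  T* data_;
  std::size_t size_;
};
}  // namespace lib

namespace user {
struct my_policy : lib::default_policy {};
int my_mallocs = 0;

void* malloc(my_policy, std::size_t n) {
  ++my_mallocs;
  return std::malloc(n);
}

// Same interface as the default allocator, but routed through the
// custom malloc (compare my_allocator in the answer).
template <class T>
struct my_allocator : lib::default_allocator<T> {
  T* allocate(std::size_t n) {
    return static_cast<T*>(malloc(my_policy{}, sizeof(T) * n));
  }
};

template <class T>
using my_vector = lib::vector<T, my_allocator<T>>;
}  // namespace user
```

Building `user::my_vector<int> v(10, 13)` and then calling `v.push_back(7)` makes two allocations, both through `user::malloc`. The answer's real allocator differs in that it inherits from `thrust::cuda::allocator<T>` and returns a wrapped CUDA `pointer` rather than a raw `T*`.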

Here's the output on my system:

$ nvcc -std=c++11 -I. test.cu -run
hello, world from my special malloc!
hello, world from my special malloc!
hello, world from my special malloc!
hello, world from my special malloc!

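You can get fancier and use the thrust::pointer<T,Tag> wrapper to incorporate my_policy into a custom pointer type. This would have the effect of tagging my_vector's iterators with my_policy instead of the CUDA execution policy. That way, you wouldn't have to provide an explicit execution policy with every algorithm call (as the example does with its call to thrust::remove). Instead, Thrust would know to use your custom execution policy just by looking at the type of my_vector's iterators.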
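The tagging idea can likewise be sketched without Thrust. In this hypothetical model, `tagged_ptr<T, Tag>` plays a role loosely analogous to `thrust::pointer<T,Tag>`: the policy is part of the pointer type, so an algorithm overload that takes no policy argument can default-construct the policy from that type and dispatch on it. None of these names are Thrust API, and `count_equal` allocates scratch memory only so that the allocation path is observable.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

// Hypothetical model; none of these names are Thrust's API.
namespace lib {
struct default_policy {};
int default_mallocs = 0;

void* malloc(default_policy, std::size_t n) {
  ++default_mallocs;
  return std::malloc(n);
}

// Raw pointer whose type records the policy that owns the memory,
// loosely analogous to thrust::pointer<T, Tag>.
template <class T, class Tag>
struct tagged_ptr {
  T* raw;
};

// Explicit-policy overload: scratch memory comes from malloc(policy, n),
// found by argument-dependent lookup on Policy.
template <class Policy, class T, class Tag>
std::size_t count_equal(Policy policy, tagged_ptr<T, Tag> first,
                        std::size_t n, const T& value) {
  void* scratch = malloc(policy, sizeof(T) * n);  // only to observe the path
  std::size_t hits = 0;
  for (std::size_t i = 0; i < n; ++i) {
    if (first.raw[i] == value) ++hits;
  }
  std::free(scratch);
  return hits;
}

// No-policy overload: the policy is recovered from the pointer's type.
template <class T, class Tag>
std::size_t count_equal(tagged_ptr<T, Tag> first, std::size_t n,
                        const T& value) {
  return count_equal(Tag{}, first, n, value);
}
}  // namespace lib

namespace user {
struct my_policy : lib::default_policy {};
int my_mallocs = 0;

void* malloc(my_policy, std::size_t n) {
  ++my_mallocs;
  return std::malloc(n);
}

// Pointer type tagged with the custom policy (compare tagging my_vector's
// iterators with my_policy).
template <class T>
using my_ptr = lib::tagged_ptr<T, my_policy>;
}  // namespace user
```

Calling `lib::count_equal(p, 4, 13)` on a `user::my_ptr<int>` routes its scratch allocation through `user::malloc` without any explicit policy argument, while a pointer tagged with `lib::default_policy` keeps using `lib::malloc`.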

Answer 1: score 2, accepted, 2016-04-29 22:36:31