
How to prevent the copy of thrust's device_vector to device

So I have a helper class (creatively named "BetterVector") that is designed to be passed back and forth between host and device, with most of its functionality accessible from either side (a significant shortcoming of device_vector). However, kernels fail with a non-descriptive allocation error.

From the stack trace, it appears to trigger sometimes in the copy constructor and sometimes in the destructor, and I'm not entirely sure why it changes. I figured the cause was the device_vector data member, which has a host-only constructor and destructor, so following the post below I wrapped it in a union to prevent those functions from being called, but the issue still persists. Any suggestions would be greatly appreciated.

main.cu test file:

#include <abstract/BetterVector.cuh>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/for_each.h>
#include <thrust/execution_policy.h>

struct thrust_functor {
    abstract::BetterVector<int> vector;

    explicit thrust_functor(const abstract::BetterVector<int> &vector) : vector(vector) {}

    __host__ void operator()(int i) {
        printf("Thrust functor index %d: %d\n", i, (int) vector[i]);
    }
};

__global__ void baseCudaPrint(abstract::BetterVector<int>* ptr) {
    const size_t i = blockIdx.x * blockDim.x + threadIdx.x;
    abstract::BetterVector<int> vector = *ptr;
    printf("Cuda kernel index %zu: %d\n", i, (int) vector[i]);
}


int main() {
    abstract::BetterVector<int> vector({1, 2, 3, 4});
    for (int i = 0; i < 4; i++) {
        printf("Host index %d: %d\n", i, (int) vector[i]);
    }
    printf("\n");

    abstract::BetterVector<int>* devVectorPtr;
    cudaMalloc(&devVectorPtr, sizeof(abstract::BetterVector<int>));
    cudaMemcpy(devVectorPtr, &vector, 1, cudaMemcpyHostToDevice);
    baseCudaPrint<<<1, vector.size()>>>(devVectorPtr);
    cudaDeviceSynchronize();
    cudaFree(devVectorPtr);
    printf("\n");

    thrust::counting_iterator<int> first(0);
    thrust::counting_iterator<int> last = first + vector.size();
    thrust::for_each(thrust::host, first, last, thrust_functor(vector));
    cudaDeviceSynchronize();
    printf("\n");
}

abstract/BetterVector.cuh:

#include <thrust/device_vector.h>
#include <thrust/device_ptr.h>
#include <thrust/functional.h>

namespace abstract {
template<typename T>
    struct equal_to : public thrust::unary_function<T, bool> {
        T lhs;

        __device__ __host__ explicit equal_to(T lhs) : lhs(lhs) {}

        __device__ __host__ bool operator()(T rhs) {
            return lhs == rhs;
        }
    };
template<typename T, typename VecType = thrust::device_vector<T>>
class BetterVector {
protected:
    typename VecType::pointer raw;
    size_t cachedSize;
    union {
        VecType vector;
    };

public:

    __host__ BetterVector() : vector(), raw(vector.data()), cachedSize(0) {}

    __host__ explicit BetterVector(size_t size) : vector(size), raw(vector.data()), cachedSize(size) {}

    __host__ explicit BetterVector(VecType vec) : vector(vec), raw(vector.data()), cachedSize(vec.size()) {}

    __host__ explicit BetterVector(std::vector<T> vec) : vector(vec), raw(vector.data()), cachedSize(vec.size()) {}

    __host__ __device__ BetterVector(const BetterVector &otherVec) :
#ifndef __CUDA_ARCH__
            vector(otherVec.vector),
#endif
            cachedSize(otherVec.cachedSize), raw(otherVec.raw) {}


    __host__ __device__ virtual ~BetterVector() {
#ifndef __CUDA_ARCH__
        vector.~VecType();
#endif
    }

    __host__ __device__ typename VecType::const_reference operator[](size_t index) const {
#ifdef __CUDA_ARCH__
        return raw[index];
#else
        return vector[index];
#endif
    }

    __host__ __device__ size_t size() const {
#ifdef __CUDA_ARCH__
        return cachedSize;
#else
        return vector.size();
#endif
    }
};
}

The central issue here seems to be that by using the trick of placing items in a union so that constructors and destructors are not automatically called, you have prevented proper initialization of vector, and your constructor(s) are not accomplishing that initialization.

  1. For the first part of the test code, up through the CUDA kernel call, there is one constructor of interest for this particular observation, here:

     __host__ explicit BetterVector(std::vector<T> vec): vector(vec), raw(vector.data()), cachedSize(vec.size()) {}

    My claim is that vector(vec) is not properly constructing vector. I suspect this revolves around the use of the union, wherein the defined constructor is not called (possibly a copy-initializer is used instead, but this is not clear to me).

    In any event, we can use a clue from the link you provided to resolve this:

constructor can be called through so-called "placement new"

  2. As mentioned in the comments, this copy operation cannot possibly be correct; it is only copying 1 byte:

     cudaMemcpy(devVectorPtr, &vector, 1, cudaMemcpyHostToDevice);
  3. The device version of printf doesn't seem to understand the format specifier %zu, so I replaced it with %lu.

  4. It's not a problem per se, but it may be worthwhile to point out that this line of code:

     abstract::BetterVector<int> vector = *ptr;

    produces a separate BetterVector object in each thread, initialized from the object passed to the kernel.

This level of "fixing" will get you to the point where your main code appears to run correctly up through the CUDA kernel launch. However, the thrust code thereafter still has a problem that I haven't been able to sort out. Due to your code design (using a device_vector in a thrust host path, which is very odd), the call to for_each, if working properly, should generate 3 kernel calls "under the hood" even though it is a host function. Anyway, I'm not able to sort that out for you, but I can say that each of the 3 kernel calls triggers a call to your __host__ __device__ copy constructor (as well as the corresponding destructor), which doesn't surprise me: thrust passes a BetterVector object by value to each kernel launch, and doing so triggers a constructor/destructor sequence to support the pass-by-value operation. So given that we had to jump through hoops to get the previous constructor "working", there may be an issue in that sequence, but I haven't been able to pinpoint it.

Anyway, here is an example that has the items above addressed:

$ cat t37.cu
#include <thrust/device_vector.h>
#include <thrust/device_ptr.h>
#include <thrust/functional.h>

namespace abstract {
template<typename T>
    struct equal_to : public thrust::unary_function<T, bool> {
        T lhs;

        __device__ __host__ explicit equal_to(T lhs) : lhs(lhs) {}

        __device__ __host__ bool operator()(T rhs) {
            return lhs == rhs;
        }
    };
template<typename T, typename VecType = thrust::device_vector<T>>
class BetterVector {
protected:
    typename VecType::pointer raw;
    size_t cachedSize;
    union {
        VecType vector;
    };

public:

    __host__ BetterVector() : vector(), raw(vector.data()), cachedSize(0) {}

    __host__ explicit BetterVector(size_t size) : vector(size), raw(vector.data()), cachedSize(size) {}

    __host__ explicit BetterVector(VecType vec) : vector(vec), raw(vector.data()), cachedSize(vec.size()) {}

//    __host__ explicit BetterVector(std::vector<T> vec) : vector(vec), raw(vector.data()), cachedSize(vec.size()) {}
    __host__ explicit BetterVector(std::vector<T> vec) : cachedSize(vec.size()) { new (&vector) VecType(vec); raw = vector.data();}

    __host__ __device__ BetterVector(const BetterVector &otherVec) :
#ifndef __CUDA_ARCH__
            vector(otherVec.vector),
#endif
            cachedSize(otherVec.cachedSize), raw(otherVec.raw) {}


    __host__ __device__ virtual ~BetterVector() {
#ifndef __CUDA_ARCH__
        vector.~VecType();
#endif
    }

    __host__ __device__ typename VecType::const_reference operator[](size_t index) const {
#ifdef __CUDA_ARCH__
        return raw[index];
#else
        return vector[index];
#endif
    }

    __host__ __device__ size_t size() const {
#ifdef __CUDA_ARCH__
        return cachedSize;
#else
        return vector.size();
#endif
    }
};
}


struct thrust_functor {
    abstract::BetterVector<int> vector;

    explicit thrust_functor(const abstract::BetterVector<int> &vector) : vector(vector) {}

    __host__ void operator()(int i) {
        printf("Thrust functor index %d: %d\n", i, (int) vector[i]);
    }
};

__global__ void baseCudaPrint(abstract::BetterVector<int>* ptr) {
    const size_t i = blockIdx.x * blockDim.x + threadIdx.x;
    abstract::BetterVector<int> vector = *ptr;
    printf("Cuda kernel index %lu: %d\n", i, (int) vector[i]);
}


int main() {
        // these indented lines mysteriously "fix" the thrust problems
        thrust::device_vector<int> x1(4,1);
        thrust::device_vector<int> x2(x1);
        //
    abstract::BetterVector<int> vector({1, 2, 3, 4});
    for (int i = 0; i < 4; i++) {
        printf("Host index %d: %d\n", i, (int) vector[i]);
    }
    printf("\n");

    abstract::BetterVector<int>* devVectorPtr;
    cudaMalloc(&devVectorPtr, sizeof(abstract::BetterVector<int>));
    cudaMemcpy(devVectorPtr, &vector, sizeof(abstract::BetterVector<int>), cudaMemcpyHostToDevice);
    baseCudaPrint<<<1, vector.size()>>>(devVectorPtr);
    cudaDeviceSynchronize();
    cudaFree(devVectorPtr);
    printf("\n");

    thrust::counting_iterator<int> first(0);
    thrust::counting_iterator<int> last = first + vector.size();
    thrust::for_each(thrust::host, first, last, thrust_functor(vector));
    cudaDeviceSynchronize();
    printf("\n");
}
$ nvcc -std=c++14 t37.cu -o t37 -lineinfo -arch=sm_70
$ cuda-memcheck ./t37
========= CUDA-MEMCHECK
Host index 0: 1
Host index 1: 2
Host index 2: 3
Host index 3: 4

Cuda kernel index 0: 1
Cuda kernel index 1: 2
Cuda kernel index 2: 3
Cuda kernel index 3: 4

Thrust functor index 0: 1
Thrust functor index 1: 2
Thrust functor index 2: 3
Thrust functor index 3: 4

========= ERROR SUMMARY: 0 errors
$

I'll also add a subjective comment: I think this code design is going to be troublesome (in case that is not clear already), and I would suggest that you consider another path for a "universal" vector. To pick just one example, your method of allowing access from host code via the thrust-provided [] operator is going to be horribly slow, because it invokes a separate cudaMemcpy for each item accessed that way. Anyway, good luck!
