CUDA Thrust device_vector resize fails with a misaligned address

Question

This is odd: thrust::device_vector.resize throws with cudaErrorMisalignedAddress, but only if I first call curandGenerateNormal with a start address that is not aligned to 8 bytes:

#include <cuda_runtime.h>
#include <curand.h>
#include <thrust/device_vector.h>
#include <assert.h>

int main()
{
    thrust::device_vector<float> a(16), b(0);

    curandGenerator_t _prng;
    curandStatus_t curandStat = curandCreateGenerator(&_prng, CURAND_RNG_PSEUDO_DEFAULT);
    assert(curandStat == CURAND_STATUS_SUCCESS);

    bool breakCUDA = true;

    if (breakCUDA) {
        // this curand call (not 8-byte aligned) somehow breaks subsequent resize
        float *start_p1 = a.data().get() + 1;
        curandStat = curandGenerateNormal(_prng, start_p1, 2, 0.0f, 1.0f);
        assert(curandStat == CURAND_STATUS_SUCCESS);
    }
    else {
         // this one, using an 8-byte aligned pointer works fine
         float *start = a.data().get();
         curandStat = curandGenerateNormal(_prng, start, 2, 0.0f, 1.0f);
         assert(curandStat == CURAND_STATUS_SUCCESS);
    }

    // note: either call above returns CURAND_STATUS_SUCCESS

    // but this throws thrust::system_error with error cudaErrorMisalignedAddress
    // if the unaligned pointer was used before
    b.resize(16);
}

In my real code I need to use different generation parameters (the 0.0f, 1.0f) on different segments of the first vector, and the segment boundaries are not necessarily memory aligned.

The doc for curandGenerateNormal says the length has to be even (as it is in both cases) but doesn't mention anything about alignment.

I have a workaround now: I check whether the pointer I'm about to pass to curandGenerateNormal is aligned to 8 bytes, and if not I generate into some temporary memory and copy the result over. But I'd appreciate it if anyone has more insight into what is going on, so I can make sure I do the right thing in the future. Are there any other thrust or curand methods where I have to be careful about alignment?
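The alignment test at the heart of this workaround can be sketched in plain host C++. This is only a sketch, not the code from the question: the helper names is_aligned and needs_staging are hypothetical, and the 8-byte requirement for float is the one observed here rather than a documented cuRAND guarantee. The check only inspects the address value, so it can be applied to a device pointer without dereferencing it.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper: true if p sits on an `alignment`-byte boundary.
// `alignment` is assumed to be a power of two.
inline bool is_aligned(const void* p, std::size_t alignment) {
    return (reinterpret_cast<std::uintptr_t>(p) & (alignment - 1)) == 0;
}

// Hypothetical helper: true if a float segment starting at p should be
// generated into an aligned temporary buffer and then copied into place,
// because p is not aligned to 2 * sizeof(float) (8 bytes).
inline bool needs_staging(const float* p) {
    return !is_aligned(p, 2 * sizeof(float));
}
```

When needs_staging returns false, the pointer could be handed to curandGenerateNormal directly; otherwise the segment would be generated into a separately allocated buffer and copied over, as described above.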

This is CUDA 6.5 on Windows.

Thanks.

Answer 1

I think the fundamental issue is that curandGenerateNormal expects to write a quantity that is aligned to twice the size of the fundamental data type (float, in this case). Therefore, the pointer you pass to curandGenerateNormal, when using a PRNG such as the default XORWOW generator, should be aligned to twice the size of the data type (i.e. 8-byte aligned in this case, or 16-byte aligned in the case of curandGenerateNormalDouble, for example). I don't believe the issue has anything to do with thrust.

Although the issue is not well documented as far as I can see, a hint of it may be found in the documentation you linked:

Normally distributed results are generated from pseudorandom generators with a Box-Muller transform, and so require n to be even.

Let's consider a slightly different test case, to prove that thrust is not at issue, and to take a look at what is going on under the hood:

$ cat t625.cu
#include <curand.h>
#include <iostream>
#define DSIZE 4

int main(){

  curandGenerator_t _prng;
  curandStatus_t curandStat = curandCreateGenerator(&_prng, CURAND_RNG_PSEUDO_DEFAULT);
  float *h_a, *d_a;
  h_a = (float *)malloc(DSIZE*sizeof(float));
  cudaMalloc(&d_a, DSIZE*sizeof(float));
  cudaMemset(d_a, 0, DSIZE*sizeof(float));
  float *start_p1 = d_a+ 1;
  curandStat = curandGenerateNormal(_prng, start_p1, 2, 0.0f, 1.0f);
  cudaMemcpy(h_a, d_a, DSIZE*sizeof(float), cudaMemcpyDeviceToHost);
  for (int i = 0; i < DSIZE; i++)
    std::cout << h_a[i] << std::endl;
  return 0;
}
[user2@dc20 misc]$ vi t625.cu
[user2@dc20 misc]$ nvcc -arch=sm_20 -o t625 t625.cu -lcurand
[user2@dc20 misc]$ cuda-memcheck ./t625
========= CUDA-MEMCHECK
========= Invalid __global__ write of size 8
=========     at 0x000003e8 in void gen_sequenced<curandStateXORWOW, float2, normal_args_st, __operator_&__(float2 curand_normal_scaled2<curandStateXORWOW>(curandStateXORWOW*, normal_args_st))>(curandStateXORWOW*, float2*, unsigned long, unsigned long, normal_args_st)
=========     by thread (0,0,0) in block (0,0,0)
=========     Address 0x13047c0004 is misaligned
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame:/usr/lib64/libcuda.so.1 (cuLaunchKernel + 0x2c5) [0x14ad95]
=========     Host Frame:/usr/local/cuda/lib64/libcurand.so.6.5 [0x726d8]
=========     Host Frame:/usr/local/cuda/lib64/libcurand.so.6.5 [0x9b923]
=========     Host Frame:/usr/local/cuda/lib64/libcurand.so.6.5 [0xfc95]
=========     Host Frame:/usr/local/cuda/lib64/libcurand.so.6.5 (curandGenerateNormal + 0x1ee7) [0x3b987]
=========     Host Frame:./t625 [0x27a2]
=========     Host Frame:/lib64/libc.so.6 (__libc_start_main + 0xfd) [0x1ecdd]
=========     Host Frame:./t625 [0x2639]
=========
========= Program hit cudaErrorLaunchFailure (error 4) due to "unspecified launch failure" on CUDA API call to cudaMemcpy.
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame:/usr/lib64/libcuda.so.1 [0x2ef613]
=========     Host Frame:./t625 [0x37fdf]
=========     Host Frame:./t625 [0x27c2]
=========     Host Frame:/lib64/libc.so.6 (__libc_start_main + 0xfd) [0x1ecdd]
=========     Host Frame:./t625 [0x2639]
=========
1.14162e-37
0
7.40782e-38
0
========= ERROR SUMMARY: 2 errors
$

(I am working on Linux, but I wouldn't expect any difference between Windows and Linux here.)

The above program generates basically the same error. Therefore we can conclude that thrust is not necessary to see the problem. Taking a closer look at the cuda-memcheck output, we see:

========= Invalid __global__ write of size 8
=========     at 0x000003e8 in void gen_sequenced<curandStateXORWOW, float2, normal_args_st, __operator_&__(float2 curand_normal_scaled2<curandStateXORWOW>(curandStateXORWOW*, normal_args_st))>(curandStateXORWOW*, float2*, unsigned long, unsigned long, normal_args_st)
=========     by thread (0,0,0) in block (0,0,0)
=========     Address 0x13047c0004 is misaligned

gen_sequenced is a kernel launched from within the host API function curandGenerateNormal. It is attempting to write an 8-byte quantity, which must (by CUDA requirement) be on an 8-byte aligned boundary. As you've already indicated, the pointer being passed in the failing case is 4-byte aligned but not 8-byte aligned. Furthermore, we see that this kernel is using a float2 quantity under the hood. This is presumably an optimization, made possible because the quantity n is known to be even. A float2 quantity can only be accessed on an 8-byte boundary.
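The misalignment reported by cuda-memcheck can be confirmed with simple address arithmetic. The following is a minimal host C++ sketch; the function name misalignment is made up for illustration, and the address is the one from the report above.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper: number of bytes by which addr lies past the previous
// `alignment`-byte boundary (0 means aligned).
// `alignment` is assumed to be a power of two.
constexpr std::uint64_t misalignment(std::uint64_t addr, std::uint64_t alignment) {
    return addr & (alignment - 1);
}

// The address from the cuda-memcheck report.
constexpr std::uint64_t reported = 0x13047c0004ULL;

// A single 4-byte float store at this address would be acceptable...
static_assert(misalignment(reported, sizeof(float)) == 0,
              "aligned for a 4-byte float store");
// ...but an 8-byte float2 store is 4 bytes off its required boundary.
static_assert(misalignment(reported, 2 * sizeof(float)) == 4,
              "misaligned for an 8-byte float2 store");
```

Both assertions hold at compile time, which matches the report: the address is a valid float location but not a valid float2 location.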

The conclusion, although it doesn't seem to be explicitly documented, seems to be that for the cases covered by this statement:

Normally distributed results are generated from pseudorandom generators with a Box-Muller transform, and so require n to be even.

the pointer passed must be aligned to twice the size of the fundamental data type. I will file a notice with NVIDIA to request that the documentation be updated.

Regarding error reporting, the actual error (the misaligned write inside the CUDA kernel) is detected asynchronously, and is not reported at the time of the kernel launch (the gen_sequenced kernel, in this case). It is reported at some later point, when the CUDA error status is next checked. This may explain why the curand function itself returns CURAND_STATUS_SUCCESS. Thrust has runtime error handling built in, so a previously occurring CUDA error of this type will be "caught" by Thrust and reported, even though (as in this case) it may have nothing to do with Thrust per se.

Answer 1: accepted, 2014-12-27 19:17:05
