3 questions about alignment

The discussion is restricted to compute capability 2.x.

Question 1

The size of a curandState is 48 bytes (measured by sizeof()). When an array of curandStates is allocated, is each element somehow padded (for example, to 64 bytes)? Or are they just placed contiguously in memory?

Question 2

The OP of "Passing structs to CUDA kernels" states that "the align part was unnecessary". But without alignment, access to that structure will be divided into two consecutive accesses, one to a and one to b. Right?

Question 3

struct Position
{
    double x, y, z;
};

Suppose each thread is accessing the structure above:

int globalThreadID = blockIdx.x * blockDim.x + threadIdx.x;
Position positionRegister = positionGlobal[globalThreadID];

To optimize memory access, should I simply use three separate double variables x, y, z to replace the structure?

Thanks for your time!

(1) They are placed contiguously in memory.
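As a small host-side illustration of point (1): array elements in C++ sit back to back, so the stride between neighbouring elements equals sizeof of the element type, with no extra padding up to 64 bytes. The sketch below uses a hypothetical 48-byte stand-in called FakeState rather than the real curandState (whose members are internal to cuRAND); the helper names elementStride and checkFakeStateIsContiguous are also made up for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical 48-byte stand-in for curandState (the real type's
// members are internal to cuRAND and are not reproduced here).
struct FakeState
{
    std::uint32_t words[12]; // 12 * 4 bytes = 48 bytes
};

static_assert(sizeof(FakeState) == 48, "stand-in should be 48 bytes");

// Distance in bytes between element 0 and element 1 of an array of T.
template <typename T>
std::size_t elementStride()
{
    T items[2];
    const char* first = reinterpret_cast<const char*>(&items[0]);
    const char* second = reinterpret_cast<const char*>(&items[1]);
    return static_cast<std::size_t>(second - first);
}

// Example use: the stride of the stand-in equals its size, so no padding.
inline void checkFakeStateIsContiguous()
{
    assert(elementStride<FakeState>() == sizeof(FakeState));
}
```

The same check could be run on the real type by swapping FakeState for curandState in code that includes the cuRAND headers.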

(2) If the array is in global memory, each memory transaction is 128 bytes, aligned to 128 bytes. You get two transactions only if a and b happen to span a 128-byte boundary.
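To make the boundary condition in (2) concrete, here is a rough host-side model that counts how many 128-byte aligned segments a single access touches. The function name segmentsTouched and the constant kSegmentBytes are assumptions for illustration; the 128-byte figure is the one stated in this answer.

```cpp
#include <cassert>
#include <cstddef>

// Segment size taken from the answer: 128-byte transactions, 128-byte aligned.
constexpr std::size_t kSegmentBytes = 128;

// Number of 128-byte aligned segments covered by an access of `size` bytes
// starting at byte offset `offset` (size must be greater than zero).
inline std::size_t segmentsTouched(std::size_t offset, std::size_t size)
{
    assert(size > 0);
    const std::size_t firstSegment = offset / kSegmentBytes;
    const std::size_t lastSegment = (offset + size - 1) / kSegmentBytes;
    return lastSegment - firstSegment + 1;
}
```

For a hypothetical 8-byte struct, segmentsTouched(120, 8) is 1, while segmentsTouched(124, 8) is 2 because the access straddles the boundary at byte 128.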

(3) Performance can often be improved by using a struct of arrays instead of an array of structs. This just means that you pack all your x values together in one array, then all the y values, and so on. This makes sense when you look at what happens when all 32 threads in a warp get to the point where, for instance, x is needed. By having all the values packed together, all the threads in the warp can be serviced with as few transactions as possible. Since a global memory transaction is 128 bytes, a single transaction can service all the threads if the value is a 32-bit word. For 64-bit doubles, as in your Position struct, a warp needs 32 × 8 = 256 bytes per component, so at least two transactions. The code example you gave might cause the compiler to keep the values in registers until they are needed.
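The difference between the two layouts in (3) can be estimated with a small host-side model. It counts the distinct 128-byte segments a 32-thread warp touches when every thread loads its own x. The names PositionAoS, PositionsSoA and segmentsForWarpLoad are hypothetical, and the model assumes the array starts on a 128-byte boundary and ignores caching, so treat it as a sketch rather than a description of what the hardware actually does.

```cpp
#include <cassert>
#include <cstddef>
#include <set>

// Array-of-structs element, as in the question (three doubles).
struct PositionAoS
{
    double x, y, z;
};

// Struct-of-arrays alternative: one array per component.
struct PositionsSoA
{
    double* x;
    double* y;
    double* z;
};

constexpr std::size_t kWarpSize = 32;
constexpr std::size_t kTransactionBytes = 128;

// Count distinct 128-byte segments touched when each of the 32 threads in a
// warp loads `loadBytes` bytes at byte offset threadIndex * strideBytes.
// Assumes the array starts on a 128-byte boundary.
inline std::size_t segmentsForWarpLoad(std::size_t strideBytes, std::size_t loadBytes)
{
    assert(loadBytes > 0);
    std::set<std::size_t> segments;
    for (std::size_t thread = 0; thread < kWarpSize; ++thread)
    {
        const std::size_t begin = thread * strideBytes;
        const std::size_t end = begin + loadBytes - 1;
        for (std::size_t s = begin / kTransactionBytes; s <= end / kTransactionBytes; ++s)
        {
            segments.insert(s);
        }
    }
    return segments.size();
}
```

With a 24-byte PositionAoS, segmentsForWarpLoad(sizeof(PositionAoS), sizeof(double)) gives 6 segments for the load of x, while the packed layout, segmentsForWarpLoad(sizeof(double), sizeof(double)), gives 2. The array-of-structs warp does eventually use the remaining bytes of those six segments when it reads y and z, so how much the layout matters in practice depends on caching and on which components each kernel needs.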
