3 questions about alignment

Question

The discussion is restricted to compute capability 2.x.

Question 1

The size of a curandState is 48 bytes (measured by sizeof()). When an array of curandState elements is allocated, is each element padded somehow (for example, to 64 bytes)? Or are they simply placed contiguously in memory?

Question 2

The OP of "Passing structs to CUDA kernels" states that "the align part was unnecessary". But without alignment, an access to that structure will be split into two consecutive accesses, one to a and one to b. Right?

Question 3

struct Position
{
    double x, y, z;
};

Suppose each thread is accessing the structure above:

int globalThreadID = blockIdx.x * blockDim.x + threadIdx.x;
Position positionRegister = positionGlobal[globalThreadID];

To optimize memory access, should I simply use three separate double variables x, y, z to replace the structure?

Thanks for your time!

Answer 1

(1) They are placed contiguously in memory.
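For point (1), here is a minimal host-side sketch of what "contiguously" means for a 48-byte element. State48 is a hypothetical stand-in with the same size as the curandState measured in the question; it is not the real curandState layout, and byte_offset and is_multiple_of are illustrative helper names. The code is plain C++ and should also compile with nvcc.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical 48-byte stand-in for the element type (assumption: this is
// NOT the real curandState layout, only a struct of the same size).
struct State48 {
    double payload[6];  // 6 * 8 = 48 bytes, 8-byte alignment
};
static_assert(sizeof(State48) == 48, "stand-in must be 48 bytes");

// Byte distance between element i and element 0 of an array.
// Array elements are laid out back to back, so this is i * sizeof(State48).
std::size_t byte_offset(const State48* base, std::size_t i) {
    return static_cast<std::size_t>(
        reinterpret_cast<const char*>(base + i) -
        reinterpret_cast<const char*>(base));
}

// True if byte offset `off` is a multiple of `alignment`.
bool is_multiple_of(std::size_t off, std::size_t alignment) {
    assert(alignment != 0);
    return off % alignment == 0;
}
```

With State48 arr[4], the elements start at byte offsets 0, 48, 96 and 144, so nothing is padded out to 64 bytes; only every fourth element (offsets 0, 192, ...) happens to start on a multiple of 64.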

(2) If the array is in global memory, each memory transaction is 128 bytes, aligned to 128 bytes. You get two transactions only if a and b happen to span a 128-byte boundary.
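For point (2), the boundary condition can be checked with a little arithmetic. The sketch below is a simplified host-side model, assuming the 128-byte aligned transactions described above; segments_touched, PairNatural and PairAligned8 are hypothetical names chosen for illustration, and the two-member structs only stand in for the a and b of the linked question.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kSegmentBytes = 128;  // transaction size quoted in the answer

// Number of aligned 128-byte segments overlapped by the byte range
// [addr, addr + size). One segment corresponds to one transaction in this model.
std::size_t segments_touched(std::size_t addr, std::size_t size) {
    assert(size > 0);
    std::size_t first = addr / kSegmentBytes;
    std::size_t last = (addr + size - 1) / kSegmentBytes;
    return last - first + 1;
}

// Hypothetical two-member structs (illustrative only).
struct PairNatural { std::int32_t a; std::int32_t b; };             // 4-byte alignment
struct alignas(8) PairAligned8 { std::int32_t a; std::int32_t b; }; // forced 8-byte alignment

static_assert(sizeof(PairAligned8) == 8, "two 32-bit members, no padding expected");
static_assert(alignof(PairAligned8) == 8, "alignas(8) sets the alignment");
```

An 8-byte struct at byte 120 stays inside one segment, while the same struct at byte 124 straddles the boundary at 128 and touches two. An 8-byte-aligned struct can only start at multiples of 8, so it never lands at 124; the same holds for every element of an array of 8-byte structs as long as the array base itself is at least 8-byte aligned.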

(3) Performance can often be improved by using a struct of arrays instead of an array of structs. This just means that you pack all the x values together in one array, then all the y values, and so on. This makes sense when you look at what happens when all 32 threads in a warp reach the point where, for instance, x is needed. With all the values packed together, all the threads in the warp can be serviced with as few transactions as possible. Since a global memory transaction is 128 bytes, a single transaction can service all the threads if the value is a 32-bit word; for 64-bit values such as the doubles here, a warp needs 256 bytes, so at least two transactions. The code example you gave might cause the compiler to keep the values in registers until they are needed.
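For point (3), the difference between the two layouts can be estimated by counting how many distinct 128-byte segments one warp touches when every thread reads its own x. This is a rough host-side model under the assumptions stated above (32 threads per warp, 128-byte aligned transactions); it ignores caching, and warp_segments and PositionsSoA are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <set>

constexpr std::size_t kWarpSize = 32;
constexpr std::size_t kSegmentBytes = 128;

// Distinct aligned 128-byte segments touched when each of the 32 threads
// reads `elem_bytes` bytes starting at base + thread * stride_bytes.
std::size_t warp_segments(std::size_t base, std::size_t stride_bytes,
                          std::size_t elem_bytes) {
    assert(elem_bytes > 0);
    std::set<std::size_t> segments;
    for (std::size_t t = 0; t < kWarpSize; ++t) {
        std::size_t addr = base + t * stride_bytes;
        std::size_t first = addr / kSegmentBytes;
        std::size_t last = (addr + elem_bytes - 1) / kSegmentBytes;
        for (std::size_t s = first; s <= last; ++s) segments.insert(s);
    }
    return segments.size();
}

// Array-of-structs element from the question: consecutive x values are 24 bytes apart.
struct Position { double x, y, z; };
static_assert(sizeof(Position) == 24, "three doubles, no padding expected");

// Struct-of-arrays counterpart (hypothetical): consecutive x values are 8 bytes apart.
struct PositionsSoA { double* x; double* y; double* z; };
```

In this model, warp_segments(0, sizeof(Position), sizeof(double)) gives 6 segments for reading x from the array of structs, warp_segments(0, 8, 8) gives 2 for the struct of arrays, and warp_segments(0, 4, 4) gives 1 for 32-bit values, which is the single-transaction case mentioned above. Because caching is ignored, the array-of-structs figure is pessimistic when y and z are read right after x from lines that were already fetched.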

1 answer

solution1
1 ACCEPTED 2012-09-09 04:52:43