
When using 3D cuda Memory is it better to pass the associated cudaPitchedPtr or just the raw pointer in the cudaPitchedPtr struct?

The example in the nvidia programming guide shows them passing the pitchedPtr to their kernel:

__global__ void MyKernel(cudaPitchedPtr devPitchedPtr, int width, int height, int depth)

But instead of that why not just allocate in the same manner, but then call like:

__global__ void MyKernel(float* devPtr, int pitch, int width, int height, int depth)

and then access the elements however you like. I would prefer the latter implementation, but why does the programming guide give the other example (which seems like a bad example anyway: it illustrates how to access the elements, but it also illustrates a design pattern, a single thread looping over the entire volume, that should not be used in CUDA)?

Edit: I meant to say that the float* devPtr is the ptr (void* ptr) member of the cudaPitchedPtr.
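For concreteness, here is a minimal sketch of the latter variant I have in mind (the grid/block mapping and the doubling write are purely illustrative, and pitch is declared as size_t because that is what cudaMalloc3D reports):

// Raw-pointer variant: the caller extracts .ptr and .pitch from the
// cudaPitchedPtr returned by cudaMalloc3D and passes them directly.
__global__ void MyKernel(float* devPtr, size_t pitch, int width, int height, int depth)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    int z = blockIdx.z * blockDim.z + threadIdx.z;
    if (x >= width || y >= height || z >= depth) return;

    // pitch is in bytes, so row addresses are computed on a char* and
    // then cast back to float*.
    char*  slice = (char*)devPtr + (size_t)z * pitch * height;  // slice stride = pitch * height
    float* row   = (float*)(slice + (size_t)y * pitch);
    row[x] = row[x] * 2.0f;                                     // illustrative read-modify-write
}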

I assume you're talking about cudaMalloc3D.

From the CUDA reference regarding cudaMalloc3D:

Allocates at least width * height * depth bytes of linear memory on the device and returns a cudaPitchedPtr in which ptr is a pointer to the allocated memory. The function may pad the allocation to ensure hardware alignment requirements are met.

So

cudaMalloc3D(&pitchedDevPtr, make_cudaExtent(w, h, d));

does essentially the same as:

cudaMalloc(&devPtr, w * h * d);

There is no real difference from a call to cudaMalloc, but if you like it you get some convenience: you don't have to calculate the size of your array yourself, you just pass a cudaExtent struct to the function. Of course the allocation is specified in bytes; the cudaExtent structure carries no information about the size of your data type.
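As a host-side sketch of that convenience (assuming a w x h x d volume of float, names illustrative, error checking omitted), note that the width of the extent is given in bytes:

#include <cuda_runtime.h>

int main()
{
    const int w = 64, h = 64, d = 64;

    // Width is in BYTES, height and depth are in rows/slices, because
    // cudaExtent knows nothing about the element type.
    cudaExtent extent = make_cudaExtent(w * sizeof(float), h, d);

    cudaPitchedPtr pitchedDevPtr;
    cudaMalloc3D(&pitchedDevPtr, extent);  // each row may be padded to pitchedDevPtr.pitch bytes

    // ... use pitchedDevPtr.ptr and pitchedDevPtr.pitch in kernel launches ...

    cudaFree(pitchedDevPtr.ptr);
    return 0;
}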

Whether you pass your plain pointer or your cudaPitchedPtr to the kernel is a design decision. The cudaPitchedPtr delivers not only the devPtr to your kernel, it also carries the pitch and the sizes of the dimensions. To save memory (and thus registers) it stores only the sizes in the x and y directions (xsize and ysize); the z extent is not part of the struct and has to be passed to the kernel separately.
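For comparison, a sketch in the style of the programming guide's kernel, unpacking the struct on the device (the serial triple loop is kept only to show the addressing, and depth still arrives as a separate parameter):

__global__ void MyKernel(cudaPitchedPtr devPitchedPtr, int width, int height, int depth)
{
    char*  devPtr     = (char*)devPitchedPtr.ptr;
    size_t pitch      = devPitchedPtr.pitch;   // row stride in bytes, including padding
    size_t slicePitch = pitch * height;        // one z-slice in bytes

    for (int z = 0; z < depth; ++z) {
        char* slice = devPtr + z * slicePitch;
        for (int y = 0; y < height; ++y) {
            float* row = (float*)(slice + y * pitch);
            for (int x = 0; x < width; ++x) {
                float element = row[x];        // element (x, y, z)
                (void)element;                 // suppress unused-variable warning
            }
        }
    }
}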

EDIT: As pointed out, cudaMalloc3D adds padding to ensure coalesced memory access. But since Compute Capability 1.2 a memory access can be coalesced even if the starting address is not properly aligned. On devices with CC >= 1.2 there is no difference between those two allocations regarding performance.

Either method is equally valid - it is purely an aesthetic decision on your part.

It is not even clear to me why cudaPitchedPtr has extra members - the only ones that really matter are the base pointer and the pitch.
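A short host-side sketch of that view, assuming the pitchedDevPtr, w, h and d from the allocation snippet above and the raw-pointer MyKernel from the question:

// Only the base pointer and the pitch are needed to launch the raw-pointer
// kernel; xsize and ysize are simply ignored here.
float* devPtr = (float*)pitchedDevPtr.ptr;
size_t pitch  = pitchedDevPtr.pitch;

dim3 block(8, 8, 4);
dim3 grid((w + block.x - 1) / block.x,
          (h + block.y - 1) / block.y,
          (d + block.z - 1) / block.z);
MyKernel<<<grid, block>>>(devPtr, pitch, w, h, d);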
