How to understand the result of SASS analysis in CUDA/GPU

I used cuobjdump, one of the CUDA binary utilities, to dump the SASS code; a sample of the output is shown below. This code loads from global memory.

/*0028*/ IMAD R6.CC, R3, R5, c[0x0][0x20]; 
/*0030*/ IMAD.HI.X R7, R3, R5, c[0x0][0x24]; 
/*0040*/ LD.E R2, [R6]; //load
  1. Where can I get the full manual for SASS that explains the meaning of each instruction? The "CUDA Binary Utilities" documentation only gives a general description of each instruction; e.g., it doesn't explain the meaning of "R6.CC", "IMAD.HI.X", or "LD.E".

  2. What is the meaning of the second instruction? I guess that the first one computes the memory address that each thread should load from, while the third one loads from global memory into a register. I have no idea what the second instruction does.

  3. I guess that CUDA saves some parameter information, such as grid size, block size, and array base addresses, in constant memory. In this case, c[0x0][0x20] is the base address of an array. My question is: how can I get that information?

  1. Where can I get the full manual for SASS that explains the meaning of each instruction?

As far as I know, there is no such thing. SASS is mostly undocumented (there's only a basic reference), as it varies between architectures. However, PTX is thoroughly documented, and many SASS instructions have a close PTX equivalent from which you can extrapolate the meaning. You may also want to dump the SASS with source information to better understand what is going on.

But given these two documents (the basic SASS reference and the PTX documentation), you can more or less translate the SASS back to PTX and guess the meaning of the instructions:

/*0028*/ IMAD R6.CC, R3, R5, c[0x0][0x20];

Extended-precision integer multiply-add: multiply R3 by R5, add the low 32 bits of the product to the constant in bank 0 at offset 0x20, and store the result in R6, recording the carry-out (.CC).

/*0030*/ IMAD.HI.X R7, R3, R5, c[0x0][0x24];

Integer multiply-add with extract: multiply R3 by R5, extract the upper 32 bits of the product (.HI), add them to the constant in bank 0 at offset 0x24 plus the carry-in (.X) left by the previous instruction, and store the result in R7.

/*0040*/ LD.E R2, [R6]; //load

Load: load into R2 the value pointed to by the 64-bit address held in the register pair R7:R6 (.E marks extended, i.e. 64-bit, addressing).

As @njuffa explains in the comment below:

The entire computation multiplies R3 by R5, adds the 64-bit product to the 64-bit constant in c[0x0][0x24]:c[0x0][0x20], and uses the resulting 64-bit address to load R2.
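To make the carry chain concrete, here is a minimal host-side C++ sketch that emulates the two IMAD instructions and checks that the register pair R7:R6 equals a single 64-bit multiply-add. The function names (`imad_cc`, `imad_hi_x`, `address_from_pair`, `address_direct`) are hypothetical, and the sketch assumes that plain IMAD treats its operands as signed 32-bit integers; it is a model of the arithmetic, not of the hardware.

```cpp
#include <cassert>
#include <cstdint>

// Models "IMAD R6.CC, R3, R5, c[0x0][0x20]":
// low 32 bits of a*b plus addend; the carry-out is returned through `carry`.
std::uint32_t imad_cc(std::int32_t a, std::int32_t b, std::uint32_t addend, bool& carry) {
    // ASSUMPTION: operands are signed 32-bit, so the full product fits in 64 bits.
    std::uint64_t product = static_cast<std::uint64_t>(static_cast<std::int64_t>(a) * b);
    std::uint32_t prod_lo = static_cast<std::uint32_t>(product);
    std::uint32_t sum = prod_lo + addend;   // wraps modulo 2^32
    carry = sum < prod_lo;                  // wrapped around => carry-out
    return sum;
}

// Models "IMAD.HI.X R7, R3, R5, c[0x0][0x24]":
// upper 32 bits of a*b plus addend plus the carry-in from imad_cc.
std::uint32_t imad_hi_x(std::int32_t a, std::int32_t b, std::uint32_t addend, bool carry) {
    std::uint64_t product = static_cast<std::uint64_t>(static_cast<std::int64_t>(a) * b);
    std::uint32_t prod_hi = static_cast<std::uint32_t>(product >> 32);
    return prod_hi + addend + (carry ? 1u : 0u);
}

// Builds the 64-bit address the way the SASS does: two 32-bit halves.
std::uint64_t address_from_pair(std::int32_t r3, std::int32_t r5, std::uint64_t base) {
    bool carry = false;
    std::uint32_t r6 = imad_cc(r3, r5, static_cast<std::uint32_t>(base), carry);
    std::uint32_t r7 = imad_hi_x(r3, r5, static_cast<std::uint32_t>(base >> 32), carry);
    return (static_cast<std::uint64_t>(r7) << 32) | r6;
}

// The same computation as one 64-bit multiply-add.
std::uint64_t address_direct(std::int32_t r3, std::int32_t r5, std::uint64_t base) {
    return base + static_cast<std::uint64_t>(static_cast<std::int64_t>(r3) * r5);
}

// Small self-check: the low-half addition overflows here, so the carry matters.
void self_check() {
    const std::uint64_t base = 0x00000002FFFFFFF0ull;  // made-up base address
    assert(address_from_pair(8, 4, base) == address_direct(8, 4, base));
}
```

With the made-up base address 0x2FFFFFFF0, index 8, and element size 4, the low-half addition overflows, so dropping the carry would leave the upper word one short. That is why the second instruction carries the .X suffix.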

  3. I guess that CUDA saves some parameter information, such as grid size, block size, and array base addresses, in constant memory. [...] My question is: how can I get that information?

Where the built-in variables (threadIdx, blockIdx, blockDim, gridDim, etc.) reside is unspecified and may vary between architectures. In practice, some of them are in special-purpose registers and others are in shared memory, but that is an implementation detail.
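As for the kernel arguments themselves, the listing in the question suggests one way to read the constant-bank offsets. The following C++ sketch assumes that arguments are laid out like a naturally aligned struct starting at offset 0x20 of bank 0. That base offset is inferred only from this particular dump and differs on other architectures, and the kernel signature `kernel(const float* data, int n)` along with the names `KernelParams`, `kParamBase`, and `const_bank_offset` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// ASSUMPTION: kernel arguments begin at this offset in constant bank 0.
// Inferred from the listing above; not a documented guarantee.
constexpr std::size_t kParamBase = 0x20;

// Hypothetical argument block for: __global__ void kernel(const float* data, int n)
struct KernelParams {
    std::uint64_t data;  // 64-bit device pointer, stored as an integer on the host
    std::int32_t  n;     // element count
};

// Maps an offset inside the argument block to an offset inside c[0x0][...].
constexpr std::size_t const_bank_offset(std::size_t offset_in_params) {
    return kParamBase + offset_in_params;
}

// Offsets of the two 32-bit halves of the pointer argument.
constexpr std::size_t kDataLo = const_bank_offset(offsetof(KernelParams, data));
constexpr std::size_t kDataHi = kDataLo + 4;

// Under these assumptions the pointer occupies c[0x0][0x20] (low word)
// and c[0x0][0x24] (high word), matching the two IMAD addends.
void check_layout() {
    assert(kDataLo == 0x20);
    assert(kDataHi == 0x24);
}
```

Under these assumptions, the next argument `n` would sit at c[0x0][0x28]. Comparing the offsets in a dump against the kernel's parameter list in this way is a guess that should be confirmed for each architecture.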

Note: Edited to integrate @njuffa's comment.
