
How many 8 bit operations can be performed on 32 bit ALU of a GPU in one cycle if the IPC is 1?

Can it perform four 8-bit operations per cycle (as SIMD operations), or just one? Conventionally, an 8-bit value is zero-extended and treated as a 32-bit word to perform such an operation. Is there any hardware feature in current processors (especially NVIDIA GPUs) that allows more narrow-width operations to be performed per cycle?

AFAIK there aren't any arithmetic instructions on a GPU that "can be performed on 32 bit ALU of a GPU in one cycle". Most arithmetic functional units on a GPU are pipelined, resulting in latencies of around 5-25 clock cycles. A unit can have a new operation issued to it every clock, and it can retire an operation every clock, but it cannot perform an operation "in one cycle".
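To make the latency versus throughput distinction concrete, here is a minimal sketch of an idealized model. It assumes a fully pipelined unit, independent operations, exactly one issue per clock, and no stalls. The function name `cycles_to_finish` is hypothetical and only for illustration.

```cpp
#include <cstdint>

// Idealized pipeline model (assumption: one issue per clock, no stalls,
// all operations independent). The first result appears after `latency`
// cycles; each further operation retires one cycle later.
std::uint64_t cycles_to_finish(std::uint64_t latency, std::uint64_t num_ops) {
    if (num_ops == 0) {
        return 0;  // nothing issued, nothing to wait for
    }
    return latency + num_ops - 1;
}
```

Under this model, a single operation with a latency of 10 cycles takes 10 cycles, but 1000 independent operations take 1009 cycles in total, so the sustained rate approaches one operation per clock even though no individual operation completes "in one cycle".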

The GPU has SIMD vector intrinsics, some of which are similar to what you are describing. Their throughput varies by GPU type as well as by operation type.

So, for example, on Kepler the throughput of the vabsdiff4 SIMD intrinsic (which does four 8-bit arithmetic operations on a 4-byte vector packed into a 32-bit word) should be approximately the same as that of a 32-bit integer operation (add, subtract, etc.). Most other SIMD intrinsics will have lower throughput.
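To illustrate what "four 8-bit operations on a 32-bit word" means, here is a host-side reference sketch of a per-byte absolute difference, assuming unsigned lanes. It is plain C++ that spells out the semantics lane by lane; it is not the GPU implementation and says nothing about throughput. The names `pack_u8x4` and `packed_absdiff_u8x4` are hypothetical.

```cpp
#include <cstdint>

// Pack four 8-bit values into one 32-bit word, with b0 in the lowest byte.
std::uint32_t pack_u8x4(std::uint8_t b0, std::uint8_t b1,
                        std::uint8_t b2, std::uint8_t b3) {
    return static_cast<std::uint32_t>(b0) |
           (static_cast<std::uint32_t>(b1) << 8) |
           (static_cast<std::uint32_t>(b2) << 16) |
           (static_cast<std::uint32_t>(b3) << 24);
}

// Reference semantics: |a_i - b_i| computed independently for each of the
// four unsigned 8-bit lanes, with the results packed back into one word.
std::uint32_t packed_absdiff_u8x4(std::uint32_t a, std::uint32_t b) {
    std::uint32_t result = 0;
    for (int lane = 0; lane < 4; ++lane) {
        const int shift = 8 * lane;
        const std::uint32_t x = (a >> shift) & 0xFFu;  // extract lane of a
        const std::uint32_t y = (b >> shift) & 0xFFu;  // extract lane of b
        const std::uint32_t d = (x > y) ? (x - y) : (y - x);
        result |= d << shift;  // d always fits in 8 bits, so lanes never mix
    }
    return result;
}
```

For example, with lanes (10, 200, 0, 255) and (20, 100, 0, 0), the per-lane results are (10, 100, 0, 255). A hardware SIMD instruction produces all four lanes from a single instruction, whereas the zero-extension approach described in the question needs a separate 32-bit operation per 8-bit value.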

