
Does int32_t have lower latency than int8_t, int16_t and int64_t?

(I'm referring to Intel CPUs, mainly with GCC, but possibly also ICC or MSVC.)

Is it true that using int8_t , int16_t or int64_t is less efficient than int32_t due to additional instructions generated to convert between the CPU word size and the chosen variable size?

I would be interested if anybody has examples or best practices for this. I sometimes use smaller variable sizes to reduce cache-line loads, but suppose I only consumed 50 bytes of a cache line, with one variable being an 8-bit int — might processing be quicker if I used the remaining cache-line space and promoted the 8-bit int to a 32-bit int?

You can fit more uint8_t values into a cache line, so loading N uint8_t values will be faster than loading N uint32_t values.
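The arithmetic behind that claim can be made concrete. This is a minimal sketch assuming the typical 64-byte x86 cache line (the constant and names are mine, not from the answer):

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: a 64-byte cache line, which is typical for modern x86.
   A single line then holds 64 uint8_t elements but only 16 uint32_t
   elements, so a scan over bytes touches 4x fewer lines. */
enum { CACHE_LINE_BYTES = 64 };

enum {
    U8_PER_LINE  = CACHE_LINE_BYTES / sizeof(uint8_t),   /* 64 lanes of data */
    U32_PER_LINE = CACHE_LINE_BYTES / sizeof(uint32_t),  /* 16 lanes of data */
};
```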

In addition, if you are using a modern Intel chip with SIMD instructions, a smart compiler will vectorize what it can. Again, using a small variable in your code will allow the compiler to stuff more lanes into a SIMD register.
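A loop like the following is the kind of thing a vectorizing compiler can exploit: with uint8_t elements a 128-bit SSE2 register processes 16 lanes per iteration, versus only 4 lanes for uint32_t. The function name is my own illustrative choice; auto-vectorization at -O2/-O3 is likely but not guaranteed:

```c
#include <stddef.h>
#include <stdint.h>

/* Element-wise addition of two byte arrays. GCC can typically
   auto-vectorize this, packing 16 uint8_t lanes per SSE register
   (or 32 per AVX2 register). Addition wraps modulo 256. */
void add_bytes(uint8_t *dst, const uint8_t *a, const uint8_t *b, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        dst[i] = (uint8_t)(a[i] + b[i]);
}
```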

I think it is best to use the smallest size you can and leave the details up to the compiler. The compiler is probably smarter than you (and me) when it comes to stuff like this. For many operations (say, unsigned addition), the compiler can use the same code for uint8_t , uint16_t or uint32_t (and just ignore the upper bits), so there is no speed difference.
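The "ignore the upper bits" point can be demonstrated directly: narrow unsigned addition is just full-width addition with the result truncated, so the same ADD instruction serves both widths (the function names below are mine):

```c
#include <stdint.h>

/* Narrow addition: the cast discards the upper bits of the sum. */
uint8_t add8(uint8_t a, uint8_t b)
{
    return (uint8_t)(a + b);
}

/* Full-width addition for comparison. Masking its result to the low
   8 bits yields exactly what add8 computes, which is why the compiler
   can emit the same ADD for either type. */
uint32_t add32(uint32_t a, uint32_t b)
{
    return a + b;
}
```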

The bottom line is that a cache miss is WAY more expensive than any arithmetic or logical operation, so it is nearly always better to worry about cache (and thus data size) than simple arithmetic.

(It used to be true, a long time ago, that on Sun workstations using double was significantly faster than float , because the hardware only supported double . I don't think that is true any more for modern x86, as the SIMD hardware (SSE, etc.) has direct support for both single and double precision.)

Mark Lakata's answer points in the right direction.
I would like to add some points.

A wonderful resource for understanding and making optimization decisions is Agner Fog's set of optimization documents.

The Instruction Tables document lists the latency of the most common instructions. You can see that some of them perform better in the native-size version: a mov , for example, may be eliminated, and a mul has lower latency.
However, here we are talking about gaining one clock cycle; we would have to execute a lot of instructions to compensate for a single cache miss.
If this were the whole story, it would not be worth it.

The real problem comes with the decoders.
When you use length-changing prefixes (and you will, by using a non-native word size) the decoder takes extra cycles.

The operand size prefix therefore changes the length of the rest of the instruction. The predecoders are unable to resolve this problem in a single clock cycle. It takes 6 clock cycles to recover from this error. It is therefore very important to avoid such length-changing prefixes.
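A classic way to trigger such a length-changing prefix from C is a 16-bit operation with an immediate that does not fit in 8 bits. This sketch assumes typical GCC x86-64 code generation (the function name is mine, and the compiler is free to pick another encoding):

```c
#include <stdint.h>

/* On x86-64, a 16-bit add with a 16-bit immediate may be encoded as
   `add $0x3e8,%ax`: a 66h operand-size prefix followed by an imm16.
   The prefix changes the length of the rest of the instruction, which
   is exactly the predecoder stall case Agner Fog describes. */
uint16_t bump(uint16_t x)
{
    return (uint16_t)(x + 1000); /* 1000 > 255, so imm8 cannot be used */
}
```

With a uint32_t parameter the same addition needs no operand-size prefix, which is one concrete reason the native size can decode faster.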

On older (but still present) microarchitectures the penalty was severe, especially with some kinds of arithmetic instructions.
On later microarchitectures it has been mitigated, but the penalty is still present.

Another aspect to consider is that using non-native sizes requires prefixing the instructions, thereby generating larger code. This is as close as it gets to the statement " additional instructions [are] generated to convert between the CPU word size and the chosen variable size ", since Intel CPUs can handle non-native word sizes directly.
With other CPUs, especially RISC ones, this is not generally true, and more instructions may be generated.

So while you are making optimal use of the data cache, you are also making poor use of the instruction cache .

It is also worth noting that on the common x64 ABI the stack must be aligned on a 16-byte boundary, and that the compiler usually saves local variables in the native word size or a close one (e.g. a DWORD on a 64-bit system).
Only if you are allocating a sufficient number of local variables, or if you are using arrays or packed structs, can you gain benefits from using a small variable size.
If you declare a single uint16_t variable, it will probably take the same stack space as a single uint64_t , so it is best to go for the fastest size.

Furthermore, when it comes to the data cache, it is locality that matters, rather than the data size alone.

So, what to do?

Luckily, you don't have to decide between having small data or small code.

If you have a considerable quantity of data, it is usually handled with arrays or pointers and through intermediate variables. An example is this line of code:

t = my_big_data[i];

Here my approach is:

  • Keep the external representation of the data, ie the my_big_data array, as small as possible. For example, if that array stores temperatures, use a coded uint8_t for each element.

  • Keep the internal representation of the data, ie the t variable, as close as possible to the CPU word size. For example, t could be a uint32_t or uint64_t .

This way your program optimizes both caches and uses the native word size.
As a bonus, you may later decide to switch to SIMD instructions without having to repack the my_big_data memory layout.
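The two rules combined can be sketched as follows. The encoding (temperatures stored offset by -50 so that 0..255 covers -50..205) and the function name are my own illustrative choices, not part of the answer:

```c
#include <stddef.h>
#include <stdint.h>

/* External representation: a compact uint8_t array of coded
   temperatures (illustrative encoding: stored value = °C + 50).
   Internal representation: native-width temporaries for the math. */
int32_t average_temperature(const uint8_t *coded, size_t n)
{
    int32_t sum = 0;                         /* native-size accumulator */
    for (size_t i = 0; i < n; ++i) {
        int32_t t = (int32_t)coded[i] - 50;  /* decode into a CPU word */
        sum += t;
    }
    return n ? sum / (int32_t)n : 0;
}
```

The array stays dense for the data cache, while every arithmetic operation runs at the native width.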


The real problem is that programmers have spent far too much time worrying about efficiency in the wrong places and at the wrong times; premature optimization is the root of all evil (or at least most of it) in programming.
D. Knuth

When you design your structures' memory layout, be problem-driven. For example, age values need 8 bits and city distances in miles need 16 bits.
When you code the algorithms, use the fastest type the compiler is known to have for that scope. For example, integers are faster than floating-point numbers, and uint_fast8_t is no slower than uint8_t .
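The uint_fast8_t family from <stdint.h> expresses exactly this "fastest type for the scope" intent. A small sketch (the function is my own example; the standard only guarantees that uint_fast8_t is at least 8 bits wide, letting the implementation pick whatever width is fastest):

```c
#include <stdint.h>

/* Population count over a 32-bit value. The counter only needs to
   reach 32, so uint_fast8_t asks for "at least 8 bits, as fast as
   possible" and lets the compiler choose the actual width. */
uint_fast8_t count_bits(uint32_t v)
{
    uint_fast8_t c = 0;
    while (v) {
        v &= v - 1; /* clear the lowest set bit */
        ++c;
    }
    return c;
}
```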

When it is then time to improve performance, start by changing the algorithm (using faster types, eliminating redundant operations, and so on) and then, if needed, the data structures (by aligning, padding, packing and so on).
