
convert uint8_t* to uint64_t

What's the more recommended or advisable way to convert an array of uint8_t at offset i to uint64_t, and why?

uint8_t * bytes = ...
uint64_t const v = ((uint64_t *)(bytes + i))[0];

or

uint64_t const v = ((uint64_t)(bytes[i+7]) << 56)
                 | ((uint64_t)(bytes[i+6]) << 48)
                 | ((uint64_t)(bytes[i+5]) << 40)
                 | ((uint64_t)(bytes[i+4]) << 32)
                 | ((uint64_t)(bytes[i+3]) << 24)
                 | ((uint64_t)(bytes[i+2]) << 16)
                 | ((uint64_t)(bytes[i+1]) << 8)
                 | ((uint64_t)(bytes[i]));

There are two primary differences.

One, the behavior of ((uint64_t *)(bytes + i))[0] is not defined by the C standard (unless certain prerequisites about what bytes points to are met). Generally, an array of bytes should not be accessed through a uint64_t type.

When memory defined as one type is accessed with another type, it is called aliasing, and the C standard only defines certain combinations of aliasing. Some compilers may support some aliasing beyond what the standard requires, but using it is not portable. Additionally, if bytes + i is not suitably aligned for a uint64_t , the access may cause an exception or otherwise malfunction.
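For illustration, here is a minimal sketch (the helper name is_aligned_for_u64 is hypothetical, not from the question) of how one could test whether bytes + i happens to be suitably aligned, using C11's _Alignof. Note that a passing check still does not make the aliasing itself defined:

#include <stdint.h>

/* Sketch only: converting a pointer to uintptr_t and taking a modulus
   is implementation-defined, though it works on common flat-memory
   platforms. Alignment alone does not legitimize the aliased access. */
int is_aligned_for_u64(const uint8_t *p)
{
    return (uintptr_t)p % _Alignof(uint64_t) == 0;
}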

Two, loading the bytes through aliasing, if it is defined (by the standard or by a compiler extension), interprets the bytes using the byte ordering (endianness) of the C implementation. Some C implementations store the bytes representing integers in memory from low address to high address for low-position-value bytes to high-position-value bytes, and some store them from high address to low address. (They can be stored in non-consecutive orders too, although this is rare.) So loading the bytes this way will produce different values from the same bytes in memory depending on which order the C implementation uses.
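To make that concrete, here is a small self-contained probe (a sketch, not part of either answer's code) that stores a known value and inspects its lowest-addressed byte to report which order the implementation uses:

#include <stdint.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    uint32_t x = 0x01020304;
    uint8_t first;
    memcpy(&first, &x, 1); /* read the lowest-addressed byte */
    if (first == 0x04)
        puts("little endian: low-value byte at low address");
    else if (first == 0x01)
        puts("big endian: high-value byte at low address");
    else
        puts("some other byte order");
    return 0;
}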

But loading the bytes and using shifts to combine them will always produce the same value from the same bytes in memory regardless of what order the C implementation uses.

The first method should be avoided, because there is no need for it. If one desires to interpret the bytes using the C implementation's ordering, this can be done with:

#include <string.h> /* for memcpy */

uint64_t t;
memcpy(&t, bytes+i, sizeof t);
const uint64_t v = t;

Using memcpy provides a portable way of aliasing the uint64_t to store bytes into it. Good compilers recognize this idiom and will optimize the memcpy to a load from memory, if suitable for the target architecture (and if optimization is enabled).
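Wrapped up as a self-contained helper - the name load_u64_native is just illustrative, not from the answer above - the idiom looks like this:

#include <stdint.h>
#include <string.h>

/* Loads 8 bytes in whatever byte order the implementation uses.
   memcpy imposes no alignment requirement on its source, so this is
   well-defined even when bytes + i is misaligned. */
uint64_t load_u64_native(const uint8_t *bytes, size_t i)
{
    uint64_t t;
    memcpy(&t, bytes + i, sizeof t);
    return t;
}

With optimization enabled, compilers such as GCC and Clang typically compile this to a single load instruction on targets that permit unaligned access.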

If one desires to interpret the bytes using little-endian ordering, as shown in the code in the question, then the second method may be used. (Sometimes platforms will have routines that may provide more efficient code for this.)
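For example, on glibc-based systems (an assumption: these functions are nonstandard extensions, not ISO C), <endian.h> provides le64toh, which can be combined with the memcpy idiom:

#include <stdint.h>
#include <string.h>
#include <endian.h> /* nonstandard: glibc; BSDs have <sys/endian.h> */

uint64_t load_u64_le(const uint8_t *bytes, size_t i)
{
    uint64_t t;
    memcpy(&t, bytes + i, sizeof t); /* native-order load */
    return le64toh(t);               /* little-endian to host order */
}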

You can also use memcpy:

uint64_t n;
memcpy(&n, bytes + i, sizeof(uint64_t));
const uint64_t v = n;

The first option has two big problems that qualify as undefined behavior (anything can happen):

  • A uint8_t* or array of uint8_t is not necessarily aligned the way a larger type like uint64_t requires. Simply casting to uint64_t* can therefore produce a misaligned access. This can cause hardware exceptions, program crashes, slower code etc., all depending on the alignment requirements of the specific target.

  • It violates the internal type system of C, where each object in memory known by the compiler has an "effective type" that the compiler keeps track of. Based on this, the compiler is allowed to make certain assumptions during optimization about whether a certain memory region has been accessed or not. If your code violates these type rules, as it would in this case, incorrect machine code can get generated.

    This is most commonly referred to as the strict aliasing rule, and your cast followed by a dereference would be a so-called "strict aliasing violation". (A commonly used C-only alternative is sketched below.)
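That alternative is type punning through a union: in C (unlike C++), reading a union member other than the one last written reinterprets the stored bytes, which the C standard permits. A minimal sketch, with the hypothetical name load_u64_union:

#include <stdint.h>
#include <stddef.h>

/* C-only: reading pun.u after writing pun.b reinterprets the bytes
   (C11 6.5.2.3). In C++ this would be undefined behavior. */
uint64_t load_u64_union(const uint8_t *bytes, size_t i)
{
    union {
        uint8_t  b[sizeof(uint64_t)];
        uint64_t u;
    } pun;
    for (size_t k = 0; k < sizeof pun.b; k++)
        pun.b[k] = bytes[i + k];
    return pun.u; /* native byte order, like the memcpy version */
}

Like memcpy, this reads the bytes in the implementation's native order, so it does not solve the endianness question by itself.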

The second option is sound code, because:

  • When doing shifts or other forms of bitwise arithmetic, a large integer type should be used - that is, unsigned int or larger, depending on the system. Using signed types or small integer types can lead to undefined behavior or unexpected results. See Implicit type promotion rules regarding problems with small integer types implicitly changing signedness in some expressions.

    If not for the cast to uint64_t, the bytes[i+7] << 56 shift would involve an implicit promotion of the left operand from uint8_t to int, which would be a bug: on a system with 32-bit int, shifting by 56 exceeds the width of the promoted type, and shifting a set most significant bit (MSB) into or beyond the sign bit of a signed type is likewise undefined behavior - again, anything can happen.

    And naturally we need to use a 64 bit type in this specific case, or otherwise we wouldn't be able to shift as far as 56 bits. Shifting by the width of the left operand's type or more is also undefined behavior. (A helper packaging these casts is sketched after this list.)
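Packaged as a function - the name read_le64 is illustrative - only the uint64_t accumulator is ever shifted, so the promotion pitfall described above cannot arise:

#include <stdint.h>
#include <stddef.h>

/* Reads 8 bytes stored in little-endian order, independent of the
   host's own byte order. Equivalent to the shift expression in the
   question. */
uint64_t read_le64(const uint8_t *bytes, size_t i)
{
    uint64_t v = 0;
    for (int k = 7; k >= 0; k--)
        v = (v << 8) | (uint64_t)bytes[i + k]; /* cast before use */
    return v;
}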

Note that whether to pick bytes[i+7] << 56 or the alternative bytes[i+0] << 56 depends on the byte order (endianness) in which the source data was stored. Bit shifts are nice since the shift itself works the same regardless of whether the host CPU is big or little endian. But you must know in advance which byte in the source array should correspond to the most significant one. The code you have here works if the array was built in little-endian format, since the last byte of the array is shifted into the most significant position.
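If the array had instead been built in big-endian (network) order, the mirror image would be used - a hypothetical read_be64 where bytes[i] becomes the most significant byte:

#include <stdint.h>
#include <stddef.h>

/* Reads 8 bytes stored in big-endian order: bytes[i] is the most
   significant byte. */
uint64_t read_be64(const uint8_t *bytes, size_t i)
{
    uint64_t v = 0;
    for (size_t k = 0; k < 8; k++)
        v = (v << 8) | (uint64_t)bytes[i + k];
    return v;
}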

As for uint64_t const v =, the const qualifier is a bit unusual at local scope like that. It's harmless but doesn't really add anything of value inside a local scope, so I would just drop it.
