
Cast from double to size_t yields wrong result?

The following code works. My question is: shouldn't 2) produce a result very close to 1)? Why does the cast in 2) yield such a small value? Perhaps worth noting: 2) is exactly half of 1):

std::cout << "1)  " << std::pow(2, 8 * sizeof(size_t)) << std::endl;
std::cout << "2)  " << static_cast<size_t>(std::pow(2, 8 * sizeof(size_t))) << std::endl; 

The output is:

1)  18446744073709551616
2)  9223372036854775808

It is due to this part of the specification:

7.3.10 Floating-integral conversions [conv.fpint]

A prvalue of a floating-point type can be converted to a prvalue of an integer type. The conversion truncates; that is, the fractional part is discarded. The behavior is undefined if the truncated value cannot be represented in the destination type.

The value 18446744073709551616 (that is the truncated value) is larger than std::numeric_limits<size_t>::max() on your system, and because of that, the behavior of the cast is undefined.

If we want to calculate the number of distinct values a given unsigned integral data type can represent, we can compute

 std::cout << "1)  " << std::pow(2, 8 * sizeof(size_t)) << std::endl; // yields 18446744073709551616

This calculates 2 to the power of 64 and yields 18446744073709551616. Since sizeof(size_t) is 8 bytes on a 64-bit machine and a byte has 8 bits, the width of the size_t data type is 64 bits, hence 2^64.

This is no surprise: size_t usually has the width of the platform's native word, so that an address or an index into an array or vector can be handled in a single machine word.

The above number is the count of all distinct values that a 64-bit unsigned integral data type such as size_t or unsigned long long can represent, including 0. And since 0 is included, the highest representable value is exactly one less: 18446744073709551615.

This number can also be retrieved by

 std::cout << std::numeric_limits<size_t>::max() << std::endl; // yields 18446744073709551615
 std::cout << std::numeric_limits<unsigned long long>::max() << std::endl; // yields the same

Now, an unsigned data type stores its values like this:

   00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 is 0 
   00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000001 is 1 or 2^0
   00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000010 is 2 or 2^1
   00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000011 is 3 or 2^1+2^0
   00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000100 is 4 or 2^2
   ...
   11111111 11111111 11111111 11111111 11111111 11111111 11111111 11111111 is 18446744073709551615
   and if you want to add another 1, you would need a 65th bit on the left, which you don't have:
 1 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 is 0 because 
   there are no more bits on the left.
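
A quick way to reproduce those patterns is std::bitset; here is a minimal sketch, assuming a 64-bit size_t:

#include <bitset>
#include <cstddef>
#include <iostream>

int main() {
    // Small values and their 64-bit patterns.
    for (std::size_t v = 0; v <= 4; ++v)
        std::cout << std::bitset<64>(v) << " is " << v << '\n';

    std::size_t max = ~std::size_t{0};                                    // all 64 bits set
    std::cout << std::bitset<64>(max) << " is " << max << '\n';           // 18446744073709551615
    std::cout << std::bitset<64>(max + 1) << " is " << (max + 1) << '\n'; // wraps to 0
}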

Any value higher than the highest representable one comes down to the value modulo the largest possible value plus 1, i.e. value % (max + 1), which, as we can see, yields zero in the sample above.

And since this comes so naturally, the standard defines that converting any integral type, signed or unsigned, to an unsigned integral type yields the source value modulo the largest representable value plus 1, i.e. modulo 2^N for an N-bit type. Beautiful.
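
A minimal sketch of that rule, using smaller unsigned types where the wraparound is easy to check by hand (this assumes the usual 8-bit unsigned char and 16-bit unsigned short):

#include <iostream>

int main() {
    // 300 does not fit into 8 bits: 300 % 256 == 44.
    unsigned char c = static_cast<unsigned char>(300);
    std::cout << static_cast<int>(c) << '\n';   // 44

    // 70000 does not fit into 16 bits: 70000 % 65536 == 4464.
    unsigned short s = static_cast<unsigned short>(70000);
    std::cout << s << '\n';                     // 4464
}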

But this easy rule holds a little surprise when we convert a negative integer to an unsigned integral type, -1 to unsigned long long for example. You start with the value 0 and then subtract 1. What happens is the opposite of the sequence in the sample above. Have a look:

  00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 is 0 and now do -1
  11111111 11111111 11111111 11111111 11111111 11111111 11111111 11111111 is 18446744073709551615

So yes, converting -1 to size_t yields std::numeric_limits<size_t>::max(). Quite surprising at first, but understandable after some thought and experimentation.
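
This is easy to verify, and unlike the pow example it is well-defined, because the source here is an integer, not a floating-point value (the printed values assume a 64-bit size_t):

#include <cstddef>
#include <iostream>
#include <limits>

int main() {
    std::size_t v = static_cast<std::size_t>(-1);  // -1 mod 2^64 == 2^64 - 1
    std::cout << v << '\n';                        // 18446744073709551615
    std::cout << (v == std::numeric_limits<std::size_t>::max()) << '\n'; // 1
}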

Now for our second line of code

 std::cout << "2)  " << static_cast<size_t>(std::pow(2, 8 * sizeof(size_t))) << std::endl;

we would naively expect 18446744073709551616, the same result as in line one, of course.

But since we now know about modulo the largest plus 1, and we know that the largest plus one wraps around to 0, we would, again naively, also accept 0 as an answer.

Why naively? Because std::pow returns a double, not an integral type. A double is also 64 bits wide, but internally its representation is entirely different.

 0XXXXXXX XXXX0000 00000000 00000000 00000000 00000000 00000000 00000000

Only those 11 X bits represent the exponent, in 2^n form (stored with a bias of 1023 in IEEE-754 binary64). To represent 2^64, those 11 bits hold the biased value 64 + 1023 = 1087, and the double represents 2^64 * 1. So the representation of our big number is much more compact in a double than it would be in a size_t. If someone wanted to apply the modulo-the-largest-plus-1 rule, a further conversion would be needed first to turn this representation of 2^64 into a plain 64-bit pattern.
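
You can inspect those bits directly; here is a minimal sketch, assuming IEEE-754 binary64 doubles and a C++20 compiler for std::bit_cast:

#include <bit>        // std::bit_cast (C++20)
#include <bitset>
#include <cmath>
#include <cstdint>
#include <iostream>

int main() {
    double p = std::ldexp(1.0, 64);                        // exactly 1.0 * 2^64
    std::uint64_t bits = std::bit_cast<std::uint64_t>(p);  // reinterpret the raw 64 bits
    std::cout << std::bitset<64>(bits) << '\n';
    // Prints 0100001111110000...0: sign 0, biased exponent
    // 64 + 1023 = 1087 = 10000111111, mantissa all zero.
}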

Some further reading about floating-point representation can be found, for example, at https://docs.microsoft.com/en-us/cpp/build/ieee-floating-point-representation?view=msvc-160.

And the standard says that converting a floating-point value to an integral type that cannot represent the truncated value is UB, undefined behaviour.

See the C++17 standard, ISO/IEC 14882:2017, 7.10 Floating-integral conversions [conv.fpint]:

  1. A prvalue of a floating-point type can be converted to a prvalue of an integer type. The conversion truncates; that is, the fractional part is discarded. The behavior is undefined if the truncated value cannot be represented in the destination type. ...

So a double can easily hold 2^64, and that's why line 1 printed it so easily. But that value is 1 too large to be represented in size_t, so the conversion is UB. Whatever the outcome of line 2 is, it is simply irrelevant, because it is UB.
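
If a defined result is needed, the usual approach is to range-check before converting. Here is a sketch with a hypothetical helper to_size_t, assuming a 64-bit size_t; note that 2^64 itself is exactly representable as a double, so it can serve as an exclusive upper bound:

#include <cmath>
#include <cstddef>
#include <iostream>
#include <optional>

// Hypothetical helper: returns the truncated value only if it fits in size_t.
std::optional<std::size_t> to_size_t(double d) {
    if (d >= 0.0 && d < std::ldexp(1.0, 64))  // valid range is [0, 2^64); NaN fails both tests
        return static_cast<std::size_t>(d);
    return std::nullopt;
}

int main() {
    std::cout << to_size_t(std::pow(2, 64)).has_value() << '\n'; // 0: out of range
    std::cout << *to_size_t(123.9) << '\n';                      // 123: truncated
}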

OK, but if any random result will do, how come the UB outcome is exactly half? Well, first of all, the shown outcome is from MSVC; Clang or another compiler may deliver a different result.

But let's look at the "half" outcome, since it is easy to explain.

   Trying to add 1 to the largest
   11111111 11111111 11111111 11111111 11111111 11111111 11111111 11111111 is 18446744073709551615
   would, if only integrals were involved, lead to
 1 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
   but that is not possible, since the 65th bit does not exist and the source is not an
   integral but a double, hence UB. Here the result happens to be
   10000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 which is 9223372036854775808,
   exactly half of the naive expectation, or 2^63.
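
That pattern can be produced without invoking the UB at all; the well-defined shift below yields the same number (assuming a 64-bit size_t). And on x86-64 the "half" is not really an accident: the conversion typically compiles to the cvttsd2si instruction, which returns the "integer indefinite" value 0x8000000000000000, exactly 2^63, whenever the result would overflow.

#include <bitset>
#include <cstddef>
#include <iostream>

int main() {
    std::size_t half = std::size_t{1} << 63;     // 2^63, well-defined for unsigned
    std::cout << half << '\n';                   // 9223372036854775808
    std::cout << std::bitset<64>(half) << '\n';  // 1 followed by 63 zeros
}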
