
IEEE floating point addition

Instead of adding two floats directly:

  float a = 32.342 , b = 193.132;
  float total = a + b;

how can I convert them to their 32-bit representations and perform the addition explicitly on those 32 bits?

If I understand you correctly (you want to convert both floats to byte arrays and then add these arrays together), you can implement something like this:

  // Requires: using System; using System.Linq;

  // Initial floats
  float a = 32.342f, b = 193.132f; // do not forget the "f" suffix
  float total = a + b;

  // Floats as byte[4] arrays
  byte[] aArray = BitConverter.GetBytes(a);
  byte[] bArray = BitConverter.GetBytes(b);
  // For comparison: the bytes of the actual float sum
  byte[] totalArray = BitConverter.GetBytes(total);

  // Add the arrays directly: convert them to Int32, add them up
  // and, finally, convert the sum back to an array.
  // Reverse().ToArray(): we should take endianness into account.
  // unchecked: let the integer sum wrap around instead of throwing on overflow.
  int c = unchecked(BitConverter.ToInt32(aArray.Reverse().ToArray(), 0) + 
                    BitConverter.ToInt32(bArray.Reverse().ToArray(), 0));
  byte[] cArray = BitConverter.GetBytes(c).Reverse().ToArray();
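If "using their 32 bits explicitly" means working with the IEEE 754 fields rather than with raw bytes, it may help to first reinterpret each float as one 32-bit unsigned integer and split it into sign (1 bit), biased exponent (8 bits) and fraction (23 bits). The following is only a minimal sketch; the class and method names (`FloatFields`, `ToBits`, `Sign`, `Exponent`, `Fraction`) are made up for illustration and are not part of the answer above.

```csharp
using System;

static class FloatFields
{
    // Reinterprets the 32 bits of a float as an unsigned integer.
    // GetBytes and ToUInt32 both use the machine byte order, so the
    // round trip does not depend on endianness.
    public static uint ToBits(float value)
    {
        return BitConverter.ToUInt32(BitConverter.GetBytes(value), 0);
    }

    public static uint Sign(uint bits)     { return bits >> 31; }           // bit 31
    public static uint Exponent(uint bits) { return (bits >> 23) & 0xFF; }  // bits 30..23, biased by 127
    public static uint Fraction(uint bits) { return bits & 0x7FFFFF; }      // bits 22..0

    static void Main()
    {
        uint bits = ToBits(193.132f);
        // Print all 32 bits, most significant bit first
        Console.WriteLine(Convert.ToString((long)bits, 2).PadLeft(32, '0'));
        Console.WriteLine($"sign={Sign(bits)} exponent={Exponent(bits)} fraction=0x{Fraction(bits):X6}");
    }
}
```

For 193.132f the biased exponent is 134, that is 134 - 127 = 7, which fits 128 <= 193.132 < 256.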

Visualization (let's see all these bits):

  private static String ToReport(byte[] data) {
    return String.Join(" ", data.Select(x => Convert.ToString(x, 2).PadLeft(8, '0')));
  }

  ...

  String text = String.Join(Environment.NewLine,
    $"a:     {a,7} {ToReport(aArray)}",
    $"b:     {b,7} {ToReport(bArray)}",
    $"a + b:         {ToReport(cArray)}",
    $"total: {total,7} {ToReport(totalArray)}");

  Console.Write(text);

Outcome:

  a:      32.342 00110101 01011110 00000001 01000010
  b:     193.132 11001011 00100001 01000001 01000011
  a + b:         00000000 01111111 01000010 10000101 // array + array
  total: 225.474 01011000 01111001 01100001 01000011 // float + float
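The two last rows differ because adding the bit patterns as integers mixes the exponent and fraction fields, while a floating point addition aligns the exponents, adds the significands, normalizes and rounds. A rough sketch of that procedure using integer operations only is shown below. It is an illustration under strong assumptions, not a full IEEE 754 implementation: both inputs must be positive, normal and finite, and overflow to infinity is not handled. The names `SoftFloat` and `AddPositiveNormal` are hypothetical.

```csharp
using System;

static class SoftFloat
{
    public static uint ToBits(float f)
    {
        return BitConverter.ToUInt32(BitConverter.GetBytes(f), 0);
    }

    public static float FromBits(uint bits)
    {
        return BitConverter.ToSingle(BitConverter.GetBytes(bits), 0);
    }

    // ASSUMPTION: x and y are positive, normal, finite, and the sum does not overflow.
    public static float AddPositiveNormal(float x, float y)
    {
        uint a = ToBits(x), b = ToBits(y);
        // Positive floats order the same way as their bit patterns
        if (a < b) { uint t = a; a = b; b = t; }

        int expA = (int)((a >> 23) & 0xFF);
        int expB = (int)((b >> 23) & 0xFF);

        // 24-bit significands (implicit leading 1 restored),
        // shifted left by 3 to make room for guard, round and sticky bits
        ulong sigA = (ulong)((a & 0x7FFFFF) | 0x800000) << 3;
        ulong sigB = (ulong)((b & 0x7FFFFF) | 0x800000) << 3;

        // Align the smaller operand to the larger exponent
        int shift = expA - expB;
        if (shift >= 27)
        {
            sigB = 1; // everything shifted out, only the sticky bit remains
        }
        else if (shift > 0)
        {
            ulong lost = sigB & ((1UL << shift) - 1);
            sigB = (sigB >> shift) | (lost != 0 ? 1UL : 0UL);
        }

        ulong sum = sigA + sigB;
        int exp = expA;

        // Carry out of the 24-bit significand: renormalize, keeping the sticky bit
        if ((sum & (1UL << 27)) != 0)
        {
            sum = (sum >> 1) | (sum & 1);
            exp++;
        }

        // Round to nearest, ties to even, using the 3 extra bits
        ulong low = sum & 7;
        sum >>= 3;
        if (low > 4 || (low == 4 && (sum & 1) == 1))
        {
            sum++;
            if (sum == (1UL << 24)) { sum >>= 1; exp++; }
        }

        return FromBits(((uint)exp << 23) | ((uint)sum & 0x7FFFFF));
    }

    static void Main()
    {
        float a = 32.342f, b = 193.132f;
        Console.WriteLine(ToBits(AddPositiveNormal(a, b)).ToString("X8")); // software sum
        Console.WriteLine(ToBits(a + b).ToString("X8"));                   // hardware sum
    }
}
```

For the two values from the question, both lines should print 43617958, which is the byte sequence shown in the `total` row read from right to left.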


 