How do I divide an IEEE 754 64-bit double by 1000 on a platform that only supports 32-bit floats?

Question

I've got an electricity meter connected to a DCS (distributed control system) by PROFIBUS. The meter (Siemens Sentron PAC3200) supplies its count as an IEEE 754 double in Wh (watt-hours). Also, the counter overflows at 1.0e12 Wh or 1,000 GWh. (Cutaway scene: Several years earlier, Siemens development labs. "Let's see, how to transfer a 40-bit unsigned integer value? Let's use double!")

My goal is to log the count consistently in kWh precision.

The DCS, however, only supports single-precision floats. So if I took the direct route, i.e. squeezed the data into a float, errors would appear in the kWh reading once it exceeds about seven decimal digits, i.e. at the latest from about 100,000,000 Wh or 100 MWh. The current count is already 600 MWh, so this is not feasible.
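To make the precision limit concrete, here is a minimal C sketch. It assumes a host that has both `float` and `double` (which the DCS itself does not), and `float_rounding_error_wh` is a made-up name used only for illustration. It reports how far a Wh count moves when it is squeezed into a binary32.

```c
#include <assert.h>

/* Hypothetical helper: error in Wh introduced by storing a count as binary32. */
static double float_rounding_error_wh(double wh)
{
    assert(wh >= 0.0);
    float squeezed = (float)wh;      /* rounds to 24 significant bits */
    return (double)squeezed - wh;    /* negative if the value was rounded down */
}
```

Adjacent binary32 values are 64 Wh apart around 600 MWh and 65,536 Wh apart just below the 1.0e12 Wh limit, so the error grows with the count.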

So for now, I put the mantissa into an unsigned double integer (UDINT, 32 bits on this platform) and perform the conversion according to IEEE 754, which yields the correct value in Wh. This, however, means an overflow at 2^32 Wh or about 4.3 GWh, which will last us barely ten years.

Since I need only kWh precision, I had the idea of dividing by 1000 early in the conversion. This would put the variable overflow at 4,300 GWh, and the meter's internal counter already overflows at 1,000 GWh. Problem solved, in theory.

As IEEE 754 is a binary floating-point format, however, I can only easily divide by 1024 (right shifting 10 times), which introduces a substantial error. Multiplying by a correction factor of 1.024 afterwards would only ever happen in single precision on this platform, nullifying the previous effort.
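The size of that error can be illustrated with a short C sketch. The function names are hypothetical, and the input is assumed to be a Wh count that still fits in 32 bits.

```c
#include <assert.h>
#include <stdint.h>

/* Approximate Wh -> kWh by shifting right 10 bits (divides by 1024, not 1000). */
static uint32_t kwh_by_shift(uint32_t wh) { return wh >> 10; }

/* Exact integer conversion, for comparison. */
static uint32_t kwh_exact(uint32_t wh) { return wh / 1000u; }

/* How many kWh the shifted result falls short of the exact one. */
static uint32_t shift_shortfall_kwh(uint32_t wh)
{
    assert(kwh_exact(wh) >= kwh_by_shift(wh));
    return kwh_exact(wh) - kwh_by_shift(wh);
}
```

For the current count of 600 MWh the shifted value is short by 14,063 kWh, about 2.3 %.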

Another option would be to output a "high" and a "low" UDINT in Wh from the conversion. Then I could, at least in theory, calculate back to kWh, but this seems awkward (and awful).

I have the subtle feeling I may have overlooked something (single-person groupthink, so to speak); I'm open to any other ideas on how I could obtain 1/1000th of the transferred double value.

Thanks and best regards

Björn

PS: For your viewing pleasure, this is the solution based on @EricPostpischil's answer, tailored to platform and task specifics. The language used is SCL (structured control language) as per EN 61131-3, which is a kind of Pascal dialect.

FUNCTION_BLOCK PAC3200KON_P

VAR_INPUT
    INH : DWORD;
    INL : DWORD;
END_VAR

VAR_OUTPUT
    OUT : UDINT;
    SGN : BOOL;
END_VAR

VAR
    significand:              UDINT;
    exponent, i, shift:       INT;
    sign:                     BOOL;
    d0, d1, y0, y1, r1, temp: DWORD;
END_VAR
(*
    Convert the energy count delivered by Siemens Sentron PAC3200
    (IEEE 754 binary64 format, a.k.a. double) into an UDINT.

    Peculiarities:
    - This hardware platform only supports binary32 (a.k.a. float).

    - The Sentron's internal counter overflows at 1.0e12 Wh (1000 GWh).

    - kWh resolution suffices.

    - If you converted the double directly to UDINT and divided by 1000
      afterwards, the range would be limited to (2^32-1) Wh or about
      4.295 GWh.

    - This is why this function first divides the significand by 1000
      and then proceeds with conversion to UDINT. This expands the
      range to (2^32-1) kWh or about 4295 GWh, which isn't reachable in
      practice since the device's internal counter overflows before.

    Background:

    IEEE 754 binary64 bit assignment:

               High-Byte                         Low-Byte
    66665555555555444444444433333333 3322222222221111111111
    32109876543210987654321098765432 10987654321098765432109876543210
    GEEEEEEEEEEESSSSSSSSSSSSSSSSSSSS SSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSS

    G: sign (1: negative)
    E: exponent (biased; subtract 1023) (11 bits)
    S: significand (52 bits)
*)

(*
    significand: Bits 19...0 of high byte and complete low byte

    The significand is initially divided by 1000 using integer division. The
    bits are divided into two parts:

    - d1 contains the 31 most significant bits (plus leading 1)
    - d0 contains the next less significant bits

    In total, we use 48 bits of the original significand.
*)

(* d1: insert significand bits from high byte *)
d1 := INH AND     2#0000_0000_0000_1111_1111_1111_1111_1111;
(* result:        2#0000_0000_0000_HHHH_HHHH_HHHH_HHHH_HHHH *)

(* add the 1 before the binary point *)
d1 := d1 OR       2#0000_0000_0001_0000_0000_0000_0000_0000;
(* result:        2#0000_0000_0001_HHHH_HHHH_HHHH_HHHH_HHHH *)

(* "flush left" shift 11 places *)
d1 := d1 * 2048;
(* result:        2#1HHH_HHHH_HHHH_HHHH_HHHH_H000_0000_0000 *)

(* Insert another 11 bits from low byte (msb ones) *)
d1 := d1 OR (INL / 2097152);
(* result:        2#1HHH_HHHH_HHHH_HHHH_HHHH_HLLL_LLLL_LLLL *)

(* Base-65536 division. Integer divide by 1000 and save remainder *)
y1 := d1 / 1000;
r1 := TO_DW(TO_UD(d1) MOD 1000);

(*
   The significand now has leading zeroes. Shift left to make space
   at the other end.
*)
FOR shift := 1 TO 31 BY 1 DO
    y1 := y1 * 2;
    IF (y1 AND 2#1000_0000_0000_0000_0000_0000_0000_0000) <> 0 THEN
        EXIT;
    END_IF;
END_FOR;

(*
   d0: insert next 16 bits from the low byte
   (right shift five times and zero out the leading places)
*)
(* bits:             2#xxxx_xxxx_xxxL_LLLL_LLLL_LLLL_LLLx_xxxx *)
d0 := (INL / 32) AND 2#0000_0000_0000_0000_1111_1111_1111_1111;
(* result:           2#0000_0000_0000_0000_LLLL_LLLL_LLLL_LLLL *)

(* Now divide by 1000, factoring in remainder from before *)
y0 := ((r1 * 65536) OR d0) / 1000;

(*
   y1 and y0 contain results from division by 1000. We'll now build a 32 bit
   significand from these.

   y1 = 2#1HHH_HHHH_HHHH_HHHH_HHHH_HHxx_xxxx_xxxx
   y0 = 2#0000_0000_0000_0000_LLLL_LLLL_LLLL_LLLL

   y1 has an uncertain number of zeroes at its end, resulting from the above
   left shifting (number of steps inside variable "shift"). Fill those with the
   most significant bits from y0.

   y0 has 16 valid bits (0..15). Shift right so that the "highest place zero"
   in y1 corresponds with the MSB from y0. (shift by 16-shift)

   y1 = 2#1HHH_HHHH_HHHH_HHHH_HHHH_HHxx_xxxx_xxxx (ex.: shift=10)
   y0 = 2#0000_0000_0000_0000_0000_00LL_LLLL_LLLL
                              ------>^
*)

FOR i := 1 TO 16 - shift BY 1 DO
    y0 := y0 / 2;
END_FOR;

significand := TO_UD(y1 OR y0);
(* Result: 32-bit significand *)

(*
    Exponent: bits (62-32)...(52-32) or bits 30...20 of high byte, respectively

    Coded with bias of 1023 (needs to be subtracted).

    Special cases as per standard:
    - 16#000: signed zero or underflow (map to zero)
    - 16#7FF: infinite or NaN (map to overflow)
*)
temp := 2#0111_1111_1111_0000_0000_0000_0000_0000 AND INH;
temp := temp / 1048576 ; (* right shift 20 places (2^20) *)
exponent := TO_IN(TO_DI(temp));
exponent := exponent - 1023; (* remove bias *)

(*
   Above, we already left shifted "shift" times, which needs to be taken into
   account here by shifting less.
*)
exponent := exponent - shift;

(*
    The significand will be output as UDINT, but was initially a binary64 with
    binary point behind the leading 1, after which the coded exponent must be
    "executed".

    temp = 2#1.HHH_HHHH_HHHH_HHHH_HHHH_HLLL_LLLL_LLLL

    As UDINT, this already corresponds to a 31-fold left shift.

    Exponent cases as per IEEE 754:

    - exponent < 0:            result < 1
    - exponent = 0:       1 <= result < 2
    - exponent = x > 0: 2^x <= result < 2^(x+1)

    The UDINT output (32 bit) allows us to represent exponents right up to 31.
    Everything above is mapped to UDINT's maximum value.

    Now determine, after the de facto 31-fold left shift, what shifts remain
    "to do".
*)

IF exponent < 0 THEN
    (* underflow: < 2^0 *)
    significand := 0;
ELSIF exponent > 31 THEN
    (* overflow: > 2^32 - 1 *)
    significand := 4294967295;
ELSE
    (*
        result is significand * 2^exponent or here, as mentioned above,
        significand / 2^(31-exponent).

        The loop index i is the "shift target" after loop execution, which is
        why it starts at 31-1.

        Example: exponent = 27, but de facto we've already got a shift of 31.
        So we'll shift back four times to place the binary point at the right
        position (30, 29, 28, 27):

        before: temp = 2#1HHH_HHHH_HHHH_HHHH_HHHH_HLLL_LLLL_LLLL.

        after:  temp = 2#1HHH_HHHH_HHHH_HHHH_HHHH_HLLL_LLLL.LLLL
                                                           ^<---|
    *)
    FOR i := 30 TO exponent BY -1 DO
        significand := significand / 2;
    END_FOR;
END_IF;

(*
    sign: bit 63 of high byte
*)
sign := (2#1000_0000_0000_0000_0000_0000_0000_0000 AND INH) <> 0;

OUT := significand;
SGN := sign;

END_FUNCTION_BLOCK

The test data I used:

  high byte     low byte  decimal value
=======================================
16#41c558c3, 16#2d3f331e,       716_277
16#41EFFFFF, 16#5E000000,     4_294_966
16#41EFFFFF, 16#DB000000,     4_294_967
16#41F00000, 16#2C000000,     4_294_968
16#426D1A94, 16#A1830000,   999_999_999
16#426D1A94, 16#A2000000, 1_000_000_000
16#426D1A94, 16#A27D0000, 1_000_000_001
16#428F3FFF, 16#FFC18000, 4_294_967_294
16#428F3FFF, 16#FFE0C000, 4_294_967_295
16#428F4000, 16#00000000, 4_294_967_296

BTW, integer literals of the form b#1234 in SCL basically mean "the number 1234 in base b". Underscores are ignored (they're digit separators for improved readability, as in Python, for example).
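For readers who do not speak SCL, the same task can be sketched in C. This is not a translation of the function block above: it shifts the significand down to whole Wh first and then applies the radix-65536 long division, which is a simpler ordering chosen for illustration. The name `pac3200_kwh` is hypothetical, the sign bit is ignored, and only 32-bit unsigned arithmetic is used.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper, not part of the original post: convert the meter's
   binary64 count (passed as its high and low 32-bit words, in Wh) to whole
   kWh using only 32-bit unsigned arithmetic. The sign bit is ignored. */
static uint32_t pac3200_kwh(uint32_t inh, uint32_t inl)
{
    /* Unbiased exponent from bits 30..20 of the high word. */
    int32_t e = (int32_t)((inh >> 20) & 0x7FFu) - 1023;

    if (e < 0)  return 0u;          /* zero, subnormal, or less than 1 Wh */
    if (e > 41) return UINT32_MAX;  /* too large (also infinity and NaN) */

    /* Upper 21 bits of the 53-bit significand, with the implicit leading 1. */
    uint32_t m_hi = (inh & 0xFFFFFu) | 0x100000u;

    /* Whole Wh = significand >> (52 - e); the shift is between 11 and 52. */
    uint32_t s = 52u - (uint32_t)e;
    uint32_t n_hi, n_lo;
    if (s >= 32u) {
        n_hi = 0u;
        n_lo = m_hi >> (s - 32u);
    } else {
        n_hi = m_hi >> s;
        n_lo = (inl >> s) | (m_hi << (32u - s));
    }

    /* Long division by 1000 in radix 65536. */
    uint32_t d1 = (n_hi << 16) | (n_lo >> 16);
    uint32_t d0 = n_lo & 0xFFFFu;
    assert(d1 < (1u << 26));
    uint32_t y1 = d1 / 1000u, r1 = d1 % 1000u;
    uint32_t y0 = ((r1 << 16) | d0) / 1000u;

    if (y1 > 0xFFFFu) return UINT32_MAX;  /* quotient does not fit in 32 bits */
    return (y1 << 16) | y0;
}
```

With the first row of the test data (16#41c558c3, 16#2d3f331e) this returns 716277, and inputs of 2^32 kWh or more saturate at 4294967295.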

Answer 1

/*  This program shows two methods of dividing an integer exceeding 32 bits
    by 1000 using unsigned 32-bit integer arithmetic.
*/


#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>


/*  If the count is less than 2**35, we can shift three bits (divide by 8) and
    then divide by 125 using 32-bit unsigned arithmetic.
*/
static uint32_t ShiftThenDivide(uint64_t x)
{
    uint32_t y = x >> 3;
    return y / 125;
}


/*  Given any count less than 1000*2**32 (which exceeds the 2**40 requirement),
    we can perform long division in radix 65536.
*/
static uint64_t LongDivision(uint64_t x)
{
    /*  Set d1 to the high two base-65536 digits (bits 16 to 47) and d0 to
        the low digit (bits 0 to 15).
    */
    uint32_t d1 = x >> 16, d0 = x & 0xffffu;

    //  Get the quotient and remainder of dividing d1 by 1000.
    uint32_t y1 = d1 / 1000, r1 = d1 % 1000;

    /*  Combine the previous remainder with the low digit of the dividend and
        divide by 1000.
    */
    uint32_t y0 = (r1<<16 | d0) / 1000;

    //  Return a quotient formed from the two quotient digits.
    return y1 << 16 | y0;
}


static void Test(uint64_t x)
{
    //  Use 64-bit arithmetic to get a reference result.
    uint32_t y0 = x / 1000;

    //  ShiftThenDivide only works up to 2**35, so only test up to that.
    if (x < UINT64_C(1) << 35)
    {
        uint32_t y1 = ShiftThenDivide(x);
        if (y1 != y0)
        {
            printf("Error, 0x%" PRIx64 " / 1000 = 0x%" PRIx32 ", but ShiftThenDivide produces 0x%" PRIx32 ".\n",
                x, y0, y1);
            exit(EXIT_FAILURE);
        }
    }

    //  Test LongDivision.
    uint32_t y2 = LongDivision(x);
    if (y2 != y0)
    {
        printf("Error, 0x%" PRIx64 " / 1000 = 0x%" PRIx32 ", but LongDivision produces 0x%" PRIx32 ".\n",
            x, y0, y2);
        exit(EXIT_FAILURE);
    }
}


int main(void)
{
    srandom(time(0));

    //  Test all possible values for the upper eight bits.
    for (uint64_t upper = 0; upper < 1<<8; ++upper)
    {
        //  Test some edge cases.
        uint64_t x = upper << 32;
        Test(x);
        Test(x+1);
        Test(x-1 & 0xffffffffffu);
            /*  When x is zero, x-1 would wrap modulo 2**64, but that is
                outside our supported domain, so wrap modulo 2**40.
            */

        //  Test an assortment of low 32 bits.
        for (int i = 0; i < 1000; ++i)
        {
            uint32_t r0 = random() & 0xffffu, r1 = random() & 0xffffu;
            uint64_t lower = r1 << 16 | r0;
            Test(x | lower);
        }
    }
}
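Why the two methods give the exact quotient can be written out as a short derivation. This is a sketch of the reasoning, using the variable names from the program above. Write the dividend in radix 65536 and divide the high part first:

$$x = 65536\,d_1 + d_0, \qquad d_1 = 1000\,y_1 + r_1, \qquad 0 \le r_1 < 1000, \quad 0 \le d_0 < 65536$$

Substituting gives

$$x = 1000 \cdot 65536\,y_1 + (65536\,r_1 + d_0)
\quad\Longrightarrow\quad
\left\lfloor \frac{x}{1000} \right\rfloor = 65536\,y_1 + \left\lfloor \frac{65536\,r_1 + d_0}{1000} \right\rfloor = 65536\,y_1 + y_0$$

Because $65536\,r_1 + d_0 < 65536 \cdot 1000 < 2^{32}$, the second dividend fits in 32 bits and $y_0 < 65536$, so combining the digits with a bitwise OR is the same as adding them. The bound $x < 1000 \cdot 2^{32}$ keeps $y_1$ below $2^{16}$, so the shifted high digit also fits in 32 bits.

ShiftThenDivide relies on the identity

$$\left\lfloor \frac{\lfloor x/8 \rfloor}{125} \right\rfloor = \left\lfloor \frac{x}{1000} \right\rfloor$$

which holds for nonnegative integers; the limit $x < 2^{35}$ only ensures that $\lfloor x/8 \rfloor$ fits in 32 bits.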

Answer 2

I would address the problem in a slightly different way. Since the OP did not mention which programming language is used, I will write some pseudocode here. I will assume that the binary64 floating-point number is passed in as a sequence of 8 bytes. I will assume that the OP will take care of endianness where needed.

1. Split the binary64 into three binary32 floating-point numbers:

A binary64 floating-point number is represented by a single sign bit (b_63), 11 exponent bits (encoding the biased exponent e) and 52 bits representing the significand (b_51 ... b_0), and is computed as:

$$(-1)^{b_{63}} \left(1 + \sum_{i=1}^{52} b_{52-i}\, 2^{-i}\right) \times 2^{e-1023}$$

A binary32 floating-point number is represented by a single sign bit (b_31), 8 exponent bits (encoding the biased exponent e) and 23 bits representing the significand (b_22 ... b_0), and is computed as:

$$(-1)^{b_{31}} \left(1 + \sum_{i=1}^{23} b_{23-i}\, 2^{-i}\right) \times 2^{e-127}$$

The idea is now to create three binary32 floating-point numbers f1, f2 and f3 such that, when using real arithmetic (no floating-point approximations), the binary64 floating-point number d is given by:

d = f1 + f2 + f3

Assume that the function Extract(d,n,m) returns the integer formed by significand bits n to m of the binary64 bit representation d, counting from the most significant significand bit (i = 1 is b_51, i = 52 is b_0):

function val = Extract(d,n,m)
   val = Sum(b_(52-i) 2^(m-i); i = n → m)

and the function Exponent(d) returns the value e-1023 of the binary64 bit representation d.

Then we know that

f1 = (2^23 + Extract(d,1,23)) * 2^(Exponent(d) - 23)
f2 = Extract(d,24,46) * 2^(Exponent(d) - 46)
f3 = Extract(d,47,52) * 2^(Exponent(d) - 52)
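The split can be sketched in C as follows. The helper names are hypothetical, the input is assumed to be a positive, normal binary64 given as its high and low 32-bit words, and the count is assumed to lie between 1 and 1.0e12 so that every scale factor stays well inside the binary32 range. `split_sum` exists only to check the identity d = f1 + f2 + f3 on a host that has `double`.

```c
#include <assert.h>
#include <stdint.h>

/* 2^k as a binary32, built by repeated doubling or halving. */
static float two_to_the(int k)
{
    float r = 1.0f;
    for (; k > 0; --k) r *= 2.0f;
    for (; k < 0; ++k) r *= 0.5f;
    return r;
}

/* Hypothetical helper: split a positive, normal binary64 (given as two
   32-bit words) into three binary32 values with f[0] + f[1] + f[2] = d. */
static void split_binary64(uint32_t hi, uint32_t lo, float f[3])
{
    int e = (int)((hi >> 20) & 0x7FFu) - 1023;            /* Exponent(d) */
    assert(e >= 0 && e <= 40);                            /* assumed input range */

    uint32_t top = ((hi & 0xFFFFFu) << 3) | (lo >> 29);   /* Extract(d,1,23)  */
    uint32_t mid = (lo >> 6) & 0x7FFFFFu;                 /* Extract(d,24,46) */
    uint32_t low = lo & 0x3Fu;                            /* Extract(d,47,52) */

    /* Each integer is below 2^24, so the conversions to float are exact. */
    f[0] = (float)(0x800000u + top) * two_to_the(e - 23);
    f[1] = (float)mid * two_to_the(e - 46);
    f[2] = (float)low * two_to_the(e - 52);
}

/* Check helper for a host with double: adds the three parts back together. */
static double split_sum(uint32_t hi, uint32_t lo)
{
    float f[3];
    split_binary64(hi, lo, f);
    return (double)f[0] + (double)f[1] + (double)f[2];
}
```

For the bit pattern of 1.0e12 (high word 0x426D1A94, low word 0xA2000000) the three parts add back up to exactly 1.0e12.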

2. Divide the values by 1000:

This is, unfortunately, easier said than done. It is well known that computing with finite precision implies rounding errors, leading to inexact results. This is exactly what we try to avoid here. If we just computed

f1 * 1E-3 + f2 * 1E-3 + f3 * 1E-3

we would introduce rounding errors.

Assume a and b are two floating-point numbers, the function fl(x) returns the floating-point number obtained by rounding the real value x, and a OP b denotes the exact real result of one of the basic operations +, - and *. In general a OP b != fl(a OP b), because the real result cannot always be represented exactly by a floating-point number. However, it can be shown that a OP b = fl(a OP b) + y, with y itself a floating-point number. This y is the error we would miss in the above computation when just computing f1 * fl(1E-3).

So to compute d * fl(1E-3) accurately, we will need to keep track of the error terms. For this, we will make use of some error-free transformations which are reviewed in the paper "Accurate summation, dot product and polynomial evaluation in complex floating-point arithmetic":

# error-free transformation of the sum of two floating-point numbers
function [x,y] = TwoSum(a,b)
   x = a + b
   z = x - a
   y = ((a - (x - z)) + (b - z))
# error-free split of a floating-point number into two parts
function [x,y] = Split(a)
   c = (2^12 + 1) * a
   x = (c - (c - a))
   y = a - x
# error-free transformation of the product of two floating-point numbers
function [x,y] = TwoProduct(a,b)
   x = a * b
   [a1,a2] = Split(a); [b1,b2] = Split(b)
   y = (a2*b2 - (((x - a1*b1) - a2*b1) - a1*b2))

3. The complete function:

So if we want to rescale the binary64 number with bit representation d using binary32 floating-point arithmetic, we should use the function:

# rescale double-precision d by a single-precision a
function res = Rescale(d,a)
   # first term
   f = (2^23 + Extract(d,1,23)) * 2^(Exponent(d) - 23)
   [p,s] = TwoProduct(f,a)
   # second term
   f = Extract(d,24,46) * 2^(Exponent(d) - 46)
   [h,r] = TwoProduct(f,a)
   [p,q] = TwoSum(p,h)
   s = s + (q + r)       # the error term
   # third term
   f = Extract(d,47,52) * 2^(Exponent(d) - 52)
   [h,r] = TwoProduct(f,a)
   [p,q] = TwoSum(p,h)
   s = s + (q + r)       # the error term
   # the final result
   res = p + s

This will have kept track of all numeric errors within floating-point math and compensated the result accordingly. As a result, the value res returned by Rescale will represent the most accurate single-precision value of d/1000.
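A C rendering of the pseudocode might look like the sketch below. It rests on several assumptions that are not in the answer: the function names are made up, the binary64 is passed as its high and low 32-bit words, it is positive with a value between 1 and 1.0e12, `float` arithmetic is evaluated in true single precision, and the split factor is 2^12 + 1 as in the usual Veltkamp split for a 24-bit significand.

```c
#include <assert.h>
#include <stdint.h>

/* 2^k as a binary32, built by repeated doubling or halving. */
static float two_to_the(int k)
{
    float r = 1.0f;
    for (; k > 0; --k) r *= 2.0f;
    for (; k < 0; ++k) r *= 0.5f;
    return r;
}

/* TwoSum: x = fl(a + b) and y such that a + b = x + y exactly. */
static void two_sum(float a, float b, float *x, float *y)
{
    float sum = a + b;
    float z = sum - a;
    *y = (a - (sum - z)) + (b - z);
    *x = sum;
}

/* Veltkamp split for binary32; the factor is 2^12 + 1. */
static void split(float a, float *hi, float *lo)
{
    volatile float c = 4097.0f * a;  /* volatile: the product must be rounded before use */
    float t = c - a;
    *hi = c - t;
    *lo = a - *hi;
}

/* TwoProduct: x = fl(a * b) and y such that a * b = x + y exactly. */
static void two_product(float a, float b, float *x, float *y)
{
    float a1, a2, b1, b2;
    float prod = a * b;
    split(a, &a1, &a2);
    split(b, &b1, &b2);
    *y = a2 * b2 - (((prod - a1 * b1) - a2 * b1) - a1 * b2);
    *x = prod;
}

/* Rescale: multiply the binary64 given by (hi, lo) by the binary32 a. */
static float rescale(uint32_t hi, uint32_t lo, float a)
{
    int e = (int)((hi >> 20) & 0x7FFu) - 1023;
    assert(e >= 0 && e <= 40);       /* assumed range: 1 Wh up to the meter's limit */
    float f, p, s, h, r, q;

    /* first term */
    f = (float)(0x800000u + (((hi & 0xFFFFFu) << 3) | (lo >> 29))) * two_to_the(e - 23);
    two_product(f, a, &p, &s);

    /* second term */
    f = (float)((lo >> 6) & 0x7FFFFFu) * two_to_the(e - 46);
    two_product(f, a, &h, &r);
    two_sum(p, h, &p, &q);
    s = s + (q + r);                 /* the error term */

    /* third term */
    f = (float)(lo & 0x3Fu) * two_to_the(e - 52);
    two_product(f, a, &h, &r);
    two_sum(p, h, &p, &q);
    s = s + (q + r);                 /* the error term */

    return p + s;
}
```

A call such as `rescale(0x426D1A94u, 0xA2000000u, 1.0e-3f)` rescales the bit pattern of 1.0e12. Two things are worth keeping in mind when reading the result: the multiplier is fl(1E-3), which is not exactly 1/1000, and the result is itself a binary32, whose neighbouring values near 1.0e9 are 64 apart.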

Answer 3

1e12 Wh / 1 kWh = 1e9.

A 4-byte, 32-bit INT (signed or unsigned) gives you a little more than 9 significant digits. But you would have to remember that it is in units of kWh, not Wh. And each time you add to it, you are potentially getting another rounding error.
FLOAT has only ~7 digits of resolution; you need 9. It is OK for the sensor to send a FLOAT, but it is not OK to accumulate in FLOAT.
DOUBLE is 8 bytes, ~16 significant digits. This would let you store Wh.

A compromise is to accumulate using DOUBLE, but divide by 1000 and store in a 4-byte INT.
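On a system that does have a native double type (which, per the question, the DCS does not), this compromise could be sketched in C as below. The function name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: whole kWh from a Wh count held in a double. */
static uint32_t wh_to_kwh_u32(double wh)
{
    /* The quotient must fit in an unsigned 32-bit integer. */
    assert(wh >= 0.0 && wh < 4294967296000.0);
    return (uint32_t)(wh / 1000.0);   /* the conversion truncates toward zero */
}
```

The meter's maximum of 1.0e12 Wh maps to 1,000,000,000 kWh, which fits in both a signed and an unsigned 32-bit integer.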

Is there a problem with storing in DOUBLE? Other than taking extra space, it essentially solves all the problems: more than adequate resolution and protection against rounding errors, the ability to store the 'natural' unit of Wh, etc. I would use DOUBLE if possible.

(An unsigned 32-bit integer would be my second choice.)

I would not even consider a 40-bit integer (or two 32-bit ints), because it would be clumsy to work with and would probably make porting difficult.

Thinking out of the box...

Store subtotals for each year. Then, when you need the grand total, sum up the subtotals. (This is what I might do anyway for a database-oriented solution.)
