How are floating-point numbers converted to scientific notation for storage?

Question

http://www.cs.yale.edu/homes/aspnes/pinewiki/C(2f)FloatingPoint.html

I was looking into why there are sometimes rounding issues when storing a float. I read the above link, and see that floats are converted to scientific notation.

https://babbage.cs.qc.cuny.edu/IEEE-754/index.xhtml

The base is always 2. So 8 is stored as 1 * 2^3, and 9 is stored as 1.001 * 2^3 (with the significand 1.001 written in binary).

What is the math algorithm to determine the mantissa/significand and exponent?

Answer 1

Here is C++ code to convert a decimal string to a binary floating-point value. Although the question is tagged C, I presume the question is more about the algorithm and calculations than the programming language.

The DecimalToFloat class is constructed with a string that contains solely decimal digits and a decimal point (a period, at most one). In its constructor, it shows how to use elementary school multiplication and long division to convert the number from decimal to binary. This demonstrates the fundamental concepts using elementary arithmetic. Real implementations of decimal-to-floating-point conversion in commercial software use algorithms that are faster and more complicated. They involve prepared tables, analysis, and proofs and are the subjects of academic papers. A significant problem of quality implementations of decimal-to-binary-floating-point conversion is getting the rounding correct. The disparate nature of powers of ten and powers of two (both positive and negative powers) makes it tricky to correctly determine when some values are above or below a point where rounding changes. Normally, when we are parsing something like 123e300, we want to figure out the binary floating-point result without actually calculating 10³⁰⁰. That is a much more extensive subject.

The GetValue routine finishes the preparation of the number, taking the information prepared by the constructor and rounding it to the final floating-point form.

Negative numbers and exponential (scientific) notation are not handled. Handling negative numbers is of course easy. Exponential notation could be accommodated by shifting the input: moving the decimal point right for positive exponents or left for negative exponents. Again, this is not the fastest way to perform the conversion, but it demonstrates fundamental ideas.

/*  This code demonstrates conversion of decimal numerals to binary
    floating-point values using the round-to-nearest-ties-to-even rule.

    Infinities and subnormal values are supported and assumed.

    The basic idea is to convert the decimal numeral to binary using methods
    taught in elementary school.  The integer digits are repeatedly divided by
    two to extract a string of bits in low-to-high position-value order.  Then
    sub-integer digits are repeatedly multiplied by two to continue extracting
    a string of bits in high-to-low position-value order.  Once we have enough
    bits to determine the rounding direction or the processing exhausts the
    input, the final value is computed.

    This code is not (and will not be) designed to be efficient.  It
    demonstrates the fundamental mathematics and rounding decisions.
*/


#include <algorithm>
#include <limits>
#include <cmath>
#include <cstring>


template<typename Float> class DecimalToFloat
{
private:

    static_assert(std::numeric_limits<Float>::radix == 2,
        "This code requires the floating-point radix to be two.");

    //  Abbreviations for parameters describing the floating-point format.
    static const int Digits          = std::numeric_limits<Float>::digits;
    static const int MaximumExponent = std::numeric_limits<Float>::max_exponent;
    static const int MinimumExponent = std::numeric_limits<Float>::min_exponent;

    /*  For any rounding rule supported by IEEE 754 for binary floating-point,
        the direction in which a floating-point result should be rounded is
        completely determined by the bit in the position of the least
        significant bit (LSB) of the significand and whether the value of the
        trailing bits are zero, between zero and 1/2 the value of the LSB,
        exactly 1/2 the LSB, or between 1/2 the LSB and 1.

        In particular, for round-to-nearest, ties-to-even, the decision is:

            LSB     Trailing Bits   Direction
            0       0               Down
            0       In (0, 1/2)     Down
            0       1/2             Down
            0       In (1/2, 1)     Up
            1       0               Down
            1       In (0, 1/2)     Down
            1       1/2             Up
            1       In (1/2, 1)     Up

        To determine whether the value of the trailing bits is 0, in (0, 1/2),
        1/2, or in (1/2, 1), it suffices to know the first of the trailing bits
        and whether the remaining bits are zeros or not:

            First   Remaining       Value of Trailing Bits
            0       All zeros       0
            0       Not all zeros   In (0, 1/2)
            1       All zeros       1/2
            1       Not all zeros   In (1/2, 1)

        To capture that information, we maintain two bits in addition to the
        bits in the significand.  The first is called the Round bit.  It is the
        first bit after the position of the least significant bit in the
        significand.  The second is called the Sticky bit.  It is set if any
        trailing bit after the first is set.

        The bits for the significand are kept in an array along with the Round
        bit and the Sticky bit.  The constants below provide array indices for
        locating the LSB, the Round Bit, and the Sticky bit in that array.
    */
    static const int LowBit = Digits-1; //  Array index for LSB in significand.
    static const int Round  = Digits;   //  Array index for rounding bit.
    static const int Sticky = Digits+1; //  Array index for sticky bit.

    char *Decimal;          //  Work space for the incoming decimal numeral.

    int  N;                 //  Number of bits incorporated so far.
    char Bits[Digits+2];    //  Bits for significand plus two for rounding.
    int  Exponent;          //  Exponent adjustment needed.


    /*  PushBitHigh inserts a new bit into the high end of the bits we are
        accumulating for the significand of a floating-point number.

        First, the Round bit is shifted down by incorporating it into the Sticky
        bit, using an OR so that the Sticky bit is set iff any bit pushed below
        the Round bit is set.

        Then all bits from the significand are shifted down one position,
        which moves the least significant bit into the Round position and
        frees up the most significant bit.

        Then the new bit is put into the most significant bit.
    */
    void PushBitHigh(char Bit)
    {
        Bits[Sticky] |= Bits[Round];
        std::memmove(Bits+1, Bits, Digits * sizeof *Bits);
        Bits[0] = Bit;

        ++N;        //  Count the number of bits we have put in the significand.
        ++Exponent; //  Track the absolute position of the leading bit.
    }


    /*  PushBitLow inserts a new bit into the low end of the bits we are
        accumulating for the significand of a floating-point number.

        If we have no previous bits and the new bit is zero, we are just
        processing leading zeros in a number less than 1.  These zeros are not
        significant.  They tell us the magnitude of the number.  We use them
        only to track the exponent that records the position of the leading
        significant bit.  (However, exponent is only allowed to get as small as
        MinimumExponent, after which we must put further bits into the
        significand, forming a subnormal value.)

        If the bit is significant, we record it.  If we have not yet filled the
        regular significand and the Round bit, the new bit is recorded in the
        next space.  Otherwise, the new bit is incorporated into the Sticky bit
        using an OR so that the Sticky bit is set iff any bit below the Round
        bit is set.
    */
    void PushBitLow(char Bit)
    {
        if (N == 0 && Bit == 0 && MinimumExponent < Exponent)
            --Exponent;
        else
            if (N < Sticky)
                Bits[N++] = Bit;
            else
                Bits[Sticky] |= Bit;
    }


    /*  Determined tells us whether the final value to be produced can be
        determined without any more low bits.  This is true if and only if:

            we have all the bits to fill the significand, and

            we have at least one more bit to help determine the rounding, and

            either we know we will round down because the Round bit is 0 or we
            know we will round up because the Round bit is 1 and at least one
            further bit is 1 or the least significant bit is 1.
    */
    bool Determined() const
    {
        if (Digits < N)
            if (Bits[Round])
                return Bits[LowBit] || Bits[Sticky];
            else
                return 1;
        else
            return 0;
    }


    //  Get the floating-point value that was parsed from the source numeral.
    Float GetValue() const
    {
        //  Decide whether to round up or not.
        bool RoundUp = Bits[Round] && (Bits[LowBit] || Bits[Sticky]);

        /*  Now we prepare a floating-point number that contains a significand
            with the bits we received plus, if we are rounding up, one added to
            the least significant bit.
        */

        //  Start with the adjustment to the LSB for rounding.
        Float x = RoundUp;

        //  Add the significand bits we received.
        for (int i = Digits-1; 0 <= i; --i)
            x = (x + Bits[i]) / 2;

        /*  If we rounded up, the addition may have carried out of the
            initial significand.  In this case, adjust the scale.
        */
        int e = Exponent;
        if (1 <= x)
        {
            x /= 2;
            ++e;
        }

        //  Apply the exponent and return the value.
        return MaximumExponent < e ? INFINITY : std::scalbn(x, e);
    }


public:

    /*  Constructor.

        Note that this constructor allocates work space.  It is bad form to
        allocate in a constructor, but this code is just to demonstrate the
        mathematics, not to provide a conversion for use in production
        software.
    */
    DecimalToFloat(const char *Source) : N(), Bits(), Exponent()
    {
        //  Skip leading zeros.
        while (*Source == '0')
            ++Source;

        size_t s = std::strlen(Source);

        /*  Count the number of integer digits (digits before the decimal
            point if it is present or before the end of the string otherwise)
            and calculate the number of digits after the decimal point, if any.
        */
        size_t DigitsBefore = 0;
        while (Source[DigitsBefore] != '.' && Source[DigitsBefore] != 0)
            ++DigitsBefore;

        size_t DigitsAfter = Source[DigitsBefore] == '.' ? s-DigitsBefore-1 : 0;

        /*  Allocate space for the integer digits or the sub-integer digits,
            whichever is more numerous.
        */
        Decimal = new char[std::max(DigitsBefore, DigitsAfter)];

        /*  Copy the integer digits into our work space, converting them from
            digit characters ('0' to '9') to numbers (0 to 9).
        */
        for (size_t i = 0; i < DigitsBefore; ++i)
            Decimal[i] = Source[i] - '0';

        /*  Convert the integer portion of the numeral to binary by repeatedly
            dividing it by two.  The remainders form a bit string representing
            a binary numeral for the integer part of the number.  They arrive
            in order from low position value to high position value.

            This conversion continues until the numeral is exhausted (High <
            Low is false) or we see it is so large the result overflows
            (Exponent <= MaximumExponent is false).

            Note that Exponent may exceed MaximumExponent while we have only
            produced 0 bits during the conversion.  However, because we skipped
            leading zeros above, we know there is a 1 bit coming.  That,
            combined with the excessive Exponent, guarantees the result will
            overflow.
        */

        for (char *High = Decimal, *Low = Decimal + DigitsBefore;
            High < Low && Exponent <= MaximumExponent;)
        {
            //  Divide by two.
            char Remainder = 0;
            for (char *p = High; p < Low; ++p)
            {
                /*  This is elementary school division:  We bring in the
                    remainder from the higher digit position and divide by the
                    divisor.  The remainder is kept for the next position, and
                    the quotient becomes the new digit in this position.
                */
                char n = *p + 10*Remainder;
                Remainder = n % 2;
                n /= 2;

                /*  As the number becomes smaller, we discard leading zeros:
                    If the new digit is zero and is in the highest position,
                    we discard it and shorten the number we are working with.
                    Otherwise, we record the new digit.
                */
                if (n == 0 && p == High)
                    ++High;
                else
                    *p = n;
            }

            //  Push remainder into high end of the bits we are accumulating.
            PushBitHigh(Remainder);
        }

        /*  Copy the sub-integer digits into our work space, converting them
            from digit characters ('0' to '9') to numbers (0 to 9).

            Then convert the sub-integer portion of the numeral to binary by
            repeatedly multiplying it by two.  The carry-outs continue the bit
            string.  They arrive in order from high position value to low
            position value.
        */

        for (size_t i = 0; i < DigitsAfter; ++i)
            Decimal[i] = Source[DigitsBefore + 1 + i] - '0';

        for (char *High = Decimal, *Low = Decimal + DigitsAfter;
            High < Low && !Determined();)
        {
            //  Multiply by two.
            char Carry = 0;
            for (char *p = Low; High < p--;)
            {
                /*  This is elementary school multiplication:  We multiply
                    the digit by the multiplicand and add the carry.  The
                    result is separated into a single digit (n % 10) and a
                    carry (n / 10).
                */
                char n = *p * 2 + Carry;
                Carry = n / 10;
                n %= 10;

                /*  Here we discard trailing zeros:  If the new digit is zero
                    and is in the lowest position, we discard it and shorten
                    the numeral we are working with.  Otherwise, we record the
                    new digit.
                */
                if (n == 0 && p == Low-1)
                    --Low;
                else
                    *p = n;
            }

            //  Push carry into low end of the bits we are accumulating.
            PushBitLow(Carry);
        }

        delete [] Decimal;
    }

    //  Conversion operator.  Returns a Float converted from this object.
    operator Float() const { return GetValue(); }
};


#include <iostream>
#include <cstdio>
#include <cstdlib>


static void Test(const char *Source)
{
    std::cout << "Testing " << Source << ":\n";

    DecimalToFloat<float> x(Source);

    char *end;
    float e = std::strtof(Source, &end);
    float o = x;

    /*  Note:  The C printf is used here for the %a conversion, which shows the
        bits of floating-point values clearly.  If your C++ implementation does
        not support this, this may be replaced by any display of floating-point
        values you desire, such as printing them with all the decimal digits
        needed to distinguish the values.
    */
    std::printf("\t%a, %a.\n", e, o);

    if (e != o)
    {
        std::cout << "\tError, results do not match.\n";
        std::exit(EXIT_FAILURE);
    }
}


int main(void)
{
    Test("0");
    Test("1");
    Test("2");
    Test("3");
    Test(".25");
    Test(".0625");
    Test(".1");
    Test(".2");
    Test(".3");
    Test("3.14");
    Test(".00000001");
    Test("9841234012398123");
    Test("340282346638528859811704183484516925440");
    Test("340282356779733661637539395458142568447");
    Test("340282356779733661637539395458142568448");
    Test(".00000000000000000000000000000000000000000000140129846432481707092372958328991613128026194187651577175706828388979108268586060148663818836212158203125");

    //  This should round to the minimum positive (subnormal), as it is just above mid-way.
    Test(".000000000000000000000000000000000000000000000700649232162408535461864791644958065640130970938257885878534141944895541342930300743319094181060791015626");

    //  This should round to zero, as it is mid-way, and the even rule applies.
    Test(".000000000000000000000000000000000000000000000700649232162408535461864791644958065640130970938257885878534141944895541342930300743319094181060791015625");

    //  This should round to zero, as it is just below mid-way.
    Test(".000000000000000000000000000000000000000000000700649232162408535461864791644958065640130970938257885878534141944895541342930300743319094181060791015624");
}

Answer 2

One of the surprising things about a real, practical computer -- surprising to beginning programmers who have been tasked with writing artificial little binary-to-decimal conversion programs, anyway -- is how thoroughly ingrained the binary number system is in an actual computer, and how few and how diffuse any actual binary/decimal conversion routines actually are. In the C world, for example (and if we confine our attention to integers for the moment), there is basically one binary-to-decimal conversion routine, and it's buried inside printf, where the %d directive is processed. There are perhaps three decimal-to-binary converters: atof(), strtol(), and the %d conversion inside scanf. (There might be another one inside the C compiler, where it converts your decimal constants into binary, although the compiler might just call strtol() directly for those, too.)

I bring this all up for background. The question of "what's the actual algorithm for constructing floating-point numbers internally?" is a fair one, and I'd like to think I know the answer, but as I mentioned in the comments, I'm chagrined to discover that I don't, really: I can't describe a clear, crisp "algorithm". I can and will show you some code that gets the job done, but you'll probably find it unsatisfying, as if I'm cheating somehow -- because a number of the interesting details happen more or less automatically, as we'll see.

Basically, I'm going to write a version of the standard library function atof(). Here are my ground rules:

1. I'm going to assume that the input is a string of characters. (This isn't really an assumption at all; it's a restatement of the original problem, which is to write a version of atof.)
2. I'm going to assume that we can construct the floating-point number "0.0". (In IEEE 754 and most other formats, it's all-bits-0, so that's not too hard.)
3. I'm going to assume that we can convert the integers 0-9 to their corresponding floating-point equivalents.
4. I'm going to assume that we can add and multiply any floating-point numbers we want to. (This is the biggie, although I'll describe those algorithms later.) But on any modern computer, there's almost certainly a floating-point unit that has built-in instructions for the basic floating-point operations like addition and multiplication, so this isn't an unreasonable assumption, either. (But it does end up hiding some of the interesting aspects of the algorithm, passing the buck to the hardware designer to have implemented the instructions correctly.)
5. I'm going to initially assume that we have access to the standard library functions atoi and pow. This is a pretty big assumption, but again, I'll describe later how we could write those from scratch if we wanted to. I'm also going to assume the existence of the character classification functions in <ctype.h>, especially isdigit().

But that's about it. With those prerequisites, it turns out we can write a fully-functional version of atof() all by ourselves. It might not be fast, and it almost certainly won't have all the right rounding behaviors out at the edges, but it will work pretty well. (I'm even going to handle negative numbers, and exponents.) Here's how it works:

skip leading whitespace
look for '-'
scan digit characters, converting each one to the corresponding digit by subtracting '0' (aka ASCII 48)
accumulate a floating-point number (with no fractional part yet) representing the integer implied by the digits -- the significand -- and this is the real math, multiplying the running accumulation by 10 and adding the next digit
if we see a decimal point, count the number of digits after it
when we're done scanning digits, see if there's an e / E and some more digits indicating an exponent
if necessary, multiply or divide our accumulated number by a power of 10, to take care of digits past the decimal, and/or the explicit exponent.

Here's the code:

#include <ctype.h>
#include <stdlib.h>      /* just for atoi() */
#include <math.h>        /* just for pow() */

#define TRUE 1
#define FALSE 0

double my_atof(const char *str)
{
    const char *p;
    double ret;
    int negflag = FALSE;
    int exp;
    int expflag;

    p = str;

    while(isspace((unsigned char)*p))   /* cast: a negative char is not a valid argument */
        p++;

    if(*p == '-')
        {
        negflag = TRUE;
        p++;
        }

    ret = 0.0;              /* assumption 2 */
    exp = 0;
    expflag = FALSE;

    while(TRUE)
        {
        if(*p == '.')
            expflag = TRUE;
        else if(isdigit((unsigned char)*p))
            {
            int idig = *p - '0';     /* assumption 1 */
            double fdig = idig;      /* assumption 3 */
            ret = 10. * ret + fdig;  /* assumption 4 */
            if(expflag)
                exp--;
            }
        else    break;

        p++;
        }

    if(*p == 'e' || *p == 'E')
        exp += atoi(p+1);   /* assumption 5a */

    if(exp != 0)
        ret *= pow(10., exp);   /* assumption 5b */

    if(negflag)
        ret = -ret;

    return ret;
}

Before we go further, I encourage you to copy-and-paste this code into a nearby C compiler, and compile it, to convince yourself that I haven't cheated too badly. Here's a little main() to invoke it with:

#include <stdio.h>

int main(int argc, char *argv[])
{
    if(argc < 2)
        return 1;       /* no number given on the command line */

    double d = my_atof(argv[1]);
    printf("%s -> %g\n", argv[1], d);
    return 0;
}

(If you or your IDE aren't comfortable with command-line invocations, you can use fgets or scanf to read the string to hand to my_atof, instead.)

But, I know, your question was "How does 9 get converted to 1.001 * 2^3?", and I still haven't really answered that, have I? So let's see if we can find where that happens.

First of all, that bit pattern 1001₂ for 9 came from... nowhere, or everywhere, or it was there all along, or something. The character 9 came in, probably with a bit pattern of 111001₂ (in ASCII). We subtracted 48 = 110000₂, and out popped 1001₂. (Even before doing the subtraction, you can see it hiding there at the end of 111001.)

But then what turned 1001 into 1.001E3 (that is, 1.001₂ × 2³)? That was basically my "assumption 3", as embodied in the line

double fdig = idig;

It's easy to write that line in C, so we don't really have to know how it's done, and the compiler probably turns it into a 'convert integer to float' instruction, so the compiler writer doesn't have to know how to do it, either.

But, if we did have to implement that ourselves, at the lowest level, we could. We know we have a single-digit (decimal) number, occupying at most 4 bits. We could stuff those bits into the significand field of our floating-point format, with a fixed exponent (perhaps -3). We might have to deal with the peculiarities of an "implicit 1" bit, and if we didn't want to inadvertently create a denormalized number, we might have to do some more tinkering, but it would be straightforward enough, and relatively easy to get right, because there are only 10 cases to test. (Heck, if we found writing code to do the bit manipulations troublesome, we could even use a 10-entry lookup table.)

Since 9 is a single-digit number, we're done. But for a multiple-digit number, our next concern is the arithmetic we have to do: multiplying the running sum by 10, and adding in the next digit. How does that work, exactly?

Again, if we're writing a C (or even an assembly language) program, we don't really need to know, because our machine's floating-point 'add' and 'multiply' instructions will do everything for us. But, also again, if we had to do it ourselves, we could. (This answer's getting way too long, so I'm not going to discuss floating-point addition and multiplication algorithms just yet. Maybe farther down.)

Finally, the code as presented so far "cheated" by calling the library functions atoi and pow. I won't have any trouble convincing you that we could have implemented atoi ourselves if we wanted/had to: it's basically just the same digit-accumulation code we already wrote. And pow isn't too hard, either, because in our case we don't need to implement it in full generality: we're always raising to integer powers, so it's straightforward repeated multiplication, and we've already assumed we know how to do multiplication.

(With that said, computing a large power of 10 as part of our decimal-to-binary algorithm is problematic. As @Eric Postpischil noted in his answer, "Normally we want to figure out the binary floating-point result without actually calculating 10^N." Me, since I don't know any better, I'll compute it anyway, but if I wrote my own pow() I'd use the binary exponentiation algorithm, since it's super easy to implement and quite nicely efficient.)

I said I'd discuss floating-point addition and multiplication routines. Suppose you want to add two floating-point numbers. If they happen to have the same exponent, it's easy: add the two significands (and keep the exponent the same), and that's your answer. (How do you add the significands? Well, I assume you have a way to add integers.) If the exponents are different, but relatively close to each other, you can pick the smaller one and add N to it to make it the same as the larger one, while simultaneously shifting the significand to the right by N bits. (You've just created a denormalized number.) Once the exponents are the same, you can add the significands, as before. After the addition, it may be important to renormalize the numbers, that is, to detect if one or more leading bits ended up as 0 and, if so, shift the significand left and decrement the exponent. Finally, if the exponents are too different, such that shifting one significand to the right by N bits would shift it all away, this means that one number is so much smaller than the other that all of it gets lost in the roundoff when adding them.

Multiplication: Floating-point multiplication is actually somewhat easier than addition. You don't have to worry about matching up the exponents: the final product is basically a new number whose significand is the product of the two significands, and whose exponent is the sum of the two exponents. The only trick is that the product of the two M-bit significands is nominally 2M bits, and you may not have a multiplier that can do that. If the only multiplier you have available maxes out at an M-bit product, you can take your two M-bit significands and literally split them in half by bits:

signif1 = a × 2^(M/2) + b
signif2 = c × 2^(M/2) + d

So by ordinary algebra we have

signif1 × signif2 = ac × 2^M + ad × 2^(M/2) + bc × 2^(M/2) + bd

Each of those partial products ac, ad, etc. is an M-bit product. Multiplying by 2^(M/2) or 2^M is easy, because it's just a left shift. And adding the terms up is something we already know how to do. We actually only care about the upper M bits of the product, so since we're going to throw away the rest, I imagine we could cheat and skip the bd term, since it contributes nothing (although it might end up slightly influencing a properly-rounded result).

But anyway, the details of the addition and multiplication algorithms, and the knowledge they contain about the floating-point representation we're using, end up forming the other half of the answer to the question of the decimal-to-binary "algorithm" you're looking for. If you convert, say, the number 5.703125 using the code I've shown, out will pop the binary floating-point number 1.01101101₂ × 2², but nowhere did we explicitly compute that significand 1.01101101 or that exponent 2 -- they both just fell out of all the digitwise multiplications and additions we did.
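If you'd like to check a significand and exponent like those against what your own machine stores, the standard C library function frexp will pull a double apart for you. Here's a small sketch; the wrapper name split_binary is made up, and it assumes a finite, nonzero input. frexp returns a fraction in the range [0.5, 1), so the wrapper doubles it and lowers the exponent by one to get the 1.xxx form used in this answer.

```c
#include <math.h>

/* Split x into m * 2^e with 1 <= m < 2 (x assumed finite and nonzero). */
static double split_binary(double x, int *e)
{
    int fe;
    double f = frexp(x, &fe);   /* x = f * 2^fe, with 0.5 <= |f| < 1 */
    *e = fe - 1;                /* move one factor of 2 from the exponent... */
    return f * 2.0;             /* ...into the significand */
}
```

For 5.703125 this gives the exponent 2 and the significand 1.42578125, which is 1.01101101₂ written in decimal.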

Finally, if you're still with me, here's a quick and easy integer-power-only pow function using binary exponentiation:

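double my_pow(double a, unsigned int b)
{
    double ret = 1;     /* running product of the selected powers */
    double fac = a;     /* a^1, a^2, a^4, a^8, ... */

    while(1) {
        if(b & 1) ret *= fac;   /* low bit of b set: include this power */
        b >>= 1;                /* move on to the next bit of b */
        if(b == 0) break;
        fac *= fac;             /* square for the next bit */
    }
    return ret;
}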

This is a nifty little algorithm. If we ask it to compute, say, 10²¹, it does not multiply 10 by itself 21 times. Instead, it repeatedly squares 10, leading to the exponential sequence 10¹, 10², 10⁴, 10⁸, or rather, 10, 100, 10000, 100000000... Then it looks at the binary representation of 21, namely 10101, and selects only the intermediate results 10¹, 10⁴, and 10¹⁶ to multiply into its final return value, yielding 10¹⁺⁴⁺¹⁶, or 10²¹, as desired. It therefore runs in time O(log₂(N)), not O(N).

And, tune in tomorrow for our next exciting episode when we'll go in the opposite direction, writing a binary-to-decimal converter which will require us to do... (ominous chord)
floating point long division!

Answer 3

Here's a completely different answer, that tries to focus on the "algorithm" part of the question. I'll start with the example you asked about, converting the decimal integer 9 to the binary scientific notation number 1.001₂ × 2³. The algorithm is in two parts: (1) convert the decimal integer 9 to the binary integer 1001₂, and (2) convert that binary integer into binary scientific notation.

Step 1. Convert a decimal integer to a binary integer. (You can skip over this part if you already know it. Also, although this part of the algorithm is going to look perfectly fine, it turns out it's not the sort of thing that's actually used anywhere on a practical binary computer.)

The algorithm is built around a number we're working on, n, and a binary number we're building up, b.

1. Set n initially to the number we're converting, 9.
2. Set b to 0.
3. Compute the remainder when dividing n by 2. In our example, the remainder of 9 ÷ 2 is 1.
4. The remainder is one bit of our binary number. Tack it on to b. In our example, b is now 1. Also, here we're going to be tacking bits on to b on the left.
5. Divide n by 2 (discarding the remainder). In our example, n is now 4.
6. If n is now 0, we're done.
7. Go back to step 3.

At the end of the first trip through the algorithm, n is 4 and b is 1.

The next trip through the loop will extract the bit 0 (because 4 divided by 2 is 2, remainder 0). So b goes to 01, and n goes to 2.

The next trip through the loop will extract the bit 0 (because 2 divided by 2 is 1, remainder 0). So b goes to 001, and n goes to 1.

The next trip through the loop will extract the bit 1 (because 1 divided by 2 is 0, remainder 1). So b goes to 1001, and n goes to 0.

And since n is now 0, we're done. Meanwhile, we've built up the binary number 1001 in b, as desired.

Here's that example again, in tabular form. At each step, we compute n divided by two (or, in C, n/2), and the remainder when dividing n by 2, which in C is n%2. At the next step, n gets replaced by n/2, and the next bit (which is n%2) gets tacked on at the left of b.

step       n       b     n/2     n%2
   0       9       0       4       1
   1       4       1       2       0
   2       2      01       1       0
   3       1     001       0       1
   4       0    1001

Let's run through that again, for the number 25:

step       n       b     n/2     n%2
   0      25       0      12       1
   1      12       1       6       0
   2       6      01       3       0
   3       3     001       1       1
   4       1    1001       0       1
   5       0   11001

You can clearly see that the n column is driven by the n/2 column, because in step 5 of the algorithm as stated we divided n by 2. (In C this would be n = n / 2, or n /= 2.) You can clearly see the binary result appearing (in right-to-left order) in the n%2 column.

So that's one way to convert decimal integers to binary. (As I mentioned, though, it's likely not the way your computer does it. Among other things, the act of tacking a bit on to the left end of b turns out to be rather unorthodox.)
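For what it's worth, here's a rough C sketch of Step 1. The function name to_binary and its buffer-based interface are made up for illustration. To tack bits on at the left, it fills a temporary array from the right-hand end and then copies the finished digits out.

```c
#include <limits.h>
#include <stddef.h>
#include <string.h>

/* Write the binary digits of n into buf as a string.
   Returns buf, or NULL if buf is too small. */
static char *to_binary(unsigned int n, char *buf, size_t bufsize)
{
    char tmp[sizeof n * CHAR_BIT + 1];      /* one char per bit, plus terminator */
    size_t pos = sizeof tmp - 1;

    tmp[pos] = '\0';
    do {
        tmp[--pos] = (char)('0' + n % 2);   /* the remainder is the next bit, added on the left */
        n /= 2;                             /* divide n by 2, discarding the remainder */
    } while (n != 0);

    size_t len = sizeof tmp - pos;          /* digits plus terminator */
    if (len > bufsize) return NULL;
    memcpy(buf, tmp + pos, len);
    return buf;
}
```

With a 40-character buffer, to_binary(9, buf, sizeof buf) leaves "1001" in buf, and 25 comes out as "11001", matching the two tables.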

Step 2. Convert a binary integer to a binary number in scientific notation.

Before we begin with this half of the algorithm, it's important to realize that scientific (or "exponential") representations are typically not unique. Returning to decimal for a moment, let's think about the number "one thousand". Most often we'll represent that as 1 × 10³. But we could also represent it as 10 × 10², or 100 × 10¹, or even crazier representations like 10000 × 10⁻¹, or 0.01 × 10⁵.

So, in practice, when we're working in scientific notation, we'll usually set up an additional rule or guideline, stating that we'll try to keep the mantissa (also called the "significand") within a certain range. For base 10, usually the goal is either to keep it in the range 1 ≤ mantissa < 10, or 0.1 ≤ mantissa < 1. That is, we like numbers like 1 × 10³ or 0.1 × 10⁴, but we don't like numbers like 100 × 10¹ or 0.01 × 10⁵.

How do we keep our representations in the range we like? What if we've got a number (perhaps the intermediate result of a calculation) that's in a form we don't like? The answer is simple, and it depends on a pattern you've probably already noticed: If you multiply the mantissa by 10, and if you simultaneously subtract 1 from the exponent, you haven't changed the value of the number. Similarly, you can divide the mantissa by 10 and increment the exponent, again without changing anything.

When we convert a scientific-notation number into the form we like, we say we're normalizing the number.

One more thing: since 10⁰ is 1, we can preliminarily convert any integer to scientific notation by simply multiplying it by 10⁰. That is, 9 is 9 × 10⁰, and 25 is 25 × 10⁰. If we do it that way we'll usually get a number that's in a form we "don't like" (that is, "nonnormalized"), but now we have an idea of how to fix that.

So let's return to base 2, and the rest of this second half of our algorithm. Everything we've said so far about decimal scientific notation is also true about binary scientific notation, as long as we make the obvious changes of "10" to "2".

To convert the binary integer 1001₂ to binary scientific notation, we first multiply it by 2⁰, resulting in: 1001₂ × 2⁰. So actually we're almost done, except that this number is nonnormalized.

What's our definition of a normalized base-two scientific notation number? We haven't said, but the requirement is usually that the mantissa is at least 1 and less than 10₂ (that is, less than 2₁₀), or stated another way, that there is exactly one bit to the left of the binary point and that bit is always 1 (unless the whole number is 0). That is, these mantissas are normalized: 1.001₂, 1.1₂, 1.0₂, 0.0₂. These mantissas are nonnormalized: 10.01₂, 0.001₂.

So to normalize a number, we may need to multiply or divide the mantissa by 2, while incrementing or decrementing the exponent.

Putting this all together in step-by-step form: to convert a binary integer to a binary scientific number:

1. Multiply the integer by 2⁰: set the mantissa to the number we're converting, and the exponent to 0.
2. If the number is normalized (if the mantissa is 0, or if it has exactly one bit to the left of the radix point and that bit is 1), we're done.
3. If the mantissa has more than one bit to the left of the decimal point (really the "radix point" or "binary point"), divide the mantissa by 2, and increment the exponent by 1. Return to step 2.
4. (This step will never be necessary if the number we started with was an integer.) If the mantissa is nonzero but the bit to the left of the radix point is 0, multiply the mantissa by 2, and decrement the exponent by 1. Return to step 2.

Running this algorithm in tabular form for our number 9, we have:

step  mantissa  exponent
   0     1001.         0
   1     100.1         1
   2     10.01         2
   3     1.001         3

So, if you're still with me, that's how we can convert the decimal integer 9 to the binary scientific notation (or floating-point) number 1.001₂ × 2³.

And, with all of that said, the algorithm as stated so far only works for decimal integers. What if we wanted to convert, say, the decimal number 1.25 to the binary number 1.01₂ × 2⁰, or 34.125 to 1.00010001₂ × 2⁵? That's a discussion that will have to wait for another day (or for this other answer), I guess.
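Step 2 can be sketched in C along the same lines. The function name to_binary_scientific is made up, and it handles only positive integers, so only the divide-by-2 branch of the algorithm is ever needed. Dividing the mantissa by 2 is the same as moving the binary point one place to the left, so the sketch just counts how many halvings it takes to get down to a single leading 1 and then writes the remaining bits after the point.

```c
#include <stddef.h>

/* For n > 0, find e such that n = m * 2^e with 1 <= m < 2, and write the
   bits of m (for example "1.001" when n is 9) into buf.
   Returns e, or -1 if n is 0 or buf is too small. */
static int to_binary_scientific(unsigned int n, char *buf, size_t bufsize)
{
    if (n == 0) return -1;

    int e = 0;
    for (unsigned int t = n; t > 1; t /= 2)   /* one halving per exponent increment */
        e++;

    /* "1", ".", then e fraction bits (or a single "0"), then the terminator */
    size_t need = (e == 0) ? 4 : (size_t)e + 3;
    if (need > bufsize) return -1;

    size_t k = 0;
    buf[k++] = '1';                           /* the leading bit is always 1 */
    buf[k++] = '.';
    if (e == 0) buf[k++] = '0';
    for (int i = e - 1; i >= 0; i--)          /* the bits below the leading 1 */
        buf[k++] = (char)('0' + ((n >> i) & 1u));
    buf[k] = '\0';
    return e;
}
```

For 9 this returns 3 and leaves "1.001" in the buffer; for 25 it returns 4 with "1.1001". Trailing zeros are not trimmed, so 8 comes out as "1.000" with exponent 3.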
