A simple floating-point addition x+y with precision 4 (i.e. IEEE mantissa width 3) and 3 exponent bits (emax=3, emin=-4), for x = mpfr('0.75') and y = mpfr('0.03125'), incorrectly gives mpfr('0.75') as the result when it should be mpfr('0.8125'). Note that 0.03125 is a subnormal number in this reduced-precision format.
Edit: Terminal interaction extracted from link and included for future reference.
>>> "{0:.10Df}".format(mpfr('0.75')+mpfr('0.03125'))
'0.7500000000'
>>> get_context()
context(precision=4, real_prec=Default, imag_prec=Default,
round=RoundToNearest, real_round=Default, imag_round=Default,
emax=3, emin=-4,
subnormalize=True,
trap_underflow=False, underflow=False,
trap_overflow=False, overflow=False,
trap_inexact=False, inexact=True,
trap_invalid=False, invalid=False,
trap_erange=False, erange=False,
trap_divzero=False, divzero=False,
trap_expbound=False,
allow_complex=False)
>>>
Disclaimer: I maintain gmpy2.
I believe it is a bug with creating subnormals from a string. I think it is fixed in the development code but I won't be able to test until later. I'll update this answer later.
Update
The problem is not related to creating a subnormal from a string. In this case, the subnormal value is created properly. In gmpy2 2.0.x, there is a rare bug when converting a string to a subnormal. The simplest work-around is to convert the input to an mpq type first, i.e. mpfr(mpq('0.03125')).
The actual problem is the default rounding mode. The exact sum, 0.78125, lies exactly halfway between two 4-bit values, 0.75 and 0.8125. The default rounding mode, RoundToNearest, breaks the tie by selecting the value whose final bit is 0, which is 0.75. If you change the rounding mode to RoundUp, you get the expected result.
>>> from gmpy2 import *
>>> ctx=context(emax=4, emin=-4, precision=4)
>>> set_context(ctx)
>>> a=mpfr('0.75')
>>> b=mpfr('0.03125')
>>> "{0:.10Df}".format(a+b)
'0.7500000000'
>>> get_context().round=RoundUp
>>> "{0:.10Df}".format(a+b)
'0.8125000000'
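The tie can also be checked without gmpy2. Below is a minimal sketch that uses only Python's standard fractions module. The helper name round_to_bits is hypothetical, and the sketch ignores exponent limits and subnormal handling, so it only illustrates the halfway case and is not a model of what MPFR does internally.

```python
from fractions import Fraction
import math


def round_to_bits(value, precision, round_up=False):
    """Round a positive Fraction to `precision` significant bits.

    Ties go to the even neighbour by default; with round_up=True the
    value is rounded toward +infinity instead.
    """
    if value <= 0:
        raise ValueError("this sketch only handles positive values")
    # Find the exponent e such that 2**(e-1) <= value < 2**e.
    exponent = 0
    while value >= Fraction(2) ** exponent:
        exponent += 1
    while value < Fraction(2) ** (exponent - 1):
        exponent -= 1
    # Spacing between neighbouring representable values at this exponent.
    ulp = Fraction(2) ** (exponent - precision)
    scaled = value / ulp
    # round() on a Fraction rounds halfway cases to the even integer.
    steps = math.ceil(scaled) if round_up else round(scaled)
    return steps * ulp


total = Fraction(3, 4) + Fraction(1, 32)        # exact sum, 25/32 = 0.78125
print(total / Fraction(1, 16))                  # 25/2, i.e. exactly 12.5 steps
print(round_to_bits(total, 4))                  # 3/4
print(round_to_bits(total, 4, round_up=True))   # 13/16
```

The exact sum is 12.5 steps of 1/16, so the two candidates are 12/16 = 0.75 and 13/16 = 0.8125. Ties-to-even picks 12, which is why the default context returns 0.75.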
One last comment: the values of precision, emax and emin are slightly different between the IEEE standards and the MPFR library. If e is the exponent size and p is the precision (in IEEE terms), then precision should be p+1, emax should be 2**(e-1) and emin should be 4-emax-precision. This doesn't impact your question since it only changes emax.
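As a quick check of that mapping, here is a small sketch in plain Python. The function name ieee_to_mpfr_params is hypothetical and not part of gmpy2; it only evaluates the three formulas above.

```python
def ieee_to_mpfr_params(exponent_bits, mantissa_bits):
    """Map IEEE-style field widths to MPFR-style context values.

    exponent_bits is e and mantissa_bits is p in the text above.
    Returns the tuple (precision, emax, emin).
    """
    precision = mantissa_bits + 1      # add the implicit leading bit
    emax = 2 ** (exponent_bits - 1)
    emin = 4 - emax - precision
    return precision, emax, emin


# The reduced format from the question: e = 3, p = 3.
print(ieee_to_mpfr_params(3, 3))       # (4, 4, -4)
# IEEE double precision: e = 11, p = 52.
print(ieee_to_mpfr_params(11, 52))     # (53, 1024, -1073)
```

For the format in the question this gives precision=4, emax=4 and emin=-4, so only emax differs from the context shown in the question.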