understanding Fixed point arithmetic

Question

I am struggling with how to implement arithmetic on fixed-point numbers of different precision. I have read the paper by R. Yates , but I'm still lost. In what follows, I use Yates's notation, in which A(n,m) designates a signed fixed-point format with n integer bits, m fraction bits, and n + m + 1 bits overall.

Short question : How exactly is a A(a,b)*A(c,d) and A(a,b)+A(c,d) carried out when a != c and b != d ?

Long question : In my FFT algorithm, I am generating a random signal having values between -10V and 10V signed input(in) which is scaled to A(15,16) , and the twiddle factors (tw) are scaled to A(2,29) . Both are stored as int s. Something like this:

float temp = (((float)rand() / (float)(RAND_MAX)) * (MAX_SIG - MIN_SIG)) + MIN_SIG;
int in_seq[i][j] = (int)(roundf(temp *(1 << numFracBits)));

And similarly for the twiddle factors.

Now I need to perform

res = a*tw
Questions :
a) how do I implement this?
b) Should the size of res be 64 bit?
c) can I make 'res' A(17,14) since I know the ranges of a and tw ? if yes, should I be scaling a*tw by 2^14 to store correct value in res ?
a + res
Questions :
a) How do I add these two numbers of different Q formats?
b) if not, how do I do this operation?

Answer 1

Maybe it's easiest to make an example.

Suppose you want to add two numbers, one in the format A(3, 5) , and the other in the format A(2, 10) .

You can do it by converting both numbers to a "common" format - that is, they should have the same number of bits in the fractional part.

A conservative way of doing that is to choose the greater number of bits. That is, convert the first number to A(3, 10) by shifting it 5 bits left. Then, add the second number.

The result of an addition has the range of the greater format, plus 1 bit. In my example, if you add A(3, 10) and A(2, 10) , the result has the format A(4, 10) .

I call this the "conservative" way because you cannot lose information - it guarantees that the result is representable in the fixed-point format, without losing precision. However, in practice, you will want to use smaller formats for your calculation results. To do that, consider these ideas:

You can use the less-accurate format as your common representation. In my example, you can convert the second number to A(2, 5) by shifting the integer right by 5 bits. This will lose precision, and usually this precision loss is not problematic, because you are going to add a less-precise number to it anyway.
You can use 1 fewer bit for the integer part of the result. In applications, it often happens that the result cannot be too big. In this case, you can allocate 1 fewer bit to represent it. You might want to check if the result is too big, and clamp it to the needed range.

Now, on multiplication.

It's possible to multiply two fixed-point numbers directly - they can be in any format. The format of the result is the "sum of the input formats" - all the parts added together - and add 1 to the integer part. In my example, multiplying A(3, 5) with A(2, 10) gives a number in the format A(7, 15) . This is a conservative rule - the output format is able to store the result without loss of precision, but in applications, almost always you want to cut the precision of the output, because it's just too many bits.

In your case, where the number of bits for all numbers is 32, you probably want to lose precision in such a way that all intermediate results have 32 bits.

For example, multiplying A(17, 14) with A(2, 29) gives A(20, 43) - 64 bits required. You probably should cut 32 bits from it, and throw away the rest. What is the range of the result? If your twiddle factor is a number up to 4, the result is probably limited by 2^19 (the conservative number 20 above is needed to accommodate the edge case of multiplying -1 << 31 by -1 << 31 - it's almost always worth rejecting this edge-case).

So use A(19, 12) for your output format, ie remove 31 bits from the fractional part of your output.

So, instead of

res = a*tw;

you probably want

int64_t res_tmp = (int64_t)a * tw;      // A(20, 43)
if (res_tmp == ((int64_t)1 << 62)) // you might want to neglect this edge case
    --res_tmp;                          // A(19, 43)
int32_t res = (int32_t)(res_tmp >> 31); // A(19, 12)

Answer 2

Your question seems to assume that there is a single right way to perform the operations you are interested in, but you are explicitly asking about some of the details that direct how the operations should be performed. Perhaps this is the kernel of your confusion.

res = a*tw

a is represented as A(15,16) and tw is represented as A(2,29), so the its natural representation of their product A(18,45). You need more value bits (as many bits as the two factors have combined) to maintain full precision. A(18,45) is how you should interpret the result of widening your int s to a 64-bit signed integer type (eg int64_t ) and computing their product.

If you don't actually need or want 45 bits of fraction, then you can indeed round that to A(18,13) (or to A(18+x,13-x) for any non-negative x) without changing the magnitude of the result. That does requiring scaling. I would probably implement it like this:

/*
 * Computes a magnitude-preserving fixed-point product of any two signed
 * fixed-point numbers with a combined 31 (or fewer) value bits.  If x
 * is represented as A(s,t) and y is represented as A(u,v),
 * where s + t == u + v == 31, then the representation of the result is
 * A(s + u + 1, t + v - 32).
 */
int32_t fixed_product(int32_t x, int32_t y) {
    int64_t full_product = (int64_t) x * (int64_t) y;
    int32_t truncated = full_product / (1U << 31);
    int round_up = ((uint32_t) full_product) >> 31;

    return truncated + round_up;
}

That avoids several potential issues and implementation-defined characteristics of signed integer arithmetic. It assumes that you want the results to be in a consistent format (that is, depending only on the formats of the inputs, not on their actual values), without overflowing.

a + res

Addition is actually a little harder if you cannot rely on the operands to initially have the same scale. You need to rescale so that they match before you can perform the addition. In the general case, you may not be able to do that without rounding away some precision.

In your case, you start with one A(15,16) and one A(18,13). You can compute an intermediate result in A(19,16) or wider (presumably A(47,16) in practice) that preserves magnitude without losing any precision, but if you want to represent that in 32 bits then the best you can do without risk of changing the magnitude is A(19,11). That would be this:

int32_t a_plus_res(int32_t a, int32_t res) {
    int64_t res16 = ((int64_t) res) * (1 << 3);
    int64_t sum16 = a + res16;
    int round_up = (((uint32_t) sum16) >> 4) & 1;

    return (int32_t) ((sum16 / (1 << 5)) + round_up);
}

A generic version would need to accept the scales of the operands' representations as additional arguments. Such a thing is possible, but the above is enough to chew on as it is.

All of the foregoing assumes that the fixed-point format for each operand and result is constant. That is more or less the distinguishing feature of fixed-point, differentiating it from floating-point formats on one hand and from arbitrary-precision formats on the other. You do, however, have the alternative of allowing formats to vary, and tracking them with a separate variable per value. That would be basically a hybrid of fixed-point and arbitrary-precision formats, and it would be messier.

Additionally, the foregoing assumes that overflow must be avoided at all costs. It would also be possible to instead put operands and results on a consistent scale; this would make addition simpler and multiplication more complicated, and it would afford the possibility of arithmetic overflow. That might nevertheless be acceptable if you have reason to believe that such overflow is unlikely for your particular data.

understanding Fixed point arithmetic

Question

2 answers

solution1
1 ACCPTED 2016-06-27 15:46:28

solution2
0 2016-06-27 16:28:45

understanding Fixed point arithmetic

Question

2 answers

solution1 1 ACCPTED 2016-06-27 15:46:28

solution2 0 2016-06-27 16:28:45

solution1
1 ACCPTED 2016-06-27 15:46:28

solution2
0 2016-06-27 16:28:45