Need floating-point precision using unsigned integers

Question

I'm working with a microchip that doesn't have room for floating-point precision. However, I need to account for fractional values during some equations. So far I've had good luck using the old *100 -> /100 method like so:

increment = (short int)(((value1 - value2)*100 / totalSteps));

// later in the code I loop through the number of totalSteps,
// adding back the increment to arrive at the total I want
// at the precise time I need it.
newValue = oldValue + (increment / 100);

This works great for values from 0-255 divided by a totalSteps of up to 300. Beyond 300, the fractional digits to the right of the decimal point become important, because they add up over time.

I'm curious if anyone has a better way to preserve decimal accuracy within an integer paradigm. I tried using *1000 /1000, but that didn't work at all.

Thank you in advance.

Answer 1

Representing fractions with integers is called fixed-point math.

Try Googling "fixed point".

Fixed-point tips and tricks are beyond the scope of an SO answer...

Example: 5-tap FIR filter

// C holds the filter coefficients in 2.8 fixed-point format:
// the 2 MSBs (of 10) are the integer part and the 8 LSBs (of 10) are the fraction part.
// The actual fraction precision here is 1/256.

int FIR_5(int* in,    // input samples
          int inPrec, // sample fraction precision
          int* c,     // filter coefficients
          int cPrec)  // coefficients fraction precision
{
    const int coefHalf = (cPrec > 0) ? 1 << (cPrec - 1) : 0; // value of 0.5 using cPrec
    int sum = 0; 
    for ( int i = 0; i < 5; ++i )
    {
        sum += in[i] * c[i];
    }

    // sum's precision is X.N. where N = inPrec + cPrec;
    // return to original precision (inPrec)
    sum = (sum + coefHalf) >> cPrec; // adding coefHalf for rounding
    return sum;
}

int main()
{
    const int filterPrec = 8;
    int C[5] = { 8, 16, 208, 16, 8 }; // 1.0 == 256 in 2.8 fixed point. Filter values are 8/256, 16/256, 208/256, etc.
    int W[5] = { 10, 203, 40, 50, 72}; // A sampling window (example)
    int res = FIR_5(W, 0, C, filterPrec);
    return 0;
}

Notes:

In the above example:

the samples are integers (no fraction)
the coefficients have 8-bit fractions.
An 8-bit fraction means that each change of 1 is treated as 1/256, since 1 << 8 == 256.
A useful notation is Y.Xu or Y.Xs, where Y is the number of bits allocated to the integer part and X the number allocated to the fraction; u/s denote unsigned/signed.
When multiplying two fixed-point numbers, their precisions (numbers of fraction bits) add.
Example: A is 0.8u, B is 0.2u. C = A*B. C is 0.10u.
When dividing, use a shift operation to lower the result precision. The amount of shifting is up to you. Before lowering precision, it's better to add a half to reduce the error.
Example: A = 129 in 0.8u, which is a little over 0.5 (129/256). We want the integer part, so we right-shift it by 8. Before that we add a half, which is 128 (1 << 7). So A = (A + 128) >> 8 --> 1.
Without adding a half you'll get a larger error in the final result.
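
The multiplication and rounding notes above can be sketched in a few lines of C. This is only an illustration: the helper names mul_0_8 and round_0_8 are made up for this example, and it assumes unsigned 0.8 values stored in uint8_t.

```c
#include <stdint.h>

/* Hypothetical helper: multiply two 0.8u values.
   The raw product has 8 + 8 = 16 fraction bits (0.16u), so add a half
   (1 << 7) and shift right by 8 to return to 0.8u. */
uint8_t mul_0_8(uint8_t a, uint8_t b)
{
    uint16_t product = (uint16_t)a * b;        /* max 255 * 255 = 65025, fits in 16 bits */
    return (uint8_t)((product + 128u) >> 8);   /* round, then drop 8 fraction bits */
}

/* Hypothetical helper: round a 0.8u value to its integer part (0 or 1). */
uint8_t round_0_8(uint8_t a)
{
    return (uint8_t)(((uint16_t)a + 128u) >> 8);
}
```

For instance, mul_0_8(128, 128) multiplies 0.5 by 0.5 and yields 64, i.e. 64/256 = 0.25, and round_0_8(129) yields 1 as in the example above.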

Answer 2

Don't use this approach.

New paradigm: do not accumulate using FP math or fixed-point math. Do your accumulation and other equations with integer math. Any time you need a scaled value, divide by your scale factor (100), but do the "add up" part with the raw, unscaled values.
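
One possible reading of this advice, as a rough sketch: accumulate the raw difference in a wide integer and divide only when a value is needed, so no rounding error is carried between steps. The Ramp type and function names are hypothetical, and the sketch assumes values of 0-255 (as in the question) and a nonzero total_steps.

```c
#include <stdint.h>

/* Hypothetical state for a ramp from start to end over total_steps steps. */
typedef struct {
    uint8_t  start;
    int16_t  diff;         /* end - start, unscaled */
    uint16_t total_steps;  /* assumed nonzero */
    int32_t  acc;          /* running sum of raw differences */
} Ramp;

void ramp_init(Ramp *r, uint8_t start, uint8_t end, uint16_t total_steps)
{
    r->start = start;
    r->diff = (int16_t)((int16_t)end - (int16_t)start);
    r->total_steps = total_steps;
    r->acc = 0;
}

/* Advance one step and return the current value.
   The division happens only here, on the exact accumulated sum. */
uint8_t ramp_next(Ramp *r)
{
    r->acc += r->diff;
    return (uint8_t)((int32_t)r->start + r->acc / (int32_t)r->total_steps);
}
```

The cost is one 32-bit division per step, which may or may not be acceptable on a small chip.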

Answer 3

Here's a quick attempt at a precise rational (Bresenham-esque) version of the interpolation, if you truly cannot afford to interpolate directly at each step.

// Requires <stdlib.h> for div(); source, target and num_steps are ints defined elsewhere.
div_t frac_step = div(target - source, num_steps);
if(frac_step.rem < 0) {
    // Annoying special case: div() rounds towards zero, so convert to a
    // floored quotient with a non-negative remainder.
    // Alternatively check for the error term slipping to < -num_steps as well
    frac_step.rem += num_steps;
    --frac_step.quot;
}

int error = 0;
int steps_left = num_steps; // separate counter so num_steps stays a fixed denominator

do {
    // Add the integer term plus an accumulated fraction
    error += frac_step.rem;
    if(error >= num_steps) {
        // Time to carry
        error -= num_steps;
        ++source;
    }
    source += frac_step.quot;
} while(--steps_left);

A major drawback compared to the fixed-point solution is that the fractional term gets rounded off between iterations if you are using the function to continually walk towards a moving target at differing step lengths.

Oh, and for the record, your original code does not seem to accumulate the fractions properly when stepping: e.g. a 1/100 increment will always be truncated to 0 in the addition, no matter how many times the step is taken. Instead you really want to add the increment to a higher-precision fixed-point accumulator and then divide it by 100 (or preferably right-shift to divide by a power of two) each iteration in order to compute the integer "position".

Do take care with the different integer types and ranges required in your calculations. A multiplication by 1000 will overflow a 16-bit integer unless one term is a long. Go through your calculations and keep track of input ranges and the headroom at each step, then select your integer types to match.
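
A minimal sketch of such an accumulator, assuming 16 fraction bits in a 32-bit integer, values of 0-255 and a nonzero step count. FRAC_BITS, FixedRamp and the function names are invented for this illustration and are not taken from the question's code.

```c
#include <stdint.h>

#define FRAC_BITS 16  /* assumed precision: 16 fraction bits */

/* Hypothetical fixed-point ramp state. */
typedef struct {
    int32_t acc;        /* current position, scaled by 2^FRAC_BITS */
    int32_t increment;  /* per-step change, scaled by 2^FRAC_BITS */
} FixedRamp;

void fixed_ramp_init(FixedRamp *r, uint8_t start, uint8_t end, uint16_t total_steps)
{
    r->acc = (int32_t)start * ((int32_t)1 << FRAC_BITS);
    r->increment = (((int32_t)end - (int32_t)start) * ((int32_t)1 << FRAC_BITS))
                   / (int32_t)total_steps;
}

/* Add the scaled increment, then round and shift to get the integer position.
   The fraction stays in acc, so small increments are not truncated to 0. */
uint8_t fixed_ramp_step(FixedRamp *r)
{
    r->acc += r->increment;
    return (uint8_t)((r->acc + ((int32_t)1 << (FRAC_BITS - 1))) >> FRAC_BITS);
}
```

With these assumptions, a ramp from 0 to 10 over 1000 steps (an increment of 1/100 per step) still arrives at 10, because only the returned position is rounded, never the accumulator.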

Answer 4

Maybe you can simulate floating-point behaviour by storing values according to the IEEE 754 specification.

So you save the mantissa, exponent, and sign as unsigned int values.

For calculations you then operate on the mantissa and exponent with integer and bitwise operations. Multiplication and division can be reduced to shifts and additions.

I think it is a lot of programming work to emulate that, but it should work.
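
To give a flavour of what this involves, here is a deliberately simplified sketch of a software multiply. It is not the IEEE 754 layout (no implicit leading bit, no exponent bias, no special values or rounding modes); the SoftFloat type and soft_mul function are hypothetical names used only for illustration.

```c
#include <stdint.h>

/* Hypothetical simplified format: value = (-1)^sign * mantissa * 2^exponent. */
typedef struct {
    uint8_t  sign;      /* 0 = positive, 1 = negative */
    int16_t  exponent;  /* power of two */
    uint16_t mantissa;
} SoftFloat;

/* Multiply: mantissas multiply, exponents add, signs XOR.
   The 32-bit product is shifted right until it fits in 16 bits again,
   bumping the exponent for each bit dropped (truncating, not rounding). */
SoftFloat soft_mul(SoftFloat a, SoftFloat b)
{
    SoftFloat r;
    uint32_t m = (uint32_t)a.mantissa * b.mantissa;
    int16_t e = (int16_t)(a.exponent + b.exponent);
    while (m > 0xFFFFu) {
        m >>= 1;
        ++e;
    }
    r.sign = (uint8_t)(a.sign ^ b.sign);
    r.exponent = e;
    r.mantissa = (uint16_t)m;
    return r;
}
```

Addition is more involved, since the exponents have to be aligned first, which is part of why a full emulation is a lot of work.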

Answer 5

Your choice of type is the problem: short int is likely to be 16 bits wide. That's why large multipliers don't work: you're limited to +/-32767. Use a 32-bit long int, assuming that your compiler supports it. What chip is it, by the way, and what compiler?
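
As a rough sketch of the widening, assuming the compiler provides <stdint.h> and that the inputs are 0-255 as in the question (the function name scaled_increment is made up for this example):

```c
#include <stdint.h>

/* Hypothetical helper: per-step increment scaled by 1000.
   The cast to int32_t happens before the multiplication, so the
   intermediate 255 * 1000 = 255000 does not overflow a 16-bit int.
   total_steps is assumed to be nonzero. */
int32_t scaled_increment(uint8_t value1, uint8_t value2, uint16_t total_steps)
{
    return (((int32_t)value1 - (int32_t)value2) * 1000) / (int32_t)total_steps;
}
```

Note that the result must also be stored in a 32-bit variable; casting it back to short int, as the original code does, would reintroduce the truncation for larger results.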

Answer 1
2 2013-12-11 13:34:40

Answer 2
1 2013-12-11 16:20:06

Answer 3
1 2013-12-11 16:29:44

Answer 4
0 2013-12-11 13:39:17

Answer 5
0 2013-12-11 16:26:09
