Problem with Precision floating point operation in C

For one of my course projects I started implementing a "Naive Bayesian classifier" in C. My project is to implement a document classifier application (especially for spam) using huge training data.

Now I have a problem implementing the algorithm because of the limitations of C's datatypes.

(The algorithm I am using is given here: http://en.wikipedia.org/wiki/Bayesian_spam_filtering)

PROBLEM STATEMENT: The algorithm involves taking each word in a document and calculating the probability of it being a spam word. If p1, p2, p3, ..., pn are the probabilities of word-1, 2, 3, ..., n, then the probability of the doc being spam or not is calculated using

  p = (p_1 * p_2 * ... * p_n) / (p_1 * p_2 * ... * p_n + (1-p_1) * (1-p_2) * ... * (1-p_n))

Here, a probability value can very easily be around 0.01, so even if I use the datatype "double" my calculation will go for a toss. To confirm this I wrote the sample code given below.

#include <stdio.h>

#define PROBABILITY_OF_UNLIKELY_SPAM_WORD   (0.01)
#define PROBABILITY_OF_MOSTLY_SPAM_WORD     (0.99)

int main()
{
    int index;
    long double numerator = 1.0;
    long double denom1 = 1.0, denom2 = 1.0;
    long double doc_spam_prob;

    /* Simulating a FEW unlikely spam words */
    for (index = 0; index < 162; index++)
    {
        numerator = numerator * (long double)PROBABILITY_OF_UNLIKELY_SPAM_WORD;
        denom2    = denom2 * (long double)PROBABILITY_OF_UNLIKELY_SPAM_WORD;
        denom1    = denom1 * (long double)(1 - PROBABILITY_OF_UNLIKELY_SPAM_WORD);
    }
    /* Simulating lots of mostly definite spam words */
    for (index = 0; index < 1000; index++)
    {
        numerator = numerator * (long double)PROBABILITY_OF_MOSTLY_SPAM_WORD;
        denom2    = denom2 * (long double)PROBABILITY_OF_MOSTLY_SPAM_WORD;
        denom1    = denom1 * (long double)(1 - PROBABILITY_OF_MOSTLY_SPAM_WORD);
    }
    doc_spam_prob = numerator / (denom1 + denom2);
    /* prints 0 or NaN once the intermediate products underflow */
    printf("doc_spam_prob = %Lf\n", doc_spam_prob);
    return 0;
}

I tried float, double, and even long double datatypes, but still the same problem.

Hence, say in a 100K-word document I am analyzing, if just 162 words have a 1% spam probability and the remaining 99838 are conspicuously spam words, my app will still classify it as a Not Spam doc because of the precision error (the numerator easily goes to ZERO)!

This is the first time I am hitting such an issue. So how exactly should this problem be tackled?

This happens often in machine learning. AFAIK, there's nothing you can do about the loss in precision. So to bypass this, we use the log function and convert divisions and multiplications to subtractions and additions, respectively.

So I decided to do the math.

The original equation is:

  p = (p_1 * p_2 * ... * p_n) / (p_1 * p_2 * ... * p_n + (1-p_1) * (1-p_2) * ... * (1-p_n))

I slightly modify it:

  1/p = 1 + ((1-p_1) * (1-p_2) * ... * (1-p_n)) / (p_1 * p_2 * ... * p_n)

Taking logs on both sides:

  ln(1/p - 1) = Σ [ln(1-p_i) - ln(p_i)]

Let,

  η = Σ [ln(1-p_i) - ln(p_i)]

Substituting,

  ln(1/p - 1) = η

Hence the alternate formula for computing the combined probability:

  p = 1 / (1 + e^η)

If you need me to expand on this, please leave a comment.
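As a rough illustration, here is a minimal C sketch of the log-domain computation (the function name and the hard-coded word counts, which mirror the question's sample code, are my own):

#include <math.h>
#include <stdio.h>

/* Combined spam probability computed in the log domain.
   probs[] holds per-word spam probabilities, all strictly in (0,1). */
double combined_spam_prob(const double *probs, int n)
{
    double eta = 0.0;
    for (int i = 0; i < n; i++)
        eta += log(1.0 - probs[i]) - log(probs[i]);  /* sum of per-word log-odds */
    return 1.0 / (1.0 + exp(eta));
}

int main(void)
{
    double probs[1162];
    int i;
    for (i = 0; i < 162; i++)  probs[i] = 0.01;  /* few unlikely spam words */
    for (; i < 1162; i++)      probs[i] = 0.99;  /* many definite spam words */
    printf("%f\n", combined_spam_prob(probs, 1162));
    return 0;
}

Here η comes out as a large negative number, exp(η) underflows harmlessly to 0, and the program prints 1.000000 (spam) instead of the NaN or zero the direct product produces.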

Here's a trick:

For the sake of readability, let S := p_1 * ... * p_n and H := (1-p_1) * ... * (1-p_n); then we have:

  p = S / (S + H)
  p = 1 / ((S + H) / S)
  p = 1 / (1 + H / S)

Let's expand again:

  p = 1 / (1 +  ((1-p_1) * ... * (1-p_n)) / (p_1 * ... * p_n))
  p = 1 / (1 + (1-p_1)/p_1 * ... * (1-p_n)/p_n)

So basically, you will obtain a product of quite large numbers (each factor lies between 0 and, for p_i = 0.01, 99). The idea is not to multiply tons of small numbers with one another to obtain, well, 0, but to make a quotient of two small numbers. For example, if n = 1000000 and p_i = 0.5 for all i, the above method will give you 0/(0+0), which is NaN, whereas the proposed method will give you 1/(1 + 1*...*1), which is 0.5.

You can get even better results when all p_i are sorted and you pair them up in opposed order (let's assume p_1 < ... < p_n); then the following formula will get even better precision:

  p = 1 / (1 + (1-p_1)/p_n * ... * (1-p_n)/p_1)

that way you divide big numerators (small p_i) by big denominators (big p_(n+1-i)), and small numerators by small denominators.

edit: MSalters proposed a useful further optimization in his answer. Using it, the formula reads as follows:

  p = 1 / (1 + (1-p_1)/p_n * (1-p_2)/p_(n-1) * ... * (1-p_(n-1))/p_2 * (1-p_n)/p_1)
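A minimal sketch of this quotient-of-opposed-pairs idea in C (sorting with qsort and pairing the i-th smallest p with the i-th largest, as in the formula above; the function names are illustrative):

#include <stdio.h>
#include <stdlib.h>

static int cmp_double(const void *a, const void *b)
{
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

/* p = 1 / (1 + prod_i (1-p_i)/p_(n+1-i)), computed on probs sorted ascending
   so that big numerators meet big denominators and vice versa. */
double combined_spam_prob(double *probs, int n)
{
    double product = 1.0;
    qsort(probs, n, sizeof probs[0], cmp_double);
    for (int i = 0; i < n; i++)
        product *= (1.0 - probs[i]) / probs[n - 1 - i];
    return 1.0 / (1.0 + product);
}

int main(void)
{
    double probs[] = { 0.99, 0.01, 0.99, 0.01 };
    printf("%f\n", combined_spam_prob(probs, 4));  /* balanced evidence: 0.5 */
    return 0;
}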

Your problem is caused by collecting too many terms without regard for their size. One solution is to take logarithms. Another is to sort your individual terms. First, let's rewrite the equation as 1/p = 1 + ∏((1-p_i)/p_i). Now your problem is that some of the terms are small, while others are big. If you have too many small terms in a row, you'll underflow, and with too many big terms you'll overflow the intermediate result.

So, don't put too many of the same order in a row. Sort the terms (1-p_i)/p_i. As a result, the first will be the smallest term, the last the biggest. Now, if you multiplied them straight away you would still have an underflow. But the order of calculation doesn't matter. Use two iterators into your temporary collection. One starts at the beginning (i.e. (1-p_0)/p_0), the other at the end (i.e. (1-p_n)/p_n), and your intermediate result starts at 1.0. Now, when your intermediate result is >= 1.0, you take a term from the front, and when your intermediate result is < 1.0, you take a term from the back.

The result is that as you take terms, the intermediate result will oscillate around 1.0. It will only go up or down as you run out of small or big terms. But that's OK. At that point, you've consumed the extremes on both ends, so the intermediate result will slowly approach the final result.

There's of course a real possibility of overflow. If the input is completely unlikely to be spam (p = 1E-1000) then 1/p will overflow, because ∏((1-p_i)/p_i) overflows. But since the terms are sorted, we know that the intermediate result will overflow only if ∏((1-p_i)/p_i) overflows. So, if the intermediate result overflows, there's no subsequent loss of precision.
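Assuming the terms (1-p_i)/p_i have already been computed and sorted ascending, a C sketch of the two-iterator scheme could look like this (the names are mine, not from the answer):

#include <stdio.h>

/* Multiply sorted terms while steering the running product back toward 1.0:
   take a small term from the front when we are >= 1.0, a big one from the
   back when we are < 1.0. terms[] must be sorted ascending. */
double balanced_product(const double *terms, int n)
{
    double result = 1.0;
    int lo = 0, hi = n - 1;
    while (lo <= hi)
    {
        if (result >= 1.0)
            result *= terms[lo++];  /* small term pulls the product down */
        else
            result *= terms[hi--];  /* big term pushes it back up */
    }
    return result;  /* equals prod (1-p_i)/p_i; then p = 1/(1 + result) */
}

int main(void)
{
    /* terms for p_i in {0.99, 0.99, 0.01, 0.01}, sorted ascending */
    double terms[] = { 0.01/0.99, 0.01/0.99, 0.99/0.01, 0.99/0.01 };
    printf("%f\n", 1.0 / (1.0 + balanced_product(terms, 4)));  /* ~0.5 */
    return 0;
}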

Try computing the inverse 1/p. That gives you an equation of the form 1 + ((1-p1) * (1-p2) * ...) / (p1 * p2 * ...).

If you then count the occurrence of each probability -- it looks like you have a small number of values that recur -- you can use the pow() function -- pow(1-p, occurrences_of_p) * pow(1-q, occurrences_of_q) -- and avoid individual roundoff with each multiplication.
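For instance, with the question's own counts (162 words at 0.01 and 1000 at 0.99), a sketch of the pow() idea might read:

#include <math.h>
#include <stdio.h>

int main(void)
{
    /* One powl() call per distinct probability instead of 1162 multiplications;
       this avoids the accumulated roundoff, though the exponent range of
       long double still bounds how far it can go. */
    long double num    = powl(0.01L, 162) * powl(0.99L, 1000);
    long double denom1 = powl(0.99L, 162) * powl(0.01L, 1000);
    printf("%Lf\n", num / (num + denom1));
    return 0;
}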

You can use the probability in percent or per mille:

doc_spam_prob= (numerator*100/(denom1+denom2));

or

doc_spam_prob= (numerator*1000/(denom1+denom2));

or use some other coefficient.

I am not strong in math, so I cannot comment on possible simplifications to the formula that might eliminate or reduce your problem. However, I am familiar with the precision limitations of long double types and am aware of several arbitrary- and extended-precision math libraries for C. Check out:

http://www.nongnu.org/hpalib/ and http://www.tc.umn.edu/~ringx004/mapm-main.html
