Interpreting conditional probabilities returned by naiveBayes classifier in e1071:R

I am working on a classification solution using the following process:

a. Perform Naive Bayes classification in R using e1071.

b. Get the a-priori table and the conditional probability tables.

c. Use those values for prediction in a PL/SQL program within an application, i.e. the eventual prediction will not involve the R predict function.

In step b, I am seeing negative and greater-than-1 conditional probabilities returned by R after model generation. Are they really conditional probabilities?

To illustrate the issue, here are two data sets: one that I am able to interpret and one that I am not.

Data set 1: Fruit identification (I saw this in a nice Naive Bayes illustration on this forum)

Data Frame Fruit_All:

Long  Sweet  Yellow  Fruit
Yes   Yes    Yes     Banana
Yes   Yes    Yes     Banana
Yes   Yes    Yes     Banana
Yes   Yes    Yes     Banana
No    Yes    Yes     Banana
No    Yes    Yes     Orange
No    Yes    Yes     Orange
No    Yes    Yes     Orange
Yes   Yes    Yes     Other
No    Yes    No      Other
Yes   Yes    Yes     Banana
Yes   Yes    Yes     Banana
Yes   No     Yes     Banana
Yes   No     No      Banana
No    No     Yes     Banana
No    No     Yes     Orange
No    No     Yes     Orange
No    No     Yes     Orange
Yes   Yes    No      Other
No    No     No      Other

Performing Naive Bayes classification:

  `NB.fit <- naiveBayes(Fruit~., data=Fruit_All,laplace=0)`

where Fruit is the class column and Fruit_All is the complete data frame.

The conditional probabilities returned in NB.fit are exactly as expected.

Also, the probabilities in each row add up to 1, e.g. 0.1 + 0.9 for Banana in the Yellow table.

Conditional probabilities:

        Long        
Y         No Yes        
  Banana 0.2 0.8        
  Orange 1.0 0.0        
  Other  0.5 0.5        

        Sweet       
Y          No  Yes      
  Banana 0.30 0.70      
  Orange 0.50 0.50      
  Other  0.25 0.75      

        Yellow      
Y          No  Yes      
  Banana 0.10 0.90      
  Orange 0.00 1.00      
  Other  0.75 0.25      

A-priori probabilities:         

Banana Orange  Other            
   0.5    0.3    0.2    
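As a cross-check, the tables above can be reproduced from the raw data with base R alone. This is only a sketch: the data frame is retyped from the listing above, and `prior` and `cond` are illustrative names, not objects returned by e1071.

```r
# Fruit_All retyped from the listing above (20 rows, same order)
Fruit_All <- data.frame(
  Long   = c(rep("Yes", 4), rep("No", 4), "Yes", "No",
             rep("Yes", 4), rep("No", 4), "Yes", "No"),
  Sweet  = c(rep("Yes", 12), rep("No", 6), "Yes", "No"),
  Yellow = c(rep("Yes", 9), "No", rep("Yes", 3), "No",
             rep("Yes", 4), "No", "No"),
  Fruit  = rep(c(rep("Banana", 5), rep("Orange", 3), rep("Other", 2)), 2)
)

# A-priori probabilities: relative frequency of each class
prior <- prop.table(table(Fruit_All$Fruit))

# Conditional tables: for each predictor, the row-wise proportions
# of its levels within each class (each row sums to 1)
cond <- lapply(Fruit_All[c("Long", "Sweet", "Yellow")],
               function(x) prop.table(table(Fruit_All$Fruit, x), margin = 1))

prior
cond$Yellow
```

Because every row is a distribution over the levels of one factor, the rows sum to 1, which is the property the question observes for this data set.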

I can use the above to easily write code that predicts the outcome for a given input, e.g. for Long, Sweet and Yellow all equal to Yes.

The predicted fruit is the one for which this product is largest:

P(Long|Fruit) * P(Sweet|Fruit) * P(Yellow|Fruit) * apriori P(Fruit)
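The product above can be sketched in base R with the numbers copied from the printed tables. The names `prior`, `cond` and `score_fruit` are made up for this illustration; the same lookups and multiplications could be ported to PL/SQL.

```r
# A-priori probabilities copied from the model output
prior   <- c(Banana = 0.5, Orange = 0.3, Other = 0.2)
classes <- names(prior)

# Helper to build one conditional table (rows: class, columns: level)
make_table <- function(values) {
  matrix(values, nrow = 3, byrow = TRUE,
         dimnames = list(classes, c("No", "Yes")))
}

# Conditional probability tables copied from the model output
cond <- list(
  Long   = make_table(c(0.20, 0.80, 1.00, 0.00, 0.50, 0.50)),
  Sweet  = make_table(c(0.30, 0.70, 0.50, 0.50, 0.25, 0.75)),
  Yellow = make_table(c(0.10, 0.90, 0.00, 1.00, 0.75, 0.25))
)

# Multiply the prior by P(value | class) for every observed predictor
score_fruit <- function(obs) {
  score <- prior
  for (v in names(obs)) {
    score <- score * cond[[v]][classes, obs[[v]]]
  }
  score
}

obs   <- c(Long = "Yes", Sweet = "Yes", Yellow = "Yes")
score <- score_fruit(obs)
score                     # unnormalised products per class
score / sum(score)        # the same values normalised to sum to 1
names(which.max(score))   # "Banana"
```

With laplace=0, a zero entry in any table (here P(Long = Yes | Orange) = 0) forces the whole product for that class to zero.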

Data set 2: the Iris data set available in R

  `NB.fit <- naiveBayes(Species~., data=iris)`

Conditional probabilities:

            Sepal.Length
Y             [,1]      [,2]
  setosa     5.006 0.3524897
  versicolor 5.936 0.5161711
  virginica  6.588 0.6358796

            Sepal.Width
Y             [,1]      [,2]
  setosa     3.428 0.3790644
  versicolor 2.770 0.3137983
  virginica  2.974 0.3224966

            Petal.Length
Y             [,1]      [,2]
  setosa     1.462 0.1736640
  versicolor 4.260 0.4699110
  virginica  5.552 0.5518947

            Petal.Width
Y             [,1]      [,2]
  setosa     0.246 0.1053856
  versicolor 1.326 0.1977527
  virginica  2.026 0.2746501

In this case, the same function doesn't seem to be returning conditional probabilities, as some of the values are greater than 1 and none of the rows add up to 1.

Note: If I use the predict function in R, I get correct predictions for Iris.

I understand the Iris data set is a bit different, as the variables are continuous numeric values and not factors, unlike the fruit example.

For other, more complex data sets, I even see negative values among the conditional probabilities returned by the classifier, though the final result within R is fine.

Questions:

Are the conditional probabilities returned for the Iris data set really conditional probabilities?

Will the same product maximization I did in the fruit example hold for Iris, and even for data sets where the conditional probabilities are negative?

Is it possible to write a custom prediction function based on the Iris conditional probability tables?

This answer is just about a year late, but I just stumbled upon the question. As you write, the predictors are numeric and are therefore treated differently than factors. What you get are the means (first column) and standard deviations (second column) of the conditional Gaussian distributions. Thus, for

            Petal.Width
Y             [,1]      [,2]
  setosa     0.246 0.1053856

the mean Petal.Width for setosa is 0.246 and the standard deviation is about 0.105. You can verify this with:

> library(dplyr)  # provides %>%, filter() and summarize()
> iris %>% filter(Species == "setosa") %>%
+   summarize(mean(Petal.Width), sd(Petal.Width))
  mean(Petal.Width) sd(Petal.Width)
1             0.246       0.1053856

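So the tables for numeric predictors hold distribution parameters, not conditional probabilities, which is why the values can exceed 1 and the rows do not sum to 1. A negative entry on other data sets would be a mean, since a standard deviation cannot be negative. For prediction, the Gaussian density evaluated at the observed value takes the place of P(feature|class) in Bayes' formula, and normalising the resulting products over the classes gives proper posterior probabilities.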
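A custom prediction function along these lines can be sketched in base R. This is an illustration under stated assumptions, not e1071's own code: `predict_gaussian_nb` and the other names are made up, and any safeguards in e1071's predict method (for example for very small probabilities) are not reproduced.

```r
features <- c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")
classes  <- levels(iris$Species)

# Per-class mean and standard deviation of every feature
# (rows: class, columns: feature), i.e. columns [,1] and [,2] of the tables
means <- sapply(features, function(f) tapply(iris[[f]], iris$Species, mean))
sds   <- sapply(features, function(f) tapply(iris[[f]], iris$Species, sd))

# A-priori class probabilities
n     <- table(iris$Species)
prior <- setNames(as.vector(n) / sum(n), names(n))

predict_gaussian_nb <- function(x) {
  x <- unlist(x)[features]   # accept a named vector or a one-row data frame
  # log prior + sum of log Gaussian densities, one score per class
  log_score <- sapply(classes, function(k) {
    log(prior[[k]]) + sum(dnorm(x, mean = means[k, ], sd = sds[k, ], log = TRUE))
  })
  post <- exp(log_score - max(log_score))   # subtract the max for stability
  post / sum(post)                          # normalise over the classes
}

post <- predict_gaussian_nb(iris[1, features])
post
names(which.max(post))   # "setosa"
```

Working with sums of log densities instead of raw products avoids underflow when there are many predictors. The individual `dnorm` values are densities rather than probabilities, so a value above 1 is not an error. Only the normalised result is a probability.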
