Naive Bayes text classification: is the calculation better done in MySQL or in Java?

Question

The class conditional probability in Naive Bayes is calculated as:

P(t|c) = Log2((n1+1)/(n2+n3))

Where

t = a token; c = a class
n1 = number of occurrences of token t in class c
n2 = total number of tokens in class c
n3 = total number of tokens across all classes
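
For reference, the formula above can be sketched in Java roughly as follows. This is only an illustration of the arithmetic as written in the question. The class and method names (`ClassConditionalScore`, `log2Score`) are made up for this example, and since Java has no built-in base-2 logarithm, the sketch divides the natural log by `Math.log(2.0)`.

```java
// Hypothetical helper, not part of any library: evaluates the
// question's formula log2((n1 + 1) / (n2 + n3)) for given counts.
final class ClassConditionalScore {
    private ClassConditionalScore() {}

    static double log2Score(long n1, long n2, long n3) {
        if (n1 < 0 || n2 < 0 || n3 < 0) {
            throw new IllegalArgumentException("counts must be non-negative");
        }
        long denominator = n2 + n3;
        if (denominator <= 0) {
            throw new IllegalArgumentException("n2 + n3 must be positive");
        }
        // The +1 keeps the ratio above zero for unseen tokens,
        // so the logarithm is always defined.
        double ratio = (n1 + 1.0) / denominator;
        // Change of base: log2(x) = ln(x) / ln(2).
        return Math.log(ratio) / Math.log(2.0);
    }
}
```

For example, with n1 = 1, n2 = 3 and n3 = 5 the ratio is 2/8 = 0.25, so `log2Score(1, 3, 5)` evaluates to approximately -2.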

Which is faster: doing the calculation in MySQL or in Java? (For Java, of course, the data first has to be fetched from MySQL.)

Answer 1

The Naive Bayes classifier is computationally simple, but it requires a lot of data manipulation. When applied to text, you are generally looking for many different terms in the text.
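
To give a sense of the data manipulation involved, here is a minimal sketch of what the application-side bookkeeping might look like in Java, assuming the (class, token) observations have already been fetched and fit in memory. The class `TokenCounts` and its methods are hypothetical names for this illustration, and the labels used in the usage note are invented. It maintains the three counts from the question: n1 per (class, token) pair, n2 per class, and n3 overall.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical in-memory counter; assumes the corpus fits in RAM.
final class TokenCounts {
    // class -> (token -> occurrences); source of n1
    private final Map<String, Map<String, Long>> perClassToken = new HashMap<>();
    // class -> total tokens in that class; source of n2
    private final Map<String, Long> perClassTotal = new HashMap<>();
    // total tokens across all classes; n3
    private long grandTotal = 0;

    // Record one occurrence of a token in a document of the given class.
    void add(String label, String token) {
        perClassToken
            .computeIfAbsent(label, k -> new HashMap<>())
            .merge(token, 1L, Long::sum);
        perClassTotal.merge(label, 1L, Long::sum);
        grandTotal++;
    }

    // n1: occurrences of the token in the class (0 if never seen).
    long n1(String token, String label) {
        Map<String, Long> tokens = perClassToken.get(label);
        return tokens == null ? 0L : tokens.getOrDefault(token, 0L);
    }

    // n2: total tokens in the class.
    long n2(String label) {
        return perClassTotal.getOrDefault(label, 0L);
    }

    // n3: total tokens in all classes.
    long n3() {
        return grandTotal;
    }
}
```

Each row read from the database would be passed to `add`, for instance `counts.add("spam", "free")`. Every token occurrence has to cross from the database to the application in this approach, which is the cost to weigh against aggregating inside the database.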

I have a natural bias toward doing these types of calculations in SQL. I would at least argue that MySQL is a reasonable environment for doing this. Depending on the exact nature of the problem and the structure of your data, you might find that full text indexing is helpful. I would be wary of working with a large corpus (many tens or hundreds of gigabytes) on the application side. My book "Data Analysis Using SQL and Excel" has a chapter devoted to Naive Bayes and similar types of models.

Answer 1 posted 2014-02-28 15:12:13