Naive Bayes text classification calculation: better to do in MySQL or Java?

The calculation for the class-conditional probability in Naive Bayes is:

P(t|c) = Log2((n1+1)/(n2+n3))

where

  1. t = token x; c = class x
  2. n1 = number of occurrences of token x in class x
  3. n2 = number of all tokens in class x
  4. n3 = number of all tokens in all classes
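For reference, here is a minimal Java sketch of the formula exactly as it is written above. The class and method names are hypothetical, and the three counts are assumed to have been fetched already (for example, by aggregate queries against MySQL).

```java
// Hypothetical helper: evaluates log2((n1 + 1) / (n2 + n3)) as stated in the question.
class ClassConditionalScore {

    // n1: occurrences of the token in the class
    // n2: all tokens in the class
    // n3: all tokens in all classes
    static double logScore(long n1, long n2, long n3) {
        if (n1 < 0 || n2 < 0 || n3 < 0 || n2 + n3 == 0) {
            throw new IllegalArgumentException("counts must be non-negative and n2 + n3 > 0");
        }
        // Math.log is the natural logarithm, so divide by ln(2) to get log base 2.
        return Math.log((n1 + 1.0) / (n2 + n3)) / Math.log(2.0);
    }

    public static void main(String[] args) {
        // log2(4 / 40) = log2(0.1), roughly -3.32
        System.out.println(logScore(3, 10, 30));
    }
}
```

The arithmetic itself is a single logarithm per (token, class) pair, so most of the cost in either environment is likely to come from producing the counts rather than from evaluating the formula.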

Which one is faster: doing the calculation in MySQL, or in Java (in which case we of course need to fetch the data from MySQL first)?

The Naive Bayes classifier is computationally simple, but it requires a lot of data manipulation. When applied to text, you are generally looking for many different terms inside the text.
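To give a sense of what that data manipulation looks like on the application side, here is a rough Java sketch that tallies token counts per class from labeled documents. Everything in it is an assumption for illustration: the `TokenCounter` name, the `{label, text}` array layout, and the whitespace tokenizer are not taken from the question.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: counts how often each token occurs in each class.
class TokenCounter {

    // Each element of labeledDocs is assumed to be {label, text}.
    // Result: counts.get(label).get(token) = occurrences of token in that class.
    static Map<String, Map<String, Integer>> countTokens(List<String[]> labeledDocs) {
        Map<String, Map<String, Integer>> counts = new HashMap<>();
        for (String[] doc : labeledDocs) {
            String label = doc[0];
            Map<String, Integer> perClass =
                    counts.computeIfAbsent(label, k -> new HashMap<>());
            // Naive tokenizer: lowercase, then split on runs of whitespace.
            for (String token : doc[1].toLowerCase().split("\\s+")) {
                if (!token.isEmpty()) {
                    perClass.merge(token, 1, Integer::sum);
                }
            }
        }
        return counts;
    }
}
```

With this approach every document has to be transferred to and held by the application, which is where corpus size starts to matter.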

I have a natural bias toward doing these types of calculations in SQL. I would at least argue that MySQL is a reasonable environment for doing this. Depending on the exact nature of the problem and the structure of your data, you might find that full-text indexing is helpful. I would be wary of working with a large corpus (many tens or hundreds of gigabytes) on the application side. My book "Data Analysis Using SQL and Excel" has a chapter devoted to Naive Bayes and similar types of models.
