
Naive Bayes Text Classification Algorithm

Hi there! I need help implementing the Naive Bayes text classification algorithm in Java to test my data set for research purposes. It is compulsory to implement the algorithm in Java rather than using the Weka or RapidMiner tools to get the results.


My data set has the following type of data:

    Doc  Words   Category

This means that the training words and the category of each training document (string) are known in advance. Part of the data set is given below:

             Doc      Words                                                              Category        
    Training
               1      Integration Communities Process Oriented Structures...(more string)       A
               2      Integration Communities Process Oriented Structures...(more string)       A
               3      Theory Upper Bound Routing Estimate global routing...(more string)        B
               4      Hardware Design Functional Programming Perfect Match...(more string)      C
               .
               .
               .
    Test
               5      Methodology Toolkit Integrate Technological  Organisational
               6      This test contain string naive bayes test text text test

The data set comes from a MySQL database, and it may contain multiple training strings as well as multiple test strings. I just need to implement the Naive Bayes text classification algorithm in Java.

The algorithm should follow the example given here: Table 13.1

Source: Read here
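The referenced table is not reproduced in this thread. Assuming it describes the usual multinomial Naive Bayes with add-one (Laplace) smoothing, a minimal plain-Java sketch could look like the following. The class name `NaiveBayesSketch` and its methods are hypothetical, and the training strings use only the visible words of the sample rows above.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Sketch of multinomial Naive Bayes with add-one smoothing (an assumed variant). */
public class NaiveBayesSketch {
    private final Map<String, Integer> docsPerClass = new HashMap<>();
    private final Map<String, Map<String, Integer>> wordCounts = new HashMap<>();
    private final Map<String, Integer> tokensPerClass = new HashMap<>();
    private final Set<String> vocabulary = new HashSet<>();
    private int totalDocs = 0;

    /** Adds one training document: whitespace-separated words plus its category. */
    public void train(String words, String category) {
        totalDocs++;
        docsPerClass.merge(category, 1, Integer::sum);
        Map<String, Integer> counts = wordCounts.computeIfAbsent(category, c -> new HashMap<>());
        for (String w : tokenize(words)) {
            counts.merge(w, 1, Integer::sum);
            tokensPerClass.merge(category, 1, Integer::sum);
            vocabulary.add(w);
        }
    }

    /** Returns the category with the highest log score, or null if nothing was trained. */
    public String classify(String words) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String category : docsPerClass.keySet()) {
            double score = logScore(words, category);
            if (score > bestScore) {
                bestScore = score;
                best = category;
            }
        }
        return best;
    }

    /** log P(c) + sum over tokens of log P(w|c), smoothed by adding one to every count. */
    public double logScore(String words, String category) {
        double score = Math.log(docsPerClass.get(category) / (double) totalDocs);
        Map<String, Integer> counts = wordCounts.get(category);
        int denominator = tokensPerClass.getOrDefault(category, 0) + vocabulary.size();
        for (String w : tokenize(words)) {
            if (!vocabulary.contains(w)) {
                continue; // words never seen in training are ignored
            }
            score += Math.log((counts.getOrDefault(w, 0) + 1) / (double) denominator);
        }
        return score;
    }

    /** Lower-cases the text and splits it on whitespace. */
    private static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase().split("\\s+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        NaiveBayesSketch nb = new NaiveBayesSketch();
        nb.train("Integration Communities Process Oriented Structures", "A");
        nb.train("Theory Upper Bound Routing Estimate global routing", "B");
        nb.train("Hardware Design Functional Programming Perfect Match", "C");
        System.out.println(nb.classify("global routing theory"));
    }
}
```

Working in log space avoids numeric underflow on long documents. A test string whose words all fall outside the training vocabulary is decided by the class priors alone.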


I can implement the algorithm in Java myself, but I would like to know whether there is a Java library, with source code and documentation available, that would let me simply test the results.

I need the results only once; this is just a one-time test.

So, to come to the point: can somebody point me to a good Java library that would help me code this algorithm in Java and process my data set to get the results, or give me any good ideas for doing this easily? Anything that helps is welcome.

I will be thankful for your help. Thanks in advance.

As per your requirement, you can use MLlib, the machine learning library from Apache Spark. MLlib is Spark's scalable machine learning library, consisting of common learning algorithms and utilities. There is also a Java code template for implementing the algorithm with the library. To begin with, you can:

Implement the Java skeleton for Naive Bayes provided on their site, as given below.

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.api.java.function.PairFunction;
import org.apache.spark.mllib.classification.NaiveBayes;
import org.apache.spark.mllib.classification.NaiveBayesModel;
import org.apache.spark.mllib.regression.LabeledPoint;
import scala.Tuple2;

JavaRDD<LabeledPoint> training = ... // training set
JavaRDD<LabeledPoint> test = ... // test set

final NaiveBayesModel model = NaiveBayes.train(training.rdd(), 1.0);

JavaPairRDD<Double, Double> predictionAndLabel = 
  test.mapToPair(new PairFunction<LabeledPoint, Double, Double>() {
    @Override public Tuple2<Double, Double> call(LabeledPoint p) {
      return new Tuple2<Double, Double>(model.predict(p.features()), p.label());
    }
  });
double accuracy = predictionAndLabel.filter(new Function<Tuple2<Double, Double>, Boolean>() {
    @Override public Boolean call(Tuple2<Double, Double> pl) {
      return pl._1().equals(pl._2());
    }
  }).count() / (double) test.count();

For testing your datasets, there is no better solution here than Spark SQL. MLlib fits into Spark's APIs perfectly. To start using it, I would recommend going through the MLlib API first and implementing the algorithm according to your needs. This is pretty easy using the library. For the next step, making your datasets available for processing, just use Spark SQL. I recommend you stick to this. I too hunted down multiple options before settling on this easy-to-use library and its seamless interoperation with some other technologies. I would have posted the complete code here to fit your question perfectly, but I think you are good to go.
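The skeleton above leaves `training` and `test` unspecified. Each labeled point needs a numeric label and a numeric feature vector, so the word strings and category letters have to be encoded first. The following plain-Java sketch shows one way to do that with term counts. The class `BagOfWordsSketch` and its methods are hypothetical names, not part of MLlib. The resulting values could then be wrapped with `new LabeledPoint(label, Vectors.dense(features))`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Sketch: encodes words as term-count vectors and categories as numeric labels. */
public class BagOfWordsSketch {
    private final Map<String, Integer> wordIndex = new LinkedHashMap<>();
    private final Map<String, Integer> labelIndex = new LinkedHashMap<>();

    /** Registers every word and category of the training rows; each row is {words, category}. */
    public void fit(List<String[]> rows) {
        for (String[] row : rows) {
            for (String w : tokenize(row[0])) {
                wordIndex.putIfAbsent(w, wordIndex.size());
            }
            labelIndex.putIfAbsent(row[1], labelIndex.size());
        }
    }

    /** Term counts over the training vocabulary; unknown words are dropped. */
    public double[] features(String words) {
        double[] vector = new double[wordIndex.size()];
        for (String w : tokenize(words)) {
            Integer i = wordIndex.get(w);
            if (i != null) {
                vector[i]++;
            }
        }
        return vector;
    }

    /** Numeric label of a category seen during fit (0.0, 1.0, ...). */
    public double label(String category) {
        Integer i = labelIndex.get(category);
        if (i == null) {
            throw new IllegalArgumentException("Unknown category: " + category);
        }
        return i;
    }

    /** Lower-cases the text and splits it on whitespace. */
    private static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase().split("\\s+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        BagOfWordsSketch bow = new BagOfWordsSketch();
        bow.fit(Arrays.asList(
                new String[] {"Integration Communities Process Oriented Structures", "A"},
                new String[] {"Theory Upper Bound Routing Estimate global routing", "B"}));
        System.out.println(Arrays.toString(bow.features("global routing routing")));
        System.out.println(bow.label("B"));
    }
}
```

Term counts are non-negative, which suits multinomial Naive Bayes. A hashing-based encoding would avoid keeping the vocabulary in memory, at the cost of possible collisions.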

You can use the Weka Java API and include it in your project if you do not want to use the GUI.

Here's a link to the documentation on incorporating a classifier in your code: https://weka.wikispaces.com/Use+WEKA+in+your+Java+code

Please take a look at the Bow toolkit.

It has a GNU license and source code. Some of its code includes:

Setting word vector weights according to Naive Bayes, TFIDF, and several other methods.

Performing test/train splits, and automatic classification tests.

It's not a Java library, but you could compile the C code to check that your Java implementation produces similar results for a given corpus.

I also spotted a decent Dr. Dobb's article with an implementation in Perl. Once again, not the desired Java, but it will give you the one-time results that you are asking for.

Hi, I think Spark will help you a lot: http://spark.apache.org/docs/1.2.0/mllib-naive-bayes.html You can even choose whichever language you think best fits your needs: Java, Python, or Scala!

Please use scikit-learn (from the SciPy ecosystem) in Python. It already has an implementation of what you need:

class sklearn.naive_bayes.MultinomialNB(alpha=1.0, fit_prior=True, class_prior=None)

scipy

You can use an algorithm platform like KNIME; it has a variety of classification algorithms (Naive Bayes included). You can run it with a GUI or through the Java API.

If you want to implement the Naive Bayes text classification algorithm in Java, the WEKA Java API is a better solution. The data set has to be in .arff format. Creating an .arff file from a MySQL database is very easy. Below are the Java code for the classifier and a link to a sample .arff file.
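As a rough illustration of the conversion step, the sketch below builds ARFF text from rows of words and category that have already been read from the database (the JDBC query itself is not shown). The class `ArffSketch`, the relation name, and the attribute names are assumptions for illustration, and category names are assumed to need no quoting. Note that a `string` attribute normally has to be turned into word features first, for example with Weka's `StringToWordVector` filter, before a Naive Bayes classifier can be trained on it.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Sketch: turns rows of {words, category} into ARFF text. */
public class ArffSketch {

    /** Builds an ARFF document with one string attribute and one nominal class attribute. */
    public static String toArff(String relation, List<String[]> rows) {
        // Collect the distinct categories in first-seen order for the nominal attribute.
        Set<String> categories = new LinkedHashSet<>();
        for (String[] row : rows) {
            categories.add(row[1]);
        }

        StringBuilder sb = new StringBuilder();
        sb.append("@relation ").append(relation).append("\n\n");
        sb.append("@attribute words string\n");
        sb.append("@attribute category {").append(String.join(",", categories)).append("}\n\n");
        sb.append("@data\n");
        for (String[] row : rows) {
            sb.append('\'').append(escape(row[0])).append("',").append(row[1]).append('\n');
        }
        return sb.toString();
    }

    /** Escapes backslashes and single quotes inside a quoted ARFF string value. */
    private static String escape(String s) {
        return s.replace("\\", "\\\\").replace("'", "\\'");
    }

    public static void main(String[] args) {
        List<String[]> rows = Arrays.asList(
                new String[] {"Integration Communities Process Oriented Structures", "A"},
                new String[] {"Theory Upper Bound Routing Estimate global routing", "B"});
        System.out.print(toArff("documents", rows));
    }
}
```

Writing the returned string to DataSet.arff gives a file with the same overall layout as the sample linked below: a header of `@attribute` lines followed by one `@data` row per document.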

Create a new text document and open it with Notepad. Copy and paste all the text from the link below, then save it as DataSet.arff. http://storm.cis.fordham.edu/~gweiss/data-mining/weka-data/weather.arff

Download the Weka Java API: http://www.java2s.com/Code/Jar/w/weka.htm

Code for the classifier:

import java.io.BufferedReader;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.IOException;
import java.util.Random;

import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.core.Instances;

// The class name is arbitrary; the original snippet showed only the main method.
public class NaiveBayesDemo {

    public static void main(String[] args) {

        // Read the ARFF file; try-with-resources closes the reader automatically.
        try (BufferedReader breader = new BufferedReader(new FileReader("DataSet.arff"))) {

            // The last attribute is the class (yes/no): with 40 attributes, its index is 39.
            Instances train = new Instances(breader);
            train.setClassIndex(train.numAttributes() - 1);

            // Build the model on the full training set so that it can be printed.
            NaiveBayes nB = new NaiveBayes();
            nB.buildClassifier(train);

            // Evaluate with 10-fold cross-validation and a fixed random seed.
            Evaluation eval = new Evaluation(train);
            eval.crossValidateModel(nB, train, 10, new Random(1));

            // Collect the report once, then print it (the text could also go to a text area).
            StringBuilder txtAreaShow = new StringBuilder();
            txtAreaShow.append("Run Information\n===================\n");
            txtAreaShow.append("Scheme: " + nB.getClass().getName() + "\n");
            txtAreaShow.append("Relation: " + train.relationName());

            txtAreaShow.append("\n\nClassifier Model (full training set)"
                    + "\n======================================\n");
            txtAreaShow.append(nB);

            txtAreaShow.append(eval.toSummaryString("\n\nSummary Results\n==================\n", true));
            txtAreaShow.append(eval.toClassDetailsString());
            txtAreaShow.append(eval.toMatrixString());

            System.out.println(txtAreaShow);

        } catch (FileNotFoundException ex) {
            System.err.println("File not found");
            System.exit(1);
        } catch (IOException ex) {
            System.err.println("Invalid input or output.");
            System.exit(1);
        } catch (Exception ex) {
            System.err.println("Exception occurred: " + ex.getMessage());
            System.exit(1);
        }
    }
}

You can take a look at Blayze. It's a pretty minimal Naive Bayes library for the JVM, written in Kotlin. It should be easy to follow.

Full disclosure: I'm one of the authors of Blayze.
