
Mahout 0.9 classify document for Naive Bayes

I'm very new to Mahout and I am working on classifying unstructured text documents.

I have followed this tutorial as I am using a Naive Bayes model. I have gotten to the point of training my classifier, but I am not sure how to convert a new document into a tfidf vector for classifying.

My data is stored as a TSV file which has a label and the text corresponding to it. I use seq2sparse to create the tfidf vectors that are required for training the model.

I then train the model using these tfidf vectors, which results in a Naive Bayes model.

Now I have a new unlabelled text document that I wish to classify using this trained model, but I am not sure how to convert it into a tfidf vector. If I use seq2sparse again, it will create a new set of dictionary files etc., and I assume these won't correspond to the dictionary created for the training set.

I have seen a manual implementation of creating the tfidf vector based on an already created dictionary file and label index at https://github.com/fredang/mahout-naive-bayes-example/blob/master/src/main/java/com/chimpler/example/bayes/Classifier.java but I was wondering if Mahout already provides some methods to do this, just the way it provides seq2sparse. I would rather use a supporting method than have to do it manually.

Perhaps this sample code can help you:

// Build a tf-idf vector for the new document using the *training* dictionary.
// For each distinct term `word` that occurs `count` times in the document:
org.apache.mahout.math.Vector vector = new RandomAccessSparseVector(dictionary.size());
TFIDF tfidf = new TFIDF();

Integer wordId = dictionary.get(word);          // id assigned to the word by seq2sparse
Long freq = documentFrequency.get(wordId);      // document frequency of the word in the training set

double tfIdfValue = tfidf.calculate(count, freq.intValue(),
        wordCount, documentCount);              // tf*idf weight

vector.set(wordId, tfIdfValue);
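
The dictionary and documentFrequency maps above have to come from the same seq2sparse output that the model was trained on (for the default layout that is dictionary.file-0 and df-count/part-r-00000 inside the seq2sparse output directory; treat those file names as assumptions about your setup). A minimal sketch of reading them back with Mahout's SequenceFileIterable, along the lines of what the Classifier.java you linked does:

import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.mahout.common.Pair;
import org.apache.mahout.common.iterator.sequencefile.SequenceFileIterable;

public class Seq2SparseArtifacts {

    // word -> wordId, as written by seq2sparse (dictionary.file-0)
    public static Map<String, Integer> readDictionary(Configuration conf, Path dictionaryPath) {
        Map<String, Integer> dictionary = new HashMap<String, Integer>();
        for (Pair<Text, IntWritable> pair
                : new SequenceFileIterable<Text, IntWritable>(dictionaryPath, true, conf)) {
            dictionary.put(pair.getFirst().toString(), pair.getSecond().get());
        }
        return dictionary;
    }

    // wordId -> document frequency, as written by seq2sparse (df-count)
    public static Map<Integer, Long> readDocumentFrequency(Configuration conf, Path dfPath) {
        Map<Integer, Long> documentFrequency = new HashMap<Integer, Long>();
        for (Pair<IntWritable, LongWritable> pair
                : new SequenceFileIterable<IntWritable, LongWritable>(dfPath, true, conf)) {
            documentFrequency.put(pair.getFirst().get(), pair.getSecond().get());
        }
        return documentFrequency;
    }
}

In the linked example, documentCount is taken from the df-count entry stored under the special key -1, wordCount is the number of tokens in the document being classified, and the class name Seq2SparseArtifacts above is just a placeholder.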

// The model is a matrix (wordId, labelId) => probability score
NaiveBayesModel model = NaiveBayesModel.materialize(
        new Path(modelPath), configuration);
StandardNaiveBayesClassifier classifier = new StandardNaiveBayesClassifier(model);

// With the classifier, we get one score for each label. The label with
// the highest score is the one the document is most likely to be
// associated with.
Vector resultVector = classifier.classifyFull(vector);
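
To turn resultVector into a readable class label, you can read back the label index that trainnb writes and take the index with the highest score. A rough sketch, assuming the model was trained with trainnb and that labelIndexPath points at its label index file (the helper class name here is made up):

import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.mahout.classifier.naivebayes.BayesUtils;
import org.apache.mahout.math.Vector;

public class LabelResolver {

    // labelId -> label name, as written by trainnb (the label index file)
    public static Map<Integer, String> readLabels(Configuration conf, Path labelIndexPath) {
        return BayesUtils.readLabelIndex(conf, labelIndexPath);
    }

    // classifyFull() returns one score per label; the highest-scoring index wins
    public static String bestLabel(Vector resultVector, Map<Integer, String> labels) {
        int bestLabelId = -1;
        double bestScore = -Double.MAX_VALUE;
        for (Vector.Element element : resultVector.all()) {
            if (element.get() > bestScore) {
                bestScore = element.get();
                bestLabelId = element.index();
            }
        }
        return labels.get(bestLabelId);
    }
}

The Classifier.java linked in the question does essentially the same thing, so this is mostly a restatement of that code in a smaller form.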
