Mahout 0.9: classifying a document with Naive Bayes

Question

I'm very new to Mahout and I am working on classifying unstructured text documents.

I have followed this tutorial, as I am using a Naive Bayes model. I have gotten to the point of training my classifier, but I am not sure how to convert a new document into a tf-idf vector for classification.

My data is stored as a TSV file in which each row has a label and the text corresponding to it. I use seq2sparse to create the tf-idf vectors that are required for training the model.

I then train on these tf-idf vectors, which results in a Naive Bayes model.

Now I have a new unlabelled text document that I wish to classify using this trained model, but I am not sure how to convert it into a tf-idf vector. If I run seq2sparse again, it will create a new set of dictionary files and so on, and I assume these won't correspond to the dictionary created for the training set.

I have seen a manual implementation that creates the tf-idf vector from an already created dictionary file and label index at https://github.com/fredang/mahout-naive-bayes-example/blob/master/src/main/java/com/chimpler/example/bayes/Classifier.java but I was wondering whether Mahout already provides methods for this, the way it provides seq2sparse. I would rather use a supported method than do it manually.

Answer 1

This sample code may help:

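import org.apache.hadoop.fs.Path;
import org.apache.mahout.classifier.naivebayes.NaiveBayesModel;
import org.apache.mahout.classifier.naivebayes.StandardNaiveBayesClassifier;
import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.vectorizer.TFIDF;

// Sparse vector for the new document. The cardinality must be larger than
// the largest word id (assumes ids run from 0 to dictionary.size() - 1).
Vector vector = new RandomAccessSparseVector(dictionary.size());
TFIDF tfidf = new TFIDF();

// Repeat for every word of the new document that is in the training dictionary.
// count = occurrences of the word in this document, freq = number of training
// documents containing it, wordCount = words in this document,
// documentCount = number of training documents.
Integer wordId = dictionary.get(word);  // id assigned to the word during training
double tfIdfValue = tfidf.calculate(count, freq.intValue(),
        wordCount, documentCount); // calculate tf*idf
vector.set(wordId, tfIdfValue);

// Model is a matrix (wordId, labelId) => probability score
NaiveBayesModel model = NaiveBayesModel.materialize(
        new Path(modelPath), configuration);
StandardNaiveBayesClassifier classifier = new StandardNaiveBayesClassifier(model);

// The classifier returns one score per label. The label with the highest
// score is the one the document is most likely to belong to.
Vector resultVector = classifier.classifyFull(vector);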
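
To make the two surrounding steps concrete, here is a minimal, Mahout-free sketch in plain Java. It weights a new document against a fixed training dictionary (words the dictionary has never seen are simply dropped) and then picks the index of the highest score. The class FixedDictionaryTfIdf and its methods are hypothetical names, not part of Mahout. The weighting sqrt(tf) * (ln(N / (df + 1)) + 1) is the Lucene-style formula that I believe Mahout's TFIDF class delegates to; treat that as an assumption and check it against your version.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration only; not a Mahout class.
final class FixedDictionaryTfIdf {

    // Assumed Lucene-style weighting: sqrt(tf) * (ln(numDocs / (df + 1)) + 1).
    static double weight(int tf, long df, int numDocs) {
        return Math.sqrt(tf) * (Math.log(numDocs / (double) (df + 1)) + 1.0);
    }

    // Builds a sparse wordId -> tf-idf map for one new document, using the
    // dictionary and document frequencies produced at training time.
    static Map<Integer, Double> vectorize(List<String> tokens,
                                          Map<String, Integer> dictionary,
                                          Map<Integer, Long> documentFrequency,
                                          int documentCount) {
        // Count occurrences of known words only; unknown words are skipped.
        Map<Integer, Integer> counts = new HashMap<>();
        for (String token : tokens) {
            Integer wordId = dictionary.get(token);
            if (wordId != null) {
                counts.merge(wordId, 1, Integer::sum);
            }
        }
        // Turn each raw count into a tf-idf weight.
        Map<Integer, Double> vector = new HashMap<>();
        for (Map.Entry<Integer, Integer> entry : counts.entrySet()) {
            long df = documentFrequency.getOrDefault(entry.getKey(), 0L);
            vector.put(entry.getKey(), weight(entry.getValue(), df, documentCount));
        }
        return vector;
    }

    // Returns the index of the highest score, i.e. the predicted label id.
    static int bestLabel(double[] scores) {
        int best = 0;
        for (int i = 1; i < scores.length; i++) {
            if (scores[i] > scores[best]) {
                best = i;
            }
        }
        return best;
    }
}
```

With Mahout itself, the same argmax would run over the elements of resultVector, and the winning index would then be mapped back to a label name through the label index written during training.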
