Mahout 0.9 classify document for Naive Bayes
I'm very new to Mahout and I am working on classifying unstructured text documents. I followed this tutorial because I am using a Naive Bayes model. I have gotten as far as training my classifier, but I am not sure how to convert a new document into a TF-IDF vector for classification.
My data is stored as a TSV file in which each row has a label and the text corresponding to it. I use seq2sparse to create the TF-IDF vectors required for training. I then train on these vectors, which results in a Naive Bayes model.
Now I have a new, unlabelled text document that I want to classify with this trained model, but I am not sure how to convert it into a TF-IDF vector. If I run seq2sparse again, it will create a new set of dictionary files and so on, and I assume these will not correspond to the dictionary created for the training set.
I have seen a manual implementation that builds the TF-IDF vector from an already created dictionary file and label index at https://github.com/fredang/mahout-naive-bayes-example/blob/master/src/main/java/com/chimpler/example/bayes/Classifier.java, but I was wondering whether Mahout already provides a method for this, the way it provides seq2sparse. I would rather use a supported method than do it manually.
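For reference, the manual approach amounts to roughly the following, as far as I understand it. This is a minimal sketch in plain Java with no Mahout dependency: the class and method names are hypothetical, the dictionary and document-frequency maps stand in for the files written during training, and the weighting formula sqrt(tf) * (ln(N / (df + 1)) + 1) is an assumption about the Lucene-style scoring that Mahout's TFIDF class appears to use, so check it against your version.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper: vectorizes a new document against a FIXED training dictionary.
class FixedDictionaryTfidf {

    // Assumed Lucene-style weight: sqrt(tf) * (ln(numDocs / (df + 1)) + 1).
    static double weight(int tf, int df, int numDocs) {
        return Math.sqrt(tf) * (Math.log(numDocs / (double) (df + 1)) + 1.0);
    }

    // tokens:     words of the new document, already tokenized
    // dictionary: word -> id, as produced for the training set
    // docFreq:    id -> number of training documents containing the word
    // numDocs:    number of training documents
    static Map<Integer, Double> vectorize(List<String> tokens,
                                          Map<String, Integer> dictionary,
                                          Map<Integer, Integer> docFreq,
                                          int numDocs) {
        // Count only words the training dictionary knows; unseen words are dropped.
        Map<Integer, Integer> counts = new HashMap<>();
        for (String token : tokens) {
            Integer id = dictionary.get(token);
            if (id != null) {
                counts.merge(id, 1, Integer::sum);
            }
        }
        // Sparse result: word id -> TF-IDF weight.
        Map<Integer, Double> vector = new TreeMap<>();
        for (Map.Entry<Integer, Integer> e : counts.entrySet()) {
            Integer df = docFreq.get(e.getKey());
            if (df != null) {
                vector.put(e.getKey(), weight(e.getValue(), df, numDocs));
            }
        }
        return vector;
    }
}
```

The point of the sketch is that the new document never contributes to the dictionary or the document frequencies; both come from the training run, which is why rerunning seq2sparse on the new document alone would give incompatible word ids.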
Perhaps this sample code can help:
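// Assumes dictionary (word -> id) and the document frequencies from training are already loaded.
// Sparse vector sized to the dictionary (assumes word ids run from 0 to dictionary.size() - 1)
org.apache.mahout.math.Vector vector = new RandomAccessSparseVector(dictionary.size());
TFIDF tfidf = new TFIDF();
// Repeat the next three statements for every word of the new document found in the dictionary
Integer wordId = dictionary.get(word); // id assigned to the word in the training dictionary
double tfIdfValue = tfidf.calculate(count, freq.intValue(),
wordCount, documentCount); // calculate tf*idf
vector.set(wordId, tfIdfValue);
// Model is a matrix (wordId, labelId) => probability score
NaiveBayesModel model = NaiveBayesModel.materialize(
new Path(modelPath), configuration);
StandardNaiveBayesClassifier classifier = new StandardNaiveBayesClassifier(
model);
// With the classifier, we get one score for each label. The label with
// the highest score is the one the document is most likely to be
// associated with
Vector resultVector = classifier.classifyFull(vector);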
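The last step, picking the label with the highest score, could look something like the sketch below. It is plain Java so it runs on its own: the scores array stands in for the contents of resultVector, the labels map stands in for the label index written during training, and the class and method names are hypothetical.

```java
import java.util.Map;

// Hypothetical helper: picks the best label from one score per label.
class BestLabel {

    // Returns the index of the largest score.
    static int argMax(double[] scores) {
        if (scores.length == 0) {
            throw new IllegalArgumentException("no scores to compare");
        }
        int best = 0;
        for (int i = 1; i < scores.length; i++) {
            if (scores[i] > scores[best]) {
                best = i;
            }
        }
        return best;
    }

    // Maps the winning index back to its label name.
    static String bestLabel(double[] scores, Map<Integer, String> labels) {
        return labels.get(argMax(scores));
    }
}
```

With Mahout itself, the same lookup can probably be done by iterating over the elements of resultVector and keeping the index of the largest value, then resolving that index through the label index.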