I am trying to classify a CSV file using Mahout. My understanding is that I first need to convert the data in the CSV into vectors that one of the Mahout classification algorithms can then use. My CSV file consists of text and word-like values and has multiple classes.
I have searched here and found some vague explanations of how to do this, but no examples. Can anyone please provide a simple example of how to accomplish this? Or is there a utility available that does this for you?
I was assuming this would be a very common task, but I couldn't find any clear examples.
Any help will be greatly appreciated.
You have text and word-like values, so the 20 newsgroups example is probably a good starting point. It is a clear example, and its code is easy to adapt to your CSV file.
Here is a working link to the 20 newsgroups example for the latest version of Mahout:
Only the countWords method needs adapting, because the TokenStream API has changed. Here is code that works with the latest version of Mahout:
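import java.io.IOException;
import java.io.Reader;
import java.util.Collection;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

private static void countWords(Analyzer analyzer, Collection<String> words, Reader in) throws IOException {
  // use the provided analyzer to tokenize the input stream;
  // try-with-resources closes the stream even if tokenizing fails
  try (TokenStream ts = analyzer.tokenStream("text", in)) {
    CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
    // reset() must be called before the first incrementToken()
    ts.reset();
    // for each word in the stream, minus non-word stuff, add word to collection
    while (ts.incrementToken()) {
      words.add(term.toString());
    }
    ts.end();
  }
  /*overallCounts.addAll(words);*/
}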
I hope this helps. I adapted this example to work with a CSV file and it worked.
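To make the CSV adaptation more concrete, here is a minimal sketch of the general idea of turning one CSV row into a label plus a fixed-size feature vector by hashing its tokens. It is plain Java and deliberately does not use Mahout's or Lucene's classes, so it is a stand-in for the technique rather than the code from the 20 newsgroups example. The class name CsvRowVectorizer, the "label,text" row layout, and the naive split on the first comma are all assumptions made for illustration. In a real project you would probably tokenize with a Lucene Analyzer as in countWords above and fill a Mahout vector through one of its feature encoders.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

/** Plain-Java illustration of feature hashing for CSV rows; NOT Mahout's API. */
final class CsvRowVectorizer {

    /** Hypothetical holder: class label plus hashed token counts. */
    static final class LabeledRow {
        final String label;
        final double[] features;

        LabeledRow(String label, double[] features) {
            this.label = label;
            this.features = features;
        }
    }

    // Lowercase the text and split on anything that is not a letter or digit.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Hashing trick: each token increments the slot selected by its hash code.
    // Different tokens may collide in the same slot; a larger cardinality reduces that.
    static double[] hashTokens(List<String> tokens, int cardinality) {
        double[] vector = new double[cardinality];
        for (String t : tokens) {
            vector[Math.floorMod(t.hashCode(), cardinality)] += 1.0;
        }
        return vector;
    }

    // Assumed layout "label,free text". Splits on the first comma only;
    // quoted fields and escaped commas are not handled.
    static LabeledRow vectorize(String csvLine, int cardinality) {
        int comma = csvLine.indexOf(',');
        if (comma < 0) {
            throw new IllegalArgumentException("expected 'label,text' but got: " + csvLine);
        }
        String label = csvLine.substring(0, comma).trim();
        String text = csvLine.substring(comma + 1);
        return new LabeledRow(label, hashTokens(tokenize(text), cardinality));
    }

    public static void main(String[] args) {
        LabeledRow row = vectorize("sports,The match was a great match", 16);
        System.out.println(row.label + " -> " + Arrays.toString(row.features));
    }
}
```

Each row yields a vector of the same length regardless of how many words it contains, which is what a classifier needs as input. If your CSV has quoted fields or several text columns, a proper CSV parser and one hashed vector per row (or per column) would be a safer choice than the split used here.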