How to parse a CSV file so that it can be classified by Mahout

I am trying to classify a CSV file using Mahout. My understanding is that I first need to convert the data in the CSV into vectors that can then be used by one of the Mahout classification algorithms. My CSV file consists of text and word-like values and multiple classes.

I have searched here and found some vague explanations of how to do this, but couldn't find any examples. Can anyone please provide a simple example of how to accomplish this? Or is there a utility available that does this for you?

I was assuming this would be a very common task, but I couldn't find any clear examples.

Any help will be greatly appreciated.

You have some text and word-like values, so you should probably use the 20 newsgroups example for inspiration. It is a good example, and you can easily adapt its code to work with your CSV file.

Here is a working link to the 20 newsgroups example for the latest version of Mahout:

https://github.com/jpatanooga/MahoutExamples/blob/master/src/main/java/com/cloudera/mahout/classification/sgd/TwentyNewsgroups.java
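
One difference to handle when adapting the example: the 20 newsgroups data is laid out as one file per document, with the directory name as the class, while a CSV file holds one document per row. A minimal sketch of splitting a row into a label and its remaining cells is below. It uses only the JDK; `CsvRow`, `splitLine`, and the `labelColumn` parameter are hypothetical names for illustration, and which column holds the class depends on your file. It handles double-quoted cells but not cells that contain line breaks.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: one CSV row split into a class label and the other cells.
final class CsvRow {
    final String label;
    final List<String> fields;

    CsvRow(String label, List<String> fields) {
        this.label = label;
        this.fields = fields;
    }

    // Split one CSV line on commas, honouring double-quoted cells
    // and the "" escape for a literal quote inside a quoted cell.
    static List<String> splitLine(String line) {
        List<String> cells = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (inQuotes) {
                if (c == '"') {
                    if (i + 1 < line.length() && line.charAt(i + 1) == '"') {
                        current.append('"'); // escaped quote
                        i++;
                    } else {
                        inQuotes = false;    // closing quote
                    }
                } else {
                    current.append(c);
                }
            } else if (c == '"') {
                inQuotes = true;
            } else if (c == ',') {
                cells.add(current.toString());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        cells.add(current.toString());
        return cells;
    }

    // Assumption: labelColumn is the zero-based index of the class column.
    static CsvRow parse(String line, int labelColumn) {
        List<String> cells = splitLine(line);
        String label = cells.get(labelColumn).trim();
        List<String> fields = new ArrayList<>(cells);
        fields.remove(labelColumn); // remove by index, leaving only the feature cells
        return new CsvRow(label, fields);
    }
}
```

Each parsed row can then play the role that one newsgroup file plays in the linked example: the label is the target class and the remaining cells are the text to tokenize.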

Only the countWords method needs adapting, because the TokenStream API has changed. Here is working code for the latest version of Mahout:

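import java.io.IOException;
import java.io.Reader;
import java.util.Collection;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

private static void countWords(Analyzer analyzer, Collection<String> words, Reader in) throws IOException {

        // use the provided analyzer to tokenize the input stream;
        // try-with-resources closes the stream even if tokenizing fails
        try (TokenStream ts = analyzer.tokenStream("text", in)) {
            // the term attribute is updated in place on every incrementToken() call
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            ts.reset(); // must be called before the first incrementToken()

            // for each word in the stream, minus non-word stuff, add word to collection
            while (ts.incrementToken()) {
                words.add(term.toString());
            }
            ts.end();
        }

        /*overallCounts.addAll(words);*/
    }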
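
The words collected by countWords still have to become a numeric vector before a classifier can use them. A rough sketch of the hashing approach follows: each token is hashed to an index in a fixed-size array and counted there. It is written against the plain JDK so that it runs on its own; `HashedTextEncoder` and its methods are hypothetical names, the regex split only stands in for the Lucene analyzer, and in a real Mahout program you would likely use Mahout's own encoders (for example `StaticWordValueEncoder`) and vector classes instead.

```java
import java.util.Locale;

// Hypothetical stand-in for a hashed feature encoder, plain JDK only.
final class HashedTextEncoder {
    private final int numFeatures;

    HashedTextEncoder(int numFeatures) {
        if (numFeatures <= 0) {
            throw new IllegalArgumentException("numFeatures must be positive");
        }
        this.numFeatures = numFeatures;
    }

    // Map a (field, token) pair to a slot in the vector.
    // floorMod keeps the index non-negative even for negative hash codes.
    int indexFor(String fieldName, String token) {
        return Math.floorMod((fieldName + ":" + token).hashCode(), numFeatures);
    }

    // Lower-case the text, split on anything that is not a letter or digit,
    // and count each token in its hashed slot. Different tokens may collide.
    double[] encode(String fieldName, String text) {
        double[] vector = new double[numFeatures];
        String[] tokens = text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+");
        for (String token : tokens) {
            if (!token.isEmpty()) {
                vector[indexFor(fieldName, token)] += 1.0;
            }
        }
        return vector;
    }
}
```

The vector size is a trade-off: a larger size means fewer collisions between unrelated words but more memory per row. Prefixing the token with the field name keeps the same word in two different CSV columns from landing in the same slot by default.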

I hope this helps. I adapted this example to work with a CSV file, and it worked.
