简体   繁体   中英

Weird formatting issue in hadoop map/reduce program in java

I have a csv file with following sample records.

| publisher  | site               | ad clicks | ad views |
|============|====================|===========|==========|
| publisher1 | www.sampleSite.com |        50 |       75 |
| publisher1 | www.sampleSite2.com|        10 |       40 |
| publisher2 | www.newSite1.com   |       100 |      175 |
| publisher2 | www.newSite2.com   |        50 |       65 |

Using map/reduce in java, I am trying to sum all ad clicks and ad views for every publisher. So output should be like this

publisher1 60, 115
publisher2 150, 240

I have written following code.

import java.io.IOException;
import java.util.*;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.*;

public class SGSClickViewStats
{
    public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, Text> 
    {
        int recNo = 0;
        private Text publisherName = new Text();
        private Text mapOpValue    = new Text();

        public void map(LongWritable key, Text inputs, OutputCollector <Text, Text> output, Reporter rptr) throws IOException{
            String line = inputs.toString();
            String [] fields = line.split(",");
            String pubName = formatStats(fields[0]);
            String click   = fields[2];
            String views   = fields[3];
            // ***** send stats to reducer as a string separated by :
            String value   = click+":"+views;

            mapOpValue.set(formatStats(value));
            publisherName.set(pubName);   

            output.collect(publisherName, mapOpValue);
        }

        private String formatStats(String stat) {
            while((stat.indexOf("\"") >= 0) && (stat.indexOf(",")) >= 0){
                stat = stat.replace("\"","");
                stat = stat.replace(",","");
            }
            return stat;
        }
    }

    public static class Reduce extends MapReduceBase implements Reducer< Text, Text, Text, Text >
    {
        private Text pubName = new Text();
        public void reduce(Text key, Iterator<Text> value, OutputCollector<Text, Text> oc, Reporter rptr) throws IOException {
            int views     = 0;
            int clicks    = 0;
            String val    = "";
            String opVal  = "";
            Text textOpVal= new Text();

            while(value.hasNext()){
                val = value.next().toString();

                String [] tokens = val.split(":");

                try {
                    clicks = clicks + Integer.parseInt(tokens[0]);
                    views  = views  + Integer.parseInt(tokens[1]);
                } catch (Exception e) {
                    System.out.println("This is Command HQ, code red\nError Message: "+e.getLocalizedMessage()+" Error class: "+e.getClass()+"Extra, Array length: "+tokens.length);
                }
            }           

            try {
                            // ******* want to separate stats by comma but can't !!
                opVal = Integer.toString(clicks) + ":"+ Integer.toString(views);
            } catch (Exception e) {
                System.out.println("This is Command HQ, code Yellow\nError Message: "+e.getLocalizedMessage()+" Error class: "+e.getClass());
            }
            textOpVal.set(opVal);
            oc.collect(key, textOpVal);     
        }
    }

    public static void main(String [] args) throws Exception {
        JobConf jc = new JobConf(SGSClickViewStats.class);
        jc.setJobName("SGSClickViewStats");
        jc.setOutputKeyClass(Text.class);
        jc.setOutputValueClass(Text.class);

        jc.setMapperClass(Map.class);
        jc.setReducerClass(Reduce.class);
        jc.setCombinerClass(Reduce.class);

        jc.setInputFormat(TextInputFormat.class);
        jc.setOutputFormat(TextOutputFormat.class);

        FileInputFormat.setInputPaths(jc, new Path(args[0]));
        FileOutputFormat.setOutputPath(jc, new Path(args[1]));

        JobClient.runJob(jc);
    }
}

This program is working fine, but in output of reducer, I can't have final stats separated by a comma, which is at second comment with * * . If I do that, all my stats become 0, 0. I get this error when I try to comma separate.

Error Message: For input string: "50, 75" 
Error class: class java.lang.NumberFormatExceptionExtra, Array length: 1

Array length is length of tokens array in reducer function, as I am sending output from mapper to reducer colon (:) separated tokens should have 2 elements, I see one when I set output of reducer comma separated.

I have referred many articles, but I couldn't find an answer. I sincerely hope that someone helps !! :)

Why are you using your Reducer as combiner? By the time you data come to Reduce phase it is already "publisher\\tclicks,views" format I guess that could be causing problem.

Can you comment following line and check?

jc.setCombinerClass(Reduce.class);

NumberFormatException is definitely thown by Integer.parseInt , so your error must be in the first try when you are computing the clicks and views sum. Check the output passed by the mapper. I'm pretty sure you are not formatting the strings correctly in the mapper.

Edit : To make it clear for future readers: the problem was the usage of the Reducer class also as a Combiner by mistake, thus producing a different output than expected from the map phase.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM