
Weird formatting issue in a Hadoop map/reduce program in Java

I have a CSV file with the following sample records.

| publisher  | site               | ad clicks | ad views |
|============|====================|===========|==========|
| publisher1 | www.sampleSite.com |        50 |       75 |
| publisher1 | www.sampleSite2.com|        10 |       40 |
| publisher2 | www.newSite1.com   |       100 |      175 |
| publisher2 | www.newSite2.com   |        50 |       65 |

Using map/reduce in Java, I am trying to sum all ad clicks and ad views for every publisher, so the output should look like this:

publisher1 60, 115
publisher2 150, 240

I have written the following code.

import java.io.IOException;
import java.util.*;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.*;

public class SGSClickViewStats
{
    public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, Text> 
    {
        int recNo = 0;
        private Text publisherName = new Text();
        private Text mapOpValue    = new Text();

        public void map(LongWritable key, Text inputs, OutputCollector <Text, Text> output, Reporter rptr) throws IOException{
            String line = inputs.toString();
            String [] fields = line.split(",");
            String pubName = formatStats(fields[0]);
            String click   = fields[2];
            String views   = fields[3];
            // ***** send stats to reducer as a string separated by :
            String value   = click+":"+views;

            mapOpValue.set(formatStats(value));
            publisherName.set(pubName);   

            output.collect(publisherName, mapOpValue);
        }

        private String formatStats(String stat) {
            while((stat.indexOf("\"") >= 0) && (stat.indexOf(",")) >= 0){
                stat = stat.replace("\"","");
                stat = stat.replace(",","");
            }
            return stat;
        }
    }

    public static class Reduce extends MapReduceBase implements Reducer< Text, Text, Text, Text >
    {
        private Text pubName = new Text();
        public void reduce(Text key, Iterator<Text> value, OutputCollector<Text, Text> oc, Reporter rptr) throws IOException {
            int views     = 0;
            int clicks    = 0;
            String val    = "";
            String opVal  = "";
            Text textOpVal= new Text();

            while(value.hasNext()){
                val = value.next().toString();

                String [] tokens = val.split(":");

                try {
                    clicks = clicks + Integer.parseInt(tokens[0]);
                    views  = views  + Integer.parseInt(tokens[1]);
                } catch (Exception e) {
                    System.out.println("This is Command HQ, code red\nError Message: "+e.getLocalizedMessage()+" Error class: "+e.getClass()+"Extra, Array length: "+tokens.length);
                }
            }           

            try {
                // ******* want to separate stats by comma but can't !!
                opVal = Integer.toString(clicks) + ":"+ Integer.toString(views);
            } catch (Exception e) {
                System.out.println("This is Command HQ, code Yellow\nError Message: "+e.getLocalizedMessage()+" Error class: "+e.getClass());
            }
            textOpVal.set(opVal);
            oc.collect(key, textOpVal);     
        }
    }

    public static void main(String [] args) throws Exception {
        JobConf jc = new JobConf(SGSClickViewStats.class);
        jc.setJobName("SGSClickViewStats");
        jc.setOutputKeyClass(Text.class);
        jc.setOutputValueClass(Text.class);

        jc.setMapperClass(Map.class);
        jc.setReducerClass(Reduce.class);
        jc.setCombinerClass(Reduce.class);

        jc.setInputFormat(TextInputFormat.class);
        jc.setOutputFormat(TextOutputFormat.class);

        FileInputFormat.setInputPaths(jc, new Path(args[0]));
        FileOutputFormat.setOutputPath(jc, new Path(args[1]));

        JobClient.runJob(jc);
    }
}

This program works fine, but I can't get the final stats in the reducer output separated by a comma (see the second comment marked with asterisks). If I do that, all my stats become 0, 0, and I get this error:

Error Message: For input string: "50, 75" 
Error class: class java.lang.NumberFormatExceptionExtra, Array length: 1

Array length is the length of the tokens array in the reduce function. Since the mapper sends colon-separated (:) values to the reducer, tokens should have 2 elements, but I see only one when I make the reducer output comma separated.

I have consulted many articles, but I couldn't find an answer. I sincerely hope that someone can help! :)

Why are you using your Reducer as a combiner? By the time your data reaches the reduce phase, it is already in "publisher\tclicks,views" format, which I guess could be causing the problem.

Can you comment out the following line and check?

jc.setCombinerClass(Reduce.class);
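
The effect can be reproduced without Hadoop. The following is a minimal sketch, assuming the combiner simply runs the question's reduce logic once before the real reducer sees the data. The class name CombinerFormatDemo and the separator parameter are hypothetical and exist only for illustration.

```java
import java.util.Arrays;
import java.util.List;

public class CombinerFormatDemo {
    // Mirrors the question's reduce logic: parse "clicks:views" values, sum them,
    // and join the totals with the given separator.
    public static String reduce(List<String> values, String separator) {
        int clicks = 0;
        int views = 0;
        for (String val : values) {
            String[] tokens = val.split(":");
            try {
                clicks += Integer.parseInt(tokens[0]);
                views += Integer.parseInt(tokens[1]);
            } catch (Exception e) {
                // As in the question's catch block, the failing value is skipped.
            }
        }
        return clicks + separator + views;
    }

    public static void main(String[] args) {
        List<String> mapOutput = Arrays.asList("50:75", "10:40");

        // Colon output: the combiner's result can still be parsed by the reducer.
        String combinedColon = reduce(mapOutput, ":");
        System.out.println(reduce(Arrays.asList(combinedColon), ":"));

        // Comma output: the combiner emits "60, 115", which contains no ":" to
        // split on, so Integer.parseInt fails and the totals stay at zero.
        String combinedComma = reduce(mapOutput, ", ");
        System.out.println(reduce(Arrays.asList(combinedComma), ", "));
    }
}
```

With the colon separator the second pass yields 60:115; with the comma separator it yields 0, 0, which matches the symptom described in the question.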

NumberFormatException is definitely thrown by Integer.parseInt, so your error must be in the first try block, where you compute the clicks and views sums. Check the output passed by the mapper. I'm pretty sure you are not formatting the strings correctly in the mapper.

Edit: To make it clear for future readers: the problem was that the Reducer class was also used as a Combiner by mistake, so the reduce phase received input in a different format than the one the map phase produces.
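
If pre-aggregation is still wanted, one option is to give the combiner its own class whose output keeps the map output format, and to produce the comma-separated form only in the reducer. Below is a plain-Java sketch of that idea, not the original poster's final code. The class SeparateCombinerSketch and its methods are hypothetical stand-ins for two separate Hadoop classes, of which only the combiner one would be passed to jc.setCombinerClass.

```java
import java.util.Arrays;
import java.util.List;

public class SeparateCombinerSketch {
    // Shared summing step: parses "clicks:views" values and returns {clicks, views}.
    public static int[] sum(List<String> values) {
        int[] totals = new int[2];
        for (String val : values) {
            String[] tokens = val.split(":");
            totals[0] += Integer.parseInt(tokens[0].trim());
            totals[1] += Integer.parseInt(tokens[1].trim());
        }
        return totals;
    }

    // Combiner role: the output stays in the map output format ("clicks:views").
    public static String combine(List<String> values) {
        int[] t = sum(values);
        return t[0] + ":" + t[1];
    }

    // Reducer role: only this step produces the final "clicks, views" format.
    public static String reduce(List<String> values) {
        int[] t = sum(values);
        return t[0] + ", " + t[1];
    }

    public static void main(String[] args) {
        // Case 1: the combiner pre-aggregates both records for publisher1.
        String partial = combine(Arrays.asList("50:75", "10:40"));
        System.out.println("publisher1 " + reduce(Arrays.asList(partial)));

        // Case 2: the combiner does not run at all; the result is the same.
        System.out.println("publisher1 " + reduce(Arrays.asList("50:75", "10:40")));
    }
}
```

Both cases print publisher1 60, 115. That matters because Hadoop may apply a combiner zero, one, or several times, so its output has to be valid reducer input either way.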
