Weird formatting issue in Hadoop map/reduce program in Java
I have a CSV file with the following sample records.
| publisher | site | ad clicks | ad views |
|============|====================|===========|==========|
| publisher1 | www.sampleSite.com | 50 | 75 |
| publisher1 | www.sampleSite2.com| 10 | 40 |
| publisher2 | www.newSite1.com | 100 | 175 |
| publisher2 | www.newSite2.com | 50 | 65 |
Using map/reduce in Java, I am trying to sum all ad clicks and ad views for every publisher, so the output should look like this:
publisher1 60, 115
publisher2 150, 240
I have written the following code:
import java.io.IOException;
import java.util.*;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.*;

public class SGSClickViewStats
{
    public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, Text>
    {
        int recNo = 0;
        private Text publisherName = new Text();
        private Text mapOpValue = new Text();

        public void map(LongWritable key, Text inputs, OutputCollector<Text, Text> output, Reporter rptr) throws IOException {
            String line = inputs.toString();
            String[] fields = line.split(",");
            String pubName = formatStats(fields[0]);
            String click = fields[2];
            String views = fields[3];
            // ***** send stats to reducer as a string separated by :
            String value = click + ":" + views;
            mapOpValue.set(formatStats(value));
            publisherName.set(pubName);
            output.collect(publisherName, mapOpValue);
        }

        private String formatStats(String stat) {
            while ((stat.indexOf("\"") >= 0) && (stat.indexOf(",")) >= 0) {
                stat = stat.replace("\"", "");
                stat = stat.replace(",", "");
            }
            return stat;
        }
    }

    public static class Reduce extends MapReduceBase implements Reducer<Text, Text, Text, Text>
    {
        private Text pubName = new Text();

        public void reduce(Text key, Iterator<Text> value, OutputCollector<Text, Text> oc, Reporter rptr) throws IOException {
            int views = 0;
            int clicks = 0;
            String val = "";
            String opVal = "";
            Text textOpVal = new Text();
            while (value.hasNext()) {
                val = value.next().toString();
                String[] tokens = val.split(":");
                try {
                    clicks = clicks + Integer.parseInt(tokens[0]);
                    views = views + Integer.parseInt(tokens[1]);
                } catch (Exception e) {
                    System.out.println("This is Command HQ, code red\nError Message: " + e.getLocalizedMessage() + " Error class: " + e.getClass() + "Extra, Array length: " + tokens.length);
                }
            }
            try {
                // ******* want to separate stats by comma but can't !!
                opVal = Integer.toString(clicks) + ":" + Integer.toString(views);
            } catch (Exception e) {
                System.out.println("This is Command HQ, code Yellow\nError Message: " + e.getLocalizedMessage() + " Error class: " + e.getClass());
            }
            textOpVal.set(opVal);
            oc.collect(key, textOpVal);
        }
    }

    public static void main(String[] args) throws Exception {
        JobConf jc = new JobConf(SGSClickViewStats.class);
        jc.setJobName("SGSClickViewStats");
        jc.setOutputKeyClass(Text.class);
        jc.setOutputValueClass(Text.class);
        jc.setMapperClass(Map.class);
        jc.setReducerClass(Reduce.class);
        jc.setCombinerClass(Reduce.class);
        jc.setInputFormat(TextInputFormat.class);
        jc.setOutputFormat(TextOutputFormat.class);
        FileInputFormat.setInputPaths(jc, new Path(args[0]));
        FileOutputFormat.setOutputPath(jc, new Path(args[1]));
        JobClient.runJob(jc);
    }
}
This program works fine, but in the reducer output I can't separate the final stats with a comma (see the second comment marked with asterisks). If I do that, all my stats become 0, 0. I get this error when I try to separate them with a comma:
Error Message: For input string: "50, 75"
Error class: class java.lang.NumberFormatExceptionExtra, Array length: 1
Array length is the length of the tokens array in the reducer function. Since the mapper sends its output to the reducer separated by a colon (:), tokens should have 2 elements, but I see only one when I make the reducer's output comma-separated.
I have referred to many articles, but I couldn't find an answer. I sincerely hope that someone can help! :)
Why are you using your Reducer as a combiner? By the time your data reaches the reduce phase, it is already in "publisher\tclicks,views" format, and I guess that could be causing the problem.
Can you comment out the following line and check?
jc.setCombinerClass(Reduce.class);
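To see why the combiner matters, here is a minimal plain-Java sketch (no Hadoop involved) that mimics running the same reduce logic twice, first as a combiner and then as the reducer. The class name `CombinerMismatchDemo` and the `separator` parameter are made up for this illustration; the parsing follows the question's `reduce` method.

```java
import java.util.Arrays;
import java.util.List;

class CombinerMismatchDemo {
    // Mirrors the question's reduce(): every input value must look like "clicks:views".
    // The separator argument controls how the summed output is formatted.
    static String reduce(List<String> values, String separator) {
        int clicks = 0;
        int views = 0;
        for (String val : values) {
            String[] tokens = val.split(":");
            try {
                clicks += Integer.parseInt(tokens[0]);
                views += Integer.parseInt(tokens[1]);
            } catch (Exception e) {
                // Same failure the question reports: one token, not parseable as an int.
                System.out.println("Error: " + e.getMessage() + ", array length: " + tokens.length);
            }
        }
        return clicks + separator + views;
    }

    public static void main(String[] args) {
        List<String> mapOutput = Arrays.asList("50:75", "10:40");

        // First pass plays the combiner, second pass plays the reducer.
        String combined = reduce(mapOutput, ", ");
        String finalValue = reduce(Arrays.asList(combined), ", ");

        System.out.println("combiner output: " + combined);
        System.out.println("reducer output:  " + finalValue);
    }
}
```

With the comma separator, the second pass receives a value with no colon in it, so the split yields a single token and the sums stay at 0. With ":" as the separator, the combiner's output still matches what the reducer expects, which is why the original colon version appeared to work.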
NumberFormatException is definitely thrown by Integer.parseInt, so your error must be in the first try block, where you compute the clicks and views sums. Check the output passed by the mapper. I'm pretty sure you are not formatting the strings correctly in the mapper.
Edit: To make it clear for future readers: the problem was that the Reducer class was also used as a Combiner by mistake, so the reducer received values in a different format from what the map phase emits.
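Besides simply removing the combiner, one possible way to keep it is to give it its own logic that preserves the "clicks:views" format and to let only the final reducer switch to commas. The following is a plain-Java sketch of that idea, not Hadoop code; `SeparateCombinerSketch`, `sum`, `combine`, and `reduce` are hypothetical names. In the actual job, this would presumably mean registering a separate combiner class via `jc.setCombinerClass(...)` instead of `Reduce.class`.

```java
import java.util.Arrays;
import java.util.List;

class SeparateCombinerSketch {
    // Shared step: parse "clicks:views" values and return {totalClicks, totalViews}.
    static int[] sum(List<String> values) {
        int clicks = 0;
        int views = 0;
        for (String val : values) {
            String[] tokens = val.split(":");
            clicks += Integer.parseInt(tokens[0].trim());
            views += Integer.parseInt(tokens[1].trim());
        }
        return new int[] {clicks, views};
    }

    // Combiner: output keeps the mapper's format, so it can be fed to reduce().
    static String combine(List<String> values) {
        int[] totals = sum(values);
        return totals[0] + ":" + totals[1];
    }

    // Reducer: only this final step formats the totals with a comma.
    static String reduce(List<String> values) {
        int[] totals = sum(values);
        return totals[0] + ", " + totals[1];
    }

    public static void main(String[] args) {
        // publisher2's two records, combined separately as if on two map tasks.
        String partialA = combine(Arrays.asList("100:175"));
        String partialB = combine(Arrays.asList("50:65"));
        System.out.println("publisher2 " + reduce(Arrays.asList(partialA, partialB)));

        // The reducer must also cope with raw map output, since a combiner may not run at all.
        System.out.println("publisher1 " + reduce(Arrays.asList("50:75", "10:40")));
    }
}
```

The design point is that a combiner's output type and format have to match the mapper's output, because Hadoop may apply the combiner zero, one, or several times before the reducer sees the data.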