
MapReduce job: weird output?

I'm writing my first MapReduce job. Something simple: just counting alphanumeric characters from a file. I've managed to generate my jar file and run it, but I can't find the output of the MR job, apart from the debugging output. Could you please help me?

My application class:


import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class CharacterCountDriver extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {

        // Create a Job using the configuration processed by ToolRunner
        Job job = Job.getInstance(getConf());

        // Process custom command-line options
        Path in = new Path("/tmp/filein");
        Path out = new Path("/tmp/fileout");

        // Specify various job-specific parameters     
        job.setJobName("Character-Count");

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        job.setMapperClass(CharacterCountMapper.class);
        job.setReducerClass(CharacterCountReducer.class);

        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);

        FileInputFormat.setInputPaths(job, in);
        FileOutputFormat.setOutputPath(job, out);

        job.setJarByClass(CharacterCountDriver.class);

        job.submit();
        return 0;
    }

    public static void main(String[] args) throws Exception {
        // Let ToolRunner handle generic command-line options 
        int res = ToolRunner.run(new Configuration(), new CharacterCountDriver(), args);

        System.exit(res);
      }
}

Then my mapper class:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class CharacterCountMapper extends
        Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);

    @Override
    protected void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        String strValue = value.toString();
        StringTokenizer chars = new StringTokenizer(strValue.replaceAll("[^a-zA-Z0-9]", ""));
        while (chars.hasMoreTokens()) {
            context.write(new Text(chars.nextToken()), one);
        }
    }
}
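As a standalone sketch (plain Java, no Hadoop; the class and method names are made up for illustration), here is what this map logic does to one line of input — note that the `replaceAll` strips whitespace along with everything else non-alphanumeric before the tokenizer runs:

```java
import java.util.StringTokenizer;

public class TokenizerDemo {

    // Mirrors the map() body: strip non-alphanumerics, then tokenize.
    static int tokenCount(String line) {
        StringTokenizer chars =
                new StringTokenizer(line.replaceAll("[^a-zA-Z0-9]", ""));
        int count = 0;
        while (chars.hasMoreTokens()) {
            chars.nextToken();
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // Spaces are removed too, so the whole line survives as one token.
        System.out.println(tokenCount("Who lives in a pineapple under the sea"));
    }
}
```

Because the spaces are gone by the time `StringTokenizer` sees the string, a non-empty line always yields exactly one token.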

And the reducer:

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class CharacterCountReducer extends
        Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int charCount = 0;
        for (IntWritable val: values) {
            charCount += val.get();
        }
        context.write(key, new IntWritable(charCount));
    }
}

It looks fine; I generate the runnable jar file from my IDE and execute it as follows:

$ ./hadoop jar ~/Desktop/example_MapReduce.jar no.hib.mod250.hadoop.CharacterCountDriver
14/11/27 19:36:42 INFO Configuration.deprecation: session.id is deprecated. Instead, use dfs.metrics.session-id
14/11/27 19:36:42 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
14/11/27 19:36:42 INFO input.FileInputFormat: Total input paths to process : 1
14/11/27 19:36:42 INFO mapreduce.JobSubmitter: number of splits:1
14/11/27 19:36:43 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_local316715466_0001
14/11/27 19:36:43 WARN conf.Configuration: file:/tmp/hadoop-roberto/mapred/staging/roberto316715466/.staging/job_local316715466_0001/job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval;  Ignoring.
14/11/27 19:36:43 WARN conf.Configuration: file:/tmp/hadoop-roberto/mapred/staging/roberto316715466/.staging/job_local316715466_0001/job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts;  Ignoring.
14/11/27 19:36:43 WARN conf.Configuration: file:/tmp/hadoop-roberto/mapred/local/localRunner/roberto/job_local316715466_0001/job_local316715466_0001.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval;  Ignoring.
14/11/27 19:36:43 WARN conf.Configuration: file:/tmp/hadoop-roberto/mapred/local/localRunner/roberto/job_local316715466_0001/job_local316715466_0001.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts;  Ignoring.
14/11/27 19:36:43 INFO mapreduce.Job: The url to track the job: http://localhost:8080/
14/11/27 19:36:43 INFO mapred.LocalJobRunner: OutputCommitter set in config null
14/11/27 19:36:43 INFO mapred.LocalJobRunner: OutputCommitter is org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter
14/11/27 19:36:43 INFO mapred.LocalJobRunner: Waiting for map tasks
14/11/27 19:36:43 INFO mapred.LocalJobRunner: Starting task: attempt_local316715466_0001_m_000000_0
14/11/27 19:36:43 INFO mapred.Task:  Using ResourceCalculatorProcessTree : [ ]
14/11/27 19:36:43 INFO mapred.MapTask: Processing split: file:/tmp/filein:0+434
14/11/27 19:36:43 INFO mapred.MapTask: Map output collector class = org.apache.hadoop.mapred.MapTask$MapOutputBuffer

Then I guess that my output file will be in /tmp/fileout. But instead, it seems empty:

$ tree /tmp/fileout/
/tmp/fileout/
└── _temporary
    └── 0

2 directories, 0 files

Is there anything I'm missing? Can anyone help me out?

Regards :-)

Edit:

I almost found a solution in this other post.

Within CharacterCountDriver, I substituted job.submit() with job.waitForCompletion(true). I'm getting more verbose output:

/tmp/fileout/
├── part-r-00000
└── _SUCCESS

0 directories, 2 files

But I still don't know how to read these; _SUCCESS is empty and part-r-00000 is not what I was expecting:

Absorbantandyellowandporousishe 1
AreyoureadykidsAyeAyeCaptain    1
ICanthearyouAYEAYECAPTAIN       1
Ifnauticalnonsensebesomethingyouwish    1
Ohh     1
READY   1
SPONGEBOBSQUAREPANTS    1
SpongebobSquarepants    3
Spongebobsquarepants    4
Thendroponthedeckandfloplikeafish       1
Wholivesinapineappleunderthesea 1

Any advice? Is there maybe a mistake in my code? Thanks.

part-r-00000 is the name of your reducer output file. If you have more reducers, they would be numbered part-r-00001 and so on.

If I understand correctly, you want your program to count the alphanumeric characters in the input file(s). However, this is NOT what your code is doing. You can change your mapper to count the alphanumeric characters in each line:

String strValue = value.toString();
// replaceAll returns a new string; it does not modify the original
strValue = strValue.replaceAll("[^a-zA-Z0-9]", "");
context.write(new Text("alphanumeric"), new IntWritable(strValue.length()));

This should fix your program. Currently, your mappers output the alphanumeric characters of each line as the key, and the reducer accumulates the counts per key. With my change, you only use one key: "alphanumeric". The key could be something else, and it would still work.
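To sanity-check this counting logic outside Hadoop, here is a minimal plain-Java sketch (the class and method names are invented for illustration):

```java
public class AlphanumericCount {

    // Same logic as the corrected mapper: strip everything that is not
    // a letter or digit, then count what remains in the line.
    static int count(String line) {
        return line.replaceAll("[^a-zA-Z0-9]", "").length();
    }

    public static void main(String[] args) {
        System.out.println(count("Who lives in a pineapple under the sea?"));
    }
}
```

Since every mapper emission uses the single "alphanumeric" key, the reducer's sum over all lines yields the total count for the whole file.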
