Why is Hadoop Map-Reduce application processing the same data in two different reduce tasks?

I am working with the Hadoop MapReduce framework and following the book Hadoop: The Definitive Guide.

As described in the book, I have implemented a MapReduce job that reads each input file as a whole and delegates the output to SequenceFileOutputFormat. Here are the classes I have implemented:

SmallFilesToSequenceFileConverter.java

public class SmallFilesToSequenceFileConverter extends Configured implements Tool {
    static class SequenceFileMapper extends Mapper<NullWritable, BytesWritable, Text, BytesWritable>{
        private Text filenameKey;

        @Override
        protected void setup(Mapper<NullWritable, BytesWritable, Text, BytesWritable>.Context context)
                throws IOException, InterruptedException {
            // Use the name of the file backing this split as the key for every record

            InputSplit split = context.getInputSplit();
            Path path = ((FileSplit)split).getPath();
            filenameKey = new Text(path.getName());

        }

        @Override
        protected void map(NullWritable key, BytesWritable value,
                Mapper<NullWritable, BytesWritable, Text, BytesWritable>.Context context)
                throws IOException, InterruptedException {
            // Emit the entire file content, keyed by the file name
            context.write(filenameKey, value);
        }
    }

    public int run(String[] args) throws Exception {
        Job job = new Job(getConf());

        job.setInputFormatClass(WholeFileInputFormat.class);
        job.setOutputFormatClass(SequenceFileOutputFormat.class);

        WholeFileInputFormat.setInputPaths(job, new Path(args[0]));
        SequenceFileOutputFormat.setOutputPath(job, new Path(args[1]));

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(BytesWritable.class);

        job.setMapperClass(SequenceFileMapper.class);
        job.setNumReduceTasks(2);

        return job.waitForCompletion(true) ? 0 : 1;

    }

    public static void main(String[] args) throws Exception{

        String argg[] = {"/Users/bng/Documents/hadoop/inputFromBook/smallFiles",
        "/Users/bng/Documents/hadoop/output_SmallFilesToSequenceFileConverter"}; 

        int exitcode = ToolRunner.run(new SmallFilesToSequenceFileConverter(), argg);
        System.exit(exitcode);
    }
}

WholeFileInputFormat.java

public class WholeFileInputFormat extends FileInputFormat<NullWritable, BytesWritable> {

    // Never split a file; each file is handed to a single mapper as one record
    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false;
    }

    @Override
      public RecordReader<NullWritable, BytesWritable> createRecordReader(
          InputSplit split, TaskAttemptContext context) throws IOException,
          InterruptedException {
        WholeFileRecordReader reader = new WholeFileRecordReader();
        reader.initialize(split, context);
        return reader;
      }
}

WholeFileRecordReader.java

public class WholeFileRecordReader extends RecordReader<NullWritable, BytesWritable> {
    private FileSplit fileSplit;
    private Configuration conf;
    private BytesWritable value = new BytesWritable();
    private boolean processed = false;

    @Override
    public void initialize(InputSplit split, TaskAttemptContext context) throws IOException, InterruptedException {
        // Keep the split and configuration so nextKeyValue() can open the file
        this.fileSplit = (FileSplit) split;
        this.conf = context.getConfiguration();
    }

    @Override
    public boolean nextKeyValue() throws IOException, InterruptedException {

        if(!processed){
            byte[] contents = new byte[(int)fileSplit.getLength()];
            Path file = fileSplit.getPath();
            FileSystem fs = file.getFileSystem(conf);
            FSDataInputStream in = null;
            try{
                in = fs.open(file);
                IOUtils.readFully(in, contents, 0, contents.length);
                value.set(contents, 0, contents.length);
            }catch(Exception e){
                e.printStackTrace();
            }finally{
                IOUtils.closeStream(in);
            }
            processed = true;
            return true;
        }
        return false;
    }

    @Override
    public NullWritable getCurrentKey() throws IOException, InterruptedException {
        // Keys are not meaningful here; every record uses NullWritable
        return NullWritable.get();
    }

    @Override
    public BytesWritable getCurrentValue() throws IOException, InterruptedException {
        // The value holds the complete content of the file
        return value;
    }

    @Override
    public float getProgress() throws IOException, InterruptedException {
        // Either nothing or the whole file has been processed
        return processed ? 1.0f : 0.0f;
    }

    @Override
    public void close() throws IOException {
    }

}

As shown in SmallFilesToSequenceFileConverter.java, when I use a single reduce task everything works fine and I get the output I expect:

//part-r-00000
SEQorg.apache.hadoop.io.Text"org.apache.hadoop.io.BytesWritable������xd[^•MÈÔg…h#Ÿa������a���
aaaaaaaaaa������b���
bbbbbbbbbb������c���
cccccccccc������d���
dddddddddd������dummy���ffffffffff
������e����������f���
ffffffffff

But the problem is that when I use two reduce tasks, some of the data is processed by both of them. With two reduce tasks, here is the output:

//part-r-00000
SEQorg.apache.hadoop.io.Text"org.apache.hadoop.io.BytesWritable������ÓÙE˜xØÏXØâÆU.êÚ������a���
aaaaaaaaaa������b�
bbbbbbbbbb������c
cccccccccc������e����

//part-r-00001
SEQorg.apache.hadoop.io.Text"org.apache.hadoop.io.BytesWritable������π¸ú∞8Á8˜lÍx∞:¿������b���
bbbbbbbbbb������d���
dddddddddd������dummy���ffffffffff
������f���
ffffffffff

This shows that the data "bbbbbbbbbb" is processed by both reduce tasks. What could be the problem here? Is this result acceptable, or am I making a mistake somewhere?
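From my understanding, since no custom partitioner is configured, the default HashPartitioner should decide which reduce task receives each filename key, so a key such as "b" should end up in exactly one partition. Below is a rough sketch of that routing, just to illustrate what I expected; the PartitionCheck class and its main method are purely for illustration.

import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.partition.HashPartitioner;

// Illustrative only: shows how the default HashPartitioner routes each filename key
// when job.setNumReduceTasks(2) is set and no custom partitioner is configured.
public class PartitionCheck {
    public static void main(String[] args) {
        HashPartitioner<Text, BytesWritable> partitioner = new HashPartitioner<>();
        int numReduceTasks = 2;
        for (String name : new String[] {"a", "b", "c", "d", "dummy", "e", "f"}) {
            // The value is ignored by HashPartitioner, so null is fine here
            int partition = partitioner.getPartition(new Text(name), null, numReduceTasks);
            System.out.println(name + " -> part-r-0000" + partition);
        }
    }
}

If that expectation holds, the two part files should contain disjoint sets of keys, which is why seeing "bbbbbbbbbb" in both looks wrong to me.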

For reference, the input directory contains six input files, named a to f, each containing data corresponding to its file name, e.g. the file named a contains the data "aaaaaaaaaaa", and the other files contain similar data, except for the file e, which is empty. There is also a file named dummy, which contains the data "ffffffffff".

I haven't figured out the exact reason for this.

But deleting the namenode and datanode directories specified in hdfs-site.xml and restarting the HDFS, YARN and MapReduce services solved the issue for me.
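In case it helps anyone hitting the same symptom, a rough way to double-check which keys ended up in which partition is to read the part files back with SequenceFile.Reader. The ReadParts class below is only an illustrative sketch and assumes the output directory used in the job above.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

// Rough sketch: reads back each reducer's output file and prints its keys,
// making it easy to see whether any filename key appears in more than one partition.
public class ReadParts {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        String dir = "/Users/bng/Documents/hadoop/output_SmallFilesToSequenceFileConverter";
        for (String part : new String[] {"part-r-00000", "part-r-00001"}) {
            Path path = new Path(dir, part);
            try (SequenceFile.Reader reader = new SequenceFile.Reader(
                    conf, SequenceFile.Reader.file(path))) {
                Text key = new Text();
                BytesWritable value = new BytesWritable();
                while (reader.next(key, value)) {
                    System.out.println(part + "\t" + key + "\t" + value.getLength() + " bytes");
                }
            }
        }
    }
}

Running hdfs dfs -text on each part file should show the same key-to-partition mapping.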
