简体   繁体   中英

How to format the output being written by Mapreduce in Hadoop

I am trying to reverse the contents of the file by each word. I have the program running fine, but the output i am getting is something like this

1   dwp
2   seviG
3   eht
4   tnerruc
5   gnikdrow
6   yrotcerid
7   ridkm
8   desU
9   ot
10  etaerc

I want the output to be something like this

dwp seviG eht tnerruc gnikdrow yrotcerid ridkm desU
ot etaerc

The code i am working with

    import java.io.IOException;
    import java.util.*;

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.conf.*;
    import org.apache.hadoop.io.*;
    import org.apache.hadoop.mapred.*;
    import org.apache.hadoop.util.*;

    public class Reproduce {

    public static int temp =0;
    public static class ReproduceMap extends MapReduceBase implements Mapper<LongWritable, Text, IntWritable, Text>{
        private Text word = new Text();
        @Override
        public void map(LongWritable arg0, Text value,
                OutputCollector<IntWritable, Text> output, Reporter reporter)
                throws IOException {
            String line = value.toString().concat("\n");
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(new StringBuffer(tokenizer.nextToken()).reverse().toString());
                temp++;
                output.collect(new IntWritable(temp),word);
              }

        }

    }

    public static class ReproduceReduce extends MapReduceBase implements Reducer<IntWritable, Text, IntWritable, Text>{

        @Override
        public void reduce(IntWritable arg0, Iterator<Text> arg1,
                OutputCollector<IntWritable, Text> arg2, Reporter arg3)
                throws IOException {
            String word = arg1.next().toString();
            Text word1 = new Text();
            word1.set(word);
            arg2.collect(arg0, word1);

        }

    }

    public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(WordCount.class);
    conf.setJobName("wordcount");

    conf.setOutputKeyClass(IntWritable.class);
    conf.setOutputValueClass(Text.class);

    conf.setMapperClass(ReproduceMap.class);
    conf.setReducerClass(ReproduceReduce.class);

    conf.setInputFormat(TextInputFormat.class);
    conf.setOutputFormat(TextOutputFormat.class);

    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    JobClient.runJob(conf);

  }
}

How do i modify my output instead of writing another java program to do that

Thanks in advance

Here is a simple code demonstrate the use of custom FileoutputFormat

public class MyTextOutputFormat extends FileOutputFormat<Text, List<IntWritable>> {
      @Override
      public org.apache.hadoop.mapreduce.RecordWriter<Text, List<Intwritable>> getRecordWriter(TaskAttemptContext arg0) throws IOException, InterruptedException {
         //get the current path
         Path path = FileOutputFormat.getOutputPath(arg0);
         //create the full path with the output directory plus our filename
         Path fullPath = new Path(path, "result.txt");
     //create the file in the file system
     FileSystem fs = path.getFileSystem(arg0.getConfiguration());
     FSDataOutputStream fileOut = fs.create(fullPath, arg0);

     //create our record writer with the new file
     return new MyCustomRecordWriter(fileOut);
  }
}

public class MyCustomRecordWriter extends RecordWriter<Text, List<IntWritable>> {
    private DataOutputStream out;

    public MyCustomRecordWriter(DataOutputStream stream) {
        out = stream;
        try {
            out.writeBytes("results:\r\n");
        }
        catch (Exception ex) {
        }  
    }

    @Override
    public void close(TaskAttemptContext arg0) throws IOException, InterruptedException {
        //close our file
        out.close();
    }

    @Override
    public void write(Text arg0, List arg1) throws IOException, InterruptedException {
        //write out our key
        out.writeBytes(arg0.toString() + ": ");
        //loop through all values associated with our key and write them with commas between
        for (int i=0; i<arg1.size(); i++) {
            if (i>0)
                out.writeBytes(",");
            out.writeBytes(String.valueOf(arg1.get(i)));
        }
        out.writeBytes("\r\n");  
    }
}

Finally we need to tell our job about our ouput format and the path before running it.

job.setOutputKeyClass(Text.class);
job.setOutputValueClass(ArrayList.class);
job.setOutputFormatClass(MyTextOutputFormat.class);
FileOutputFormat.setOutputPath(job, new Path("/home/hadoop/out"));

我们可以通过编写自定义文件outputformat类来自定义输出

you can use NullWritable as a output value. NullWritable is just a placeholder Since you don't want number to be displayed as a part of your output. I have given modified reducer class. Note :- need to add import statement for NullWritable

public static class ReproduceReduce extends MapReduceBase implements Reducer<IntWritable, Text,  Text, NullWritable>{

            @Override
            public void reduce(IntWritable arg0, Iterator<Text> arg1,
                    OutputCollector<Text, NullWritable> arg2, Reporter arg3)
                    throws IOException {
                String word = arg1.next().toString();
                Text word1 = new Text();
                word1.set(word);
                arg2.collect(word1, new NullWritable());

            }

        }

and change the driver class or main method

conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(NullWritable.class);

In Mapper key temp is incremented for each word value, So each word is processed as a separate key-value pair.

Below steps should solve the problem 1) In Mapper just remove the temp++, so that all the reversed words will have the key as 0 (temp =0).

2) Reducer receives the key 0 and list of reversed strings. In reducer set the key to NullWritable and write the output.

What you can try is take one constant key (or simply nullwritable) and pass this as a key and your complete line as a value(you can reverse it in mapper class or you can also reverse it in the reducer class as well). so your reducer will receive a constant key (or place holder if you have used nullwritable as a key) and complete line. Now you can simply reverse the line and write it to output file. By not using tmp as a key you avoid writing unwanted numbers in your output file.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM