
How to format the output being written by MapReduce in Hadoop

I am trying to reverse the contents of the file word by word. I have the program running fine, but the output I am getting is something like this:

1   dwp
2   seviG
3   eht
4   tnerruc
5   gnikdrow
6   yrotcerid
7   ridkm
8   desU
9   ot
10  etaerc

I want the output to be something like this:

dwp seviG eht tnerruc gnikdrow yrotcerid ridkm desU
ot etaerc

The code I am working with:

import java.io.IOException;
import java.util.*;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.*;

public class Reproduce {

    // shared counter used as the output key for every reversed word
    public static int temp = 0;

    public static class ReproduceMap extends MapReduceBase implements Mapper<LongWritable, Text, IntWritable, Text> {
        private Text word = new Text();

        @Override
        public void map(LongWritable arg0, Text value,
                OutputCollector<IntWritable, Text> output, Reporter reporter)
                throws IOException {
            String line = value.toString().concat("\n");
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(new StringBuffer(tokenizer.nextToken()).reverse().toString());
                temp++;
                output.collect(new IntWritable(temp), word);
            }
        }
    }

    public static class ReproduceReduce extends MapReduceBase implements Reducer<IntWritable, Text, IntWritable, Text> {

        @Override
        public void reduce(IntWritable arg0, Iterator<Text> arg1,
                OutputCollector<IntWritable, Text> arg2, Reporter arg3)
                throws IOException {
            // each key carries exactly one reversed word
            String word = arg1.next().toString();
            Text word1 = new Text();
            word1.set(word);
            arg2.collect(arg0, word1);
        }
    }

    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(Reproduce.class);
        conf.setJobName("wordcount");

        conf.setOutputKeyClass(IntWritable.class);
        conf.setOutputValueClass(Text.class);

        conf.setMapperClass(ReproduceMap.class);
        conf.setReducerClass(ReproduceReduce.class);

        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);

        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        JobClient.runJob(conf);
    }
}

How do I modify my output instead of writing another Java program to do that?

Thanks in advance

Here is a simple example demonstrating the use of a custom FileOutputFormat:

public class MyTextOutputFormat extends FileOutputFormat<Text, List<IntWritable>> {
    @Override
    public org.apache.hadoop.mapreduce.RecordWriter<Text, List<IntWritable>> getRecordWriter(TaskAttemptContext arg0) throws IOException, InterruptedException {
        //get the current path
        Path path = FileOutputFormat.getOutputPath(arg0);
        //create the full path with the output directory plus our filename
        Path fullPath = new Path(path, "result.txt");
        //create the file in the file system
        FileSystem fs = path.getFileSystem(arg0.getConfiguration());
        FSDataOutputStream fileOut = fs.create(fullPath, arg0);

        //create our record writer with the new file
        return new MyCustomRecordWriter(fileOut);
    }
}

public class MyCustomRecordWriter extends RecordWriter<Text, List<IntWritable>> {
    private DataOutputStream out;

    public MyCustomRecordWriter(DataOutputStream stream) {
        out = stream;
        try {
            out.writeBytes("results:\r\n");
        }
        catch (Exception ex) {
            //ignore failures while writing the header line
        }
    }

    @Override
    public void close(TaskAttemptContext arg0) throws IOException, InterruptedException {
        //close our file
        out.close();
    }

    @Override
    public void write(Text arg0, List<IntWritable> arg1) throws IOException, InterruptedException {
        //write out our key
        out.writeBytes(arg0.toString() + ": ");
        //loop through all values associated with our key and write them with commas between
        for (int i=0; i<arg1.size(); i++) {
            if (i>0)
                out.writeBytes(",");
            out.writeBytes(String.valueOf(arg1.get(i)));
        }
        out.writeBytes("\r\n");  
    }
}

Finally, we need to tell our job about our output format and the path before running it.

job.setOutputKeyClass(Text.class);
job.setOutputValueClass(ArrayList.class);
job.setOutputFormatClass(MyTextOutputFormat.class);
FileOutputFormat.setOutputPath(job, new Path("/home/hadoop/out"));

We can customize the output by writing a custom FileOutputFormat class.

You can use NullWritable as the output value. NullWritable is just a placeholder, since you don't want the number to be displayed as part of your output. I have given the modified reducer class below. Note: you need to add an import statement for NullWritable.

public static class ReproduceReduce extends MapReduceBase implements Reducer<IntWritable, Text, Text, NullWritable> {

    @Override
    public void reduce(IntWritable arg0, Iterator<Text> arg1,
            OutputCollector<Text, NullWritable> arg2, Reporter arg3)
            throws IOException {
        String word = arg1.next().toString();
        Text word1 = new Text();
        word1.set(word);
        // NullWritable.get() returns the shared singleton; its constructor is not public
        arg2.collect(word1, NullWritable.get());
    }
}

And change the driver class or main method:

conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(NullWritable.class);
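
Since the map output types (IntWritable key, Text value) now differ from the job's final output types (Text key, NullWritable value), the old-API driver presumably also needs the intermediate map output classes set explicitly; a minimal sketch of that extra configuration:

//assumed addition: the map output types no longer match the final output types set above
conf.setMapOutputKeyClass(IntWritable.class);
conf.setMapOutputValueClass(Text.class);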

In the Mapper, the key temp is incremented for each word value, so each word is processed as a separate key-value pair.

The steps below should solve the problem: 1) In the Mapper, just remove temp++, so that all the reversed words will have the key 0 (temp = 0).

2) The Reducer then receives the key 0 and the list of reversed strings. In the reducer, set the key to NullWritable and write the output, as in the sketch after these steps.
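
A minimal sketch of those two steps, staying with the old mapred API from the question (the reducer body below, which joins the reversed words with spaces into one line, is my own reading of "write the output"):

public static class ReproduceMap extends MapReduceBase implements Mapper<LongWritable, Text, IntWritable, Text> {
    private static final IntWritable ZERO = new IntWritable(0); //constant key, replaces temp++
    private final Text word = new Text();

    @Override
    public void map(LongWritable offset, Text value,
            OutputCollector<IntWritable, Text> output, Reporter reporter) throws IOException {
        StringTokenizer tokenizer = new StringTokenizer(value.toString());
        while (tokenizer.hasMoreTokens()) {
            word.set(new StringBuffer(tokenizer.nextToken()).reverse().toString());
            output.collect(ZERO, word); //every reversed word now shares the same key
        }
    }
}

public static class ReproduceReduce extends MapReduceBase implements Reducer<IntWritable, Text, Text, NullWritable> {
    @Override
    public void reduce(IntWritable key, Iterator<Text> values,
            OutputCollector<Text, NullWritable> output, Reporter reporter) throws IOException {
        //join all reversed words with spaces and emit them as one line of text
        StringBuilder line = new StringBuilder();
        while (values.hasNext()) {
            if (line.length() > 0) {
                line.append(' ');
            }
            line.append(values.next().toString());
        }
        //with a NullWritable value, TextOutputFormat writes only the key text, no trailing separator
        output.collect(new Text(line.toString()), NullWritable.get());
    }
}

The driver would then declare Text/NullWritable as the job output classes and IntWritable/Text as the map output classes, as noted in the previous answer.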

What you can try is to take one constant key (or simply NullWritable), pass it as the key, and pass your complete line as the value (you can reverse it in the mapper class, or in the reducer class as well). Your reducer will then receive a constant key (or a placeholder, if you used NullWritable as the key) and the complete line. Now you can simply reverse the line and write it to the output file. By not using temp as a key, you avoid writing unwanted numbers to your output file.
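
As a rough sketch of this variant (the class names LineReverseMap and LineReverseReduce and their details are my own, assuming the reversal is done in the mapper): the mapper rebuilds each input line with every word reversed and emits it under a NullWritable placeholder key, and the reducer simply writes each line back out.

public static class LineReverseMap extends MapReduceBase implements Mapper<LongWritable, Text, NullWritable, Text> {
    private final Text reversedLine = new Text();

    @Override
    public void map(LongWritable offset, Text value,
            OutputCollector<NullWritable, Text> output, Reporter reporter) throws IOException {
        //rebuild the line with each word reversed, keeping the original word order
        StringBuilder line = new StringBuilder();
        StringTokenizer tokenizer = new StringTokenizer(value.toString());
        while (tokenizer.hasMoreTokens()) {
            if (line.length() > 0) {
                line.append(' ');
            }
            line.append(new StringBuffer(tokenizer.nextToken()).reverse());
        }
        reversedLine.set(line.toString());
        output.collect(NullWritable.get(), reversedLine); //placeholder key, whole line as value
    }
}

public static class LineReverseReduce extends MapReduceBase implements Reducer<NullWritable, Text, Text, NullWritable> {
    @Override
    public void reduce(NullWritable key, Iterator<Text> values,
            OutputCollector<Text, NullWritable> output, Reporter reporter) throws IOException {
        while (values.hasNext()) {
            output.collect(new Text(values.next()), NullWritable.get()); //write each line through unchanged
        }
    }
}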
