
How to write avro output in hadoop map reduce?

I wrote a Hadoop word-count program which takes TextInputFormat input and is supposed to output the word counts in Avro format.

The Map-Reduce job runs fine, but the output of this job is readable with Unix commands such as more or vi. I was expecting this output to be unreadable, since Avro output is in binary format.

I have used a mapper only; there is no reducer. I just want to experiment with Avro, so I am not worried about memory or stack overflow. Following is the code of the mapper:

public class WordCountMapper extends Mapper<LongWritable, Text, AvroKey<String>, AvroValue<Integer>> {

    // Aggregates counts across all input records; flushed to the context in cleanup().
    private Map<String, Integer> wordCountMap = new HashMap<String, Integer>();

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String[] keys = value.toString().split("[\\s-*,\":]");
        for (String currentKey : keys) {
            int currentCount = 1;
            String currentToken = currentKey.trim().toLowerCase();
            if(wordCountMap.containsKey(currentToken)) {
                currentCount = wordCountMap.get(currentToken);
                currentCount++;
            }
            wordCountMap.put(currentToken, currentCount);
        }
        System.out.println("DEBUG : total number of unique words = " + wordCountMap.size());
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        for (Map.Entry<String, Integer> currentKeyValue : wordCountMap.entrySet()) {
            AvroKey<String> currentKey = new AvroKey<String>(currentKeyValue.getKey());
            AvroValue<Integer> currentValue = new AvroValue<Integer>(currentKeyValue.getValue());
            context.write(currentKey, currentValue);
        }
    }
}

and the driver code is as follows:

public int run(String[] args) throws Exception {

    Job avroJob = new Job(getConf());
    avroJob.setJarByClass(AvroWordCount.class);
    avroJob.setJobName("Avro word count");

    avroJob.setInputFormatClass(TextInputFormat.class);
    avroJob.setMapperClass(WordCountMapper.class);

    AvroJob.setInputKeySchema(avroJob, Schema.create(Type.INT));
    AvroJob.setInputValueSchema(avroJob, Schema.create(Type.STRING));

    AvroJob.setMapOutputKeySchema(avroJob, Schema.create(Type.STRING));
    AvroJob.setMapOutputValueSchema(avroJob, Schema.create(Type.INT));

    AvroJob.setOutputKeySchema(avroJob, Schema.create(Type.STRING));
    AvroJob.setOutputValueSchema(avroJob, Schema.create(Type.INT));


    FileInputFormat.addInputPath(avroJob, new Path(args[0]));
    FileOutputFormat.setOutputPath(avroJob, new Path(args[1]));

    return avroJob.waitForCompletion(true) ? 0 : 1;
}

I would like to know what the Avro output should look like and what I am doing wrong in this program.

The latest release of the Avro library includes an updated version of the ColorCount example, adapted for MRv2. I suggest you look at it and use the same pattern they use in the Reduce class, or just extend AvroMapper. Please note that using the Pair class instead of AvroKey + AvroValue is also essential for running Avro on Hadoop.
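
For illustration, here is how the driver above could be rewritten following the MRv2 ColorCount pattern (a minimal, untested sketch; it reuses the AvroWordCount and WordCountMapper names from the question). The key difference is the explicit setOutputFormatClass call: the original driver never sets an output format, so Hadoop falls back to the default TextOutputFormat, which would explain the human-readable output described in the question. The setInputKeySchema/setInputValueSchema calls are dropped, since they only apply to Avro input formats, not TextInputFormat.

import org.apache.avro.Schema;
import org.apache.avro.Schema.Type;
import org.apache.avro.mapreduce.AvroJob;
import org.apache.avro.mapreduce.AvroKeyValueOutputFormat;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public int run(String[] args) throws Exception {

    Job avroJob = new Job(getConf());
    avroJob.setJarByClass(AvroWordCount.class);
    avroJob.setJobName("Avro word count");

    // Plain text goes in, so no Avro input schemas are needed.
    avroJob.setInputFormatClass(TextInputFormat.class);
    avroJob.setMapperClass(WordCountMapper.class);

    // Schemas for the AvroKey<String> / AvroValue<Integer> pairs the mapper emits.
    AvroJob.setMapOutputKeySchema(avroJob, Schema.create(Type.STRING));
    AvroJob.setMapOutputValueSchema(avroJob, Schema.create(Type.INT));
    AvroJob.setOutputKeySchema(avroJob, Schema.create(Type.STRING));
    AvroJob.setOutputValueSchema(avroJob, Schema.create(Type.INT));

    // Without this line the job falls back to TextOutputFormat and the
    // output stays plain text rather than Avro container files.
    avroJob.setOutputFormatClass(AvroKeyValueOutputFormat.class);

    FileInputFormat.addInputPath(avroJob, new Path(args[0]));
    FileOutputFormat.setOutputPath(avroJob, new Path(args[1]));

    return avroJob.waitForCompletion(true) ? 0 : 1;
}

As for what the output looks like: AvroKeyValueOutputFormat writes Avro container files (part-m-00000.avro and so on) that start with the magic bytes "Obj", embed the writer schema in the file header, and store the records in binary. They can be dumped to JSON with avro-tools, e.g. java -jar avro-tools.jar tojson part-m-00000.avro.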
