
hadoop NaturalKeyGroupingComparator - What's happening in the Reducer?

I'm currently working on a Java EMR project where my key is composed of two Texts. I set the NaturalKeyGroupingComparator in one of my steps to compare only the left part of the key.
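For reference, here is a minimal sketch of what such a grouping comparator could look like. The Pair key class and its getLeft() accessor are assumptions on my part (only getRight() appears in the reducer below), so the real class may look different:

    import org.apache.hadoop.io.WritableComparable;
    import org.apache.hadoop.io.WritableComparator;

    // Groups records by the left Text of the composite key only, so every
    // (left, <anything>) record reaches the reducer within a single reduce() call.
    public class NaturalKeyGroupingComparator extends WritableComparator {

        protected NaturalKeyGroupingComparator() {
            super(Pair.class, true); // true => deserialize keys so compare() gets objects
        }

        @Override
        public int compare(WritableComparable a, WritableComparable b) {
            Pair p1 = (Pair) a;
            Pair p2 = (Pair) b;
            return p1.getLeft().compareTo(p2.getLeft()); // ignore the right part
        }
    }

It would be registered with job.setGroupingComparatorClass(NaturalKeyGroupingComparator.class); the sort comparator keeps ordering on the full key, which is presumably what makes the "*" records arrive first within each group.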

Now this is the Java code for the Reducer:

    public void reduce(Pair key, Iterable<Data> values, Context context)
            throws IOException, InterruptedException {

        int totalOccurrences = 0;
        for (Data value : values) {
            // While the key's right part is still "*", accumulate the occurrence counts.
            if (key.getRight().toString().equals("*")) {
                totalOccurrences += value.getOccurrences();
            } else {
                // For values whose key's right part is not "*" (the grouping comparator keeps
                // these records in the same group), stamp the accumulated total on the value.
                value.setCount(new IntWritable(totalOccurrences));
            }
        }
    }

Now everything is working perfectly fine as planned, but I don't understand what exactly is happening. How can the key change in the middle of the reduce run?

Your question is a good beginner's question :)

I have written about it here.

I guess the biggest thing to keep in mind is that the Iterable is not backed by a collection; it is computed on the fly as and when the next() method is invoked. Just keep this in mind.

Once you're done with the above post, here is the relevant code, if you're the "I want to see the code" kind of person.

// Line number 157

    if (hasMore) {
      nextKey = input.getKey();
      nextKeyIsSame = comparator.compare(currentRawKey.getBytes(), 0,
                                         currentRawKey.getLength(),
                                         nextKey.getData(),
                                         nextKey.getPosition(),
                                         nextKey.getLength() - nextKey.getPosition()
                                         ) == 0;
    } else {
      nextKeyIsSame = false;
    }

This is a snippet from ReduceContextImpl.

The method gets called each time you invoke next(). It basically checks whether the key is changing in the underlying stream; if not, it just hands you the next value (remember, the keys are sorted), otherwise it makes arrangements to call the reducer method again with a new key and a new Iterable.

The underlying stream is always a sequence of (key, value) pairs; ReduceContextImpl gives you the illusion/abstraction of it being a (key, collection-of-values) pair.
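To make the mechanics concrete, here is a toy model (deliberately simplified, not the actual Hadoop classes) of that illusion: next() advances the underlying stream and refreshes the single key object handed to reduce(), which is exactly why the key appears to change while you iterate over the values.

    import java.util.Iterator;
    import java.util.List;
    import java.util.Map;

    // Toy stand-in for ReduceContextImpl: one flat, sorted (key, value) stream
    // is exposed as "a key plus an Iterable of values". Advancing the Iterable
    // advances the stream and refreshes the shared key.
    class ToyReduceContext<K, V> {
        private final Iterator<Map.Entry<K, V>> stream; // underlying sorted stream
        private K currentKey;                            // the key object the reducer sees

        ToyReduceContext(List<Map.Entry<K, V>> sortedPairs) {
            this.stream = sortedPairs.iterator();
        }

        K getCurrentKey() {
            return currentKey;
        }

        // The real implementation also checks nextKeyIsSame to stop at group
        // boundaries; that detail is omitted here to keep the idea visible.
        Iterable<V> values() {
            return () -> new Iterator<V>() {
                public boolean hasNext() { return stream.hasNext(); }
                public V next() {
                    Map.Entry<K, V> pair = stream.next();
                    currentKey = pair.getKey(); // key is refreshed on every next()
                    return pair.getValue();
                }
            };
        }
    }

In the real code the key's Writable fields are overwritten in place during deserialization rather than swapped by reference, but the effect inside reduce() is the same: inspecting the key on each loop iteration, as the reducer above does, shows you the key that belongs to the current value.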

Like I said at the start...

The biggest thing to keep in mind is that the Iterable is not backed by a collection; it is computed on the fly as and when the next() method is invoked. Just keep this in mind.

This theme is common across the MapReduce framework: all computations are done on streams, and nothing is ever loaded entirely into memory. It took me a while to get this :) hence the eagerness to share it.

The reduce() method is executed once for every key group in the input to the reducer. In your case, when both Texts were used as part of the key, the keys were grouped on both Texts, so your output would be

KeyGroup1, count1

KeyGroup2, count2

Now, when the grouping is based only on the left part of the key, the grouping for the reducer also changes, producing an output of

 NewKeyGroup1, count1
 NewKeyGroup2, count2
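For example, suppose the reducer input (hypothetical sample data, already sorted) contains these composite keys:

    (apple, *)  (apple, a)  (apple, b)  (banana, *)  (banana, c)

With the default grouping on the whole key, reduce() would be called five times, once per distinct (left, right) pair. With the NaturalKeyGroupingComparator that compares only the left Text, reduce() is called twice, once for apple and once for banana, and each call's Iterable walks through every value whose key shares that left part, in sort order.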

For a deeper understanding, go through the Definitive Guide, Chapter 8, the section on Secondary Sort.
