Secondary sorting in Map-Reduce

I understand how the values of a particular key can be sorted before the key reaches the reducer. I learned that this is done by writing three components: a key comparator, a partitioner, and a value-grouping comparator.

Now, when the value grouping runs, it basically groups all the values associated with the natural key, right? So when it groups all the values for a natural key, what is the actual key that is sent to the reducer along with the set of sorted values? The natural key will have been associated with more than one entity (the second part of the composite key). Which composite key is sent to the reducer?
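For reference, the three pieces mentioned in the question can be sketched in plain Java with no Hadoop dependency. Every name below (CompositeKey, partitionFor, SORT_ORDER, GROUPING) is hypothetical and only illustrates the logic each piece carries; it is not the code from any particular job.

```java
import java.util.Comparator;

// Plain-Java sketch (no Hadoop dependency); all names here are hypothetical.
class SecondarySortPieces {

    // Composite key: the natural key plus the field the values are sorted by.
    static final class CompositeKey {
        final String naturalKey;
        final int secondary;

        CompositeKey(String naturalKey, int secondary) {
            this.naturalKey = naturalKey;
            this.secondary = secondary;
        }
    }

    // Partitioner logic: use only the natural key, so that every composite
    // key sharing a natural key is routed to the same reducer.
    static int partitionFor(CompositeKey key, int numReducers) {
        return (key.naturalKey.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    // Key comparator: order by natural key first, then by the secondary field.
    static final Comparator<CompositeKey> SORT_ORDER =
            Comparator.comparing((CompositeKey k) -> k.naturalKey)
                      .thenComparingInt(k -> k.secondary);

    // Value-grouping comparator: compare the natural key only, so all
    // composite keys with the same natural key fall into one reduce() call.
    static final Comparator<CompositeKey> GROUPING =
            Comparator.comparing((CompositeKey k) -> k.naturalKey);

    public static void main(String[] args) {
        CompositeKey a = new CompositeKey("user1", 5);
        CompositeKey b = new CompositeKey("user1", 9);
        System.out.println(partitionFor(a, 4) == partitionFor(b, 4)); // same reducer?
        System.out.println(SORT_ORDER.compare(a, b) < 0);             // a sorts before b?
        System.out.println(GROUPING.compare(a, b) == 0);              // same group?
    }
}
```

In an actual job using the new mapreduce API, the equivalent classes would typically be registered on the Job via setPartitionerClass, setSortComparatorClass, and setGroupingComparatorClass.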


This may be surprising, but each iteration over the values Iterable also updates the contents of the key object:

protected void reduce(K key, Iterable<V> values, Context context) {
    for (V value : values) {
        // key object contents will update for each iteration of this loop
    }
}

I know this works for the new mapreduce API; I haven't traced it for the old mapred API.

So, in answer to your question: all the keys will be available. The first key you see is the first key of the group in sort order.

EDIT: Some additional information on how and why this works:

The reducer uses two comparators to process the key/value pairs output by the map stage:

  • the key ordering comparator - This comparator is applied first and orders all the KV pairs. Conceptually, you are still dealing with the serialized bytes at this stage.
  • the key group comparator - This comparator is responsible for determining when the previous and current keys 'differ', which marks the boundary between one group of KV pairs and the next.
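The interplay of the two comparators can be illustrated with a small plain-Java sketch. It assumes keys are hypothetical {naturalKey, secondary} string pairs held in memory, whereas the real framework works on serialized bytes; the names ORDERING, GROUPING, and groups are made up for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Plain-Java sketch; keys are hypothetical {naturalKey, secondary} pairs.
class TwoComparatorsSketch {

    // Key ordering comparator: natural key first, then the secondary part.
    static final Comparator<String[]> ORDERING =
            Comparator.comparing((String[] k) -> k[0]).thenComparing(k -> k[1]);

    // Key group comparator: looks at the natural key only.
    static final Comparator<String[]> GROUPING =
            Comparator.comparing((String[] k) -> k[0]);

    // Sorts all keys, then starts a new group whenever the group comparator
    // reports that the previous and current keys differ.
    static List<List<String>> groups(List<String[]> keys) {
        List<String[]> sorted = new ArrayList<>(keys);
        sorted.sort(ORDERING);
        List<List<String>> result = new ArrayList<>();
        String[] previous = null;
        for (String[] current : sorted) {
            if (previous == null || GROUPING.compare(previous, current) != 0) {
                result.add(new ArrayList<>()); // boundary: begin a new group
            }
            result.get(result.size() - 1).add(current[0] + "/" + current[1]);
            previous = current;
        }
        return result;
    }

    public static void main(String[] args) {
        List<String[]> keys = Arrays.asList(
                new String[] {"b", "2"}, new String[] {"a", "3"},
                new String[] {"a", "1"}, new String[] {"b", "1"});
        System.out.println(groups(keys)); // prints [[a/1, a/3], [b/1, b/2]]
    }
}
```

Each inner list corresponds to one reduce call, and within it the composite keys appear in the order set by the ordering comparator.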

Under the hood, the references to the key and value objects never change; each call to Iterable.Iterator.next() advances the pointer in the underlying byte stream to the next KV pair. If the key group comparator determines that the current key bytes and the previous key bytes compare as the same key, then the hasNext method of the values Iterable.iterator() returns true; otherwise it returns false. If true is returned, the bytes are deserialized into the Key and Value instances for consumption in your reduce method.
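The behaviour described above can be imitated in plain Java. This is a simplified simulation, not Hadoop's implementation: records are hypothetical strings such as "a:1=x" (natural key, secondary part, value) standing in for the sorted byte stream, and Key, GroupValues, and keysSeenInFirstGroup are invented names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

// Plain-Java simulation, not Hadoop code; all names are hypothetical.
class ReusedKeySketch {

    // Mutable key, standing in for a key object that is deserialized in place.
    static final class Key {
        String natural;
        String secondary;

        // "Deserializes" the key part of a record such as "a:1=x".
        void readFrom(String record) {
            String[] parts = record.substring(0, record.indexOf('=')).split(":");
            natural = parts[0];
            secondary = parts[1];
        }
    }

    // Value iterator for the first group of an already sorted record list.
    static final class GroupValues implements Iterator<String> {
        private final List<String> records; // stands in for the byte stream
        private final Key key;              // the single key instance, reused
        private int next = 0;

        GroupValues(List<String> records, Key key) {
            this.records = records;
            this.key = key;
            if (!records.isEmpty()) {
                key.readFrom(records.get(0)); // first key of the group
            }
        }

        @Override
        public boolean hasNext() {
            if (next >= records.size()) return false;
            // Group check: the upcoming record belongs to this group only
            // while its natural key matches the current key's natural key.
            return records.get(next).split(":")[0].equals(key.natural);
        }

        @Override
        public String next() {
            if (!hasNext()) throw new NoSuchElementException();
            String record = records.get(next++);
            key.readFrom(record); // same Key object, new contents
            return record.substring(record.indexOf('=') + 1);
        }
    }

    // Mimics a reduce() body that inspects the key on every iteration.
    static List<String> keysSeenInFirstGroup(List<String> sortedRecords) {
        Key key = new Key();
        Iterator<String> values = new GroupValues(sortedRecords, key);
        List<String> seen = new ArrayList<>();
        while (values.hasNext()) {
            String value = values.next();
            seen.add(key.natural + ":" + key.secondary + "=" + value);
        }
        return seen;
    }

    public static void main(String[] args) {
        List<String> sorted = Arrays.asList("a:1=x", "a:2=y", "b:1=z");
        System.out.println(keysSeenInFirstGroup(sorted)); // prints [a:1=x, a:2=y]
    }
}
```

Because the same Key instance is refilled on every step, code that wants to keep an earlier composite key would need to copy its fields rather than hold on to the reference.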

