
Secondary Sorting: When does the sorting of values happen?

Original post: 2017-11-23 06:49:30. Tags: java, hadoop, mapreduce, bigdata, reducers

I have implemented secondary sorting for my requirement, but I need some clarity on how it works internally.

Given that sorting happens on the map side, I assume that all the (K, V) pairs in the spill files are ordered by key; in our case, by the composite key.

I would like to know how the values belonging to the same key, coming from many map output files, arrive at the reduce function in a specific order (as specified in the SortComparator) every single time.

If sorting happens on the map side and merging is done on the reducer side, how and when are the values belonging to a key from many map files arranged in a particular order before the reduce function starts?

1 Answer

Values are not sorted by default, only the keys. However, you can override Partitioner, SortComparator and GroupingComparator in a specific way that makes the Hadoop framework sort both keys and values any way you like. (An example of such a setup can be found in my article.) Beware that because value objects are typically much larger, jobs that order both keys and values will run much longer than jobs that sort only keys.
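To make the role of the three pieces concrete, here is a minimal sketch in plain Java, deliberately not using the Hadoop API. It assumes the usual secondary-sort arrangement: the field you want the values ordered by is copied into a composite key, the sort comparator looks at the whole composite key, and the partitioner and grouping comparator look only at the natural key. The names `CompositeKey`, `SORT`, `GROUP`, `partition` and `sortAndGroup`, as well as the sample data, are hypothetical and only illustrate the idea.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class SecondarySortSketch {

    // Hypothetical composite key: the natural key plus the field the values should be ordered by.
    static final class CompositeKey {
        final String naturalKey;
        final int order;

        CompositeKey(String naturalKey, int order) {
            this.naturalKey = naturalKey;
            this.order = order;
        }
    }

    // Stand-in for the SortComparator: natural key first, then the secondary field.
    static final Comparator<CompositeKey> SORT =
            Comparator.comparing((CompositeKey k) -> k.naturalKey)
                      .thenComparingInt(k -> k.order);

    // Stand-in for the GroupingComparator: only the natural key decides the group.
    static final Comparator<CompositeKey> GROUP =
            Comparator.comparing((CompositeKey k) -> k.naturalKey);

    // Stand-in for the Partitioner: hash only the natural key, so every composite
    // key with the same natural key is sent to the same reducer.
    static int partition(CompositeKey key, int numReducers) {
        return (key.naturalKey.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    // Sorts with SORT, then walks the sorted keys once and starts a new group
    // whenever GROUP reports a change, similar to one reduce() call per group.
    static Map<String, List<Integer>> sortAndGroup(List<CompositeKey> keys) {
        List<CompositeKey> sorted = new ArrayList<>(keys);
        sorted.sort(SORT);
        Map<String, List<Integer>> groups = new LinkedHashMap<>();
        CompositeKey previous = null;
        List<Integer> current = null;
        for (CompositeKey key : sorted) {
            if (previous == null || GROUP.compare(previous, key) != 0) {
                current = new ArrayList<>();
                groups.put(key.naturalKey, current);
            }
            current.add(key.order);
            previous = key;
        }
        return groups;
    }

    public static void main(String[] args) {
        // Pairs in the arbitrary order a mapper might emit them (made-up data).
        List<CompositeKey> emitted = Arrays.asList(
                new CompositeKey("b", 2), new CompositeKey("a", 3),
                new CompositeKey("a", 1), new CompositeKey("b", 1),
                new CompositeKey("a", 2));
        System.out.println(sortAndGroup(emitted)); // prints {a=[1, 2, 3], b=[1, 2]}
    }
}
```

In this sketch the values never get sorted as values: their order falls out of the key sort, because the ordering field is part of the key. Grouping on the natural key alone is what lets one group see all of them in that order.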

Keys are sorted in both mappers and reducers:

Mappers sort the KV pairs destined for each reducer, so each output file is sorted according to the SortComparator.
The reducer fetches many sorted files from the mappers and merges them together, providing the input to reduce() invocations.
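The reducer-side step can be pictured as a k-way merge of runs that are already sorted. The sketch below is plain Java and only conceptual, not Hadoop's actual merge code; `MergeSketch`, `Head` and `merge` are hypothetical names, and composite keys are written as strings of the form "naturalKey#order" with single-digit orders so that plain string ordering can stand in for the SortComparator.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

class MergeSketch {

    // One heap entry: the current head of a run plus an iterator over the rest of it.
    static final class Head<T> {
        final T value;
        final Iterator<T> rest;

        Head(T value, Iterator<T> rest) {
            this.value = value;
            this.rest = rest;
        }
    }

    // Merges runs that are each already sorted by 'order' into one list sorted by 'order'.
    static <T> List<T> merge(List<List<T>> sortedRuns, Comparator<T> order) {
        PriorityQueue<Head<T>> heap =
                new PriorityQueue<>((a, b) -> order.compare(a.value, b.value));
        for (List<T> run : sortedRuns) {
            Iterator<T> it = run.iterator();
            if (it.hasNext()) {
                heap.add(new Head<>(it.next(), it));
            }
        }
        List<T> merged = new ArrayList<>();
        while (!heap.isEmpty()) {
            // The smallest remaining element is always at the head of some run.
            Head<T> smallest = heap.poll();
            merged.add(smallest.value);
            if (smallest.rest.hasNext()) {
                heap.add(new Head<>(smallest.rest.next(), smallest.rest));
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        // Each inner list plays the role of one sorted map output file (made-up data).
        List<List<String>> runs = Arrays.asList(
                Arrays.asList("a#1", "a#4", "b#2"),
                Arrays.asList("a#2", "b#1"),
                Arrays.asList("a#3", "b#3"));
        System.out.println(merge(runs, Comparator.<String>naturalOrder()));
        // prints [a#1, a#2, a#3, a#4, b#1, b#2, b#3]
    }
}
```

Because the merge uses the same comparator that sorted each run, the merged stream is globally sorted no matter how many files contributed to it. With a composite key, that is why the entries for one natural key reach reduce() in the same order on every run.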

By default, values arrive at reduce() in an unspecified order. In general, that order depends on everything: the order in which you emit key/value pairs in map, the order in which Hadoop decides to merge files, the sort algorithm used, and so on.