Order of Mapper, Combiner, Partitioner, shuffle/sort

I found the following passage on page 206 of Hadoop: The Definitive Guide.

Before it writes to disk, the thread first divides the data into partitions corresponding to the reducers that they will ultimately be sent to. Within each partition, the background thread performs an in-memory sort by key, and if there is a combiner function, it is run on the output of the sort. Running the combiner function makes for a more compact map output, so there is less data to write to local disk and to transfer to the reducer.
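To make the quoted passage concrete, here is a minimal Python simulation of a single spill: partition, then sort by key within each partition, then run the combiner on the sorted output. This is only a sketch of the order of operations, not Hadoop code. The names `partition`, `sum_combiner`, and `spill` are hypothetical helpers made up for illustration, and the CRC-based partitioner is just a deterministic stand-in for a hash partitioner.

```python
import zlib
from itertools import groupby
from operator import itemgetter


def partition(key, num_reducers):
    # Hypothetical stand-in for a hash partitioner: maps a key to a reducer index.
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def sum_combiner(key, values):
    # Hypothetical combiner: collapses all values of one key into their sum.
    return key, sum(values)


def spill(buffered_records, num_reducers, combiner=None):
    # Step 1: divide the buffered map output into one partition per reducer.
    partitions = {p: [] for p in range(num_reducers)}
    for key, value in buffered_records:
        partitions[partition(key, num_reducers)].append((key, value))

    spilled = {}
    for p, records in partitions.items():
        # Step 2: in-memory sort by key within the partition.
        records.sort(key=itemgetter(0))
        # Step 3: if a combiner is configured, run it on the sorted output.
        if combiner is not None:
            records = [
                combiner(key, [value for _, value in group])
                for key, group in groupby(records, key=itemgetter(0))
            ]
        spilled[p] = records
    return spilled


map_output = [("b", 1), ("a", 1), ("b", 1), ("c", 1), ("a", 1)]
result = spill(map_output, num_reducers=2, combiner=sum_combiner)
for p, records in result.items():
    print(p, records)
```

In this sketch the combiner sees keys already grouped because the sort ran first, which is why the passage places it after the sort.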

So with this understanding, can I put the order as Mapper, Partitioner, shuffle/sort, Combiner?

I've written a good article about this: http://0x0fff.com/hadoop-mapreduce-comprehensive-description/ In general you are right, but there are many more corner cases: the combiner might be skipped for some of the records, for others it might run many times, and it might even be started on the reduce side before the reducer. So you are right in general, but things are much more complex.
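One practical consequence of the combiner running zero, one, or many times is that the job's result must not depend on how often it runs. The Python sketch below illustrates this with a sum, which is safe to apply repeatedly. It is a toy model, not Hadoop code, and `combine` and `reduce_all` are hypothetical names chosen for the example.

```python
from collections import defaultdict


def combine(records):
    # Hypothetical sum combiner: merges values per key, output sorted by key.
    totals = defaultdict(int)
    for key, value in records:
        totals[key] += value
    return sorted(totals.items())


def reduce_all(records):
    # Hypothetical reducer using the same summing logic as the combiner.
    return dict(combine(records))


records = [("a", 1), ("b", 1), ("a", 1), ("a", 1)]

# Combiner never runs: the reducer sees the raw map output.
never = reduce_all(records)
# Combiner runs once over the whole map output.
once = reduce_all(combine(records))
# Combiner runs per spill and then again when the spills are merged.
many = reduce_all(combine(combine(records[:2]) + combine(records[2:])))

print(never == once == many)
```

The three results agree because addition is associative and commutative. A combiner that computed, say, an average of raw values would not give the same answer in all three cases.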

