Hadoop Combiner Class for Text

I'm still trying to get an intuition as to when to use the Hadoop combiner class (I saw a few articles but they did not specifically help in my situation).

My question is, is it appropriate to use a combiner class when the value of the pair is of the Text class? For instance, let's say we have the following output from the mapper:

fruit apple
fruit orange
fruit banana
...
veggie carrot
veggie celery
...

Can we apply a combiner class here so that the output becomes:

fruit apple orange banana
...
veggie carrot celery
...

before it even reaches the reducer?

Combiners are typically suited to problems where you perform some form of aggregation (min, max, and so on) on the data. These values can be calculated in the combiner for the map output, and then calculated again in the reducer for all the combined outputs. This is useful because it means you are not transferring all the data across the network between the mappers and the reducer.
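As a rough illustration of that pattern, here is a plain-Python simulation of a max aggregation. It does not use the Hadoop API; the function names (`combine_max`, `reduce_max`) and the sample records are made up for the example.

```python
def combine_max(map_output):
    """Simulate a combiner: keep only the largest value per key for one mapper."""
    best = {}
    for key, value in map_output:
        if key not in best or value > best[key]:
            best[key] = value
    return sorted(best.items())


def reduce_max(combined_outputs):
    """Simulate the reducer: take the max again over every combiner's output."""
    best = {}
    for output in combined_outputs:
        for key, value in output:
            if key not in best or value > best[key]:
                best[key] = value
    return best


# Hypothetical (key, value) output of two mappers.
mapper_1 = [("fruit", 3), ("fruit", 7), ("veggie", 2)]
mapper_2 = [("fruit", 5), ("veggie", 9), ("veggie", 4)]

combined = [combine_max(mapper_1), combine_max(mapper_2)]
# Number of records that would cross the network after combining.
records_sent = sum(len(output) for output in combined)
result = reduce_max(combined)
```

In this toy run the two mappers emit six records, but only four reach the reducer, and the final maxima are the same as they would be without the combiner.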

Now there is no reason that you can't introduce a combiner to accumulate a list of the values observed for each key (I assume this is what your example shows), but there are some things which would make it trickier.

If you have to output <Text, Text> pairs from the mapper and consume <Text, Text> in the reducer, then your combiner can easily concatenate the list of values together and output this as a single Text value. In your reducer you can do the same: concatenate all the values together and form one big output.
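A minimal sketch of that idea, simulated in plain Python rather than written against the Hadoop API. The helper names (`concat_values`, `group_by_key`) are hypothetical, and a space is assumed as the separator. Because Hadoop does not guarantee how many times a combiner runs (possibly zero times), the same function is used for both steps so the reducer copes with raw and already-concatenated values alike.

```python
def concat_values(key, values):
    """Join every value seen for a key into one space-separated string.

    Input and output share the same (key, text) shape, so this one function
    can stand in for both the combiner and the reducer.
    """
    return key, " ".join(values)


def group_by_key(pairs):
    """Group (key, value) pairs by key, preserving first-seen order."""
    grouped = {}
    for key, value in pairs:
        grouped.setdefault(key, []).append(value)
    return grouped


# Hypothetical output of two mappers, using the data from the question.
mapper_1 = [("fruit", "apple"), ("fruit", "orange"), ("veggie", "carrot")]
mapper_2 = [("fruit", "banana"), ("veggie", "celery")]

# Combiner step: runs separately on each mapper's output.
combined = []
for mapper_output in (mapper_1, mapper_2):
    for key, values in group_by_key(mapper_output).items():
        combined.append(concat_values(key, values))

# Reducer step: runs once over everything the combiners produced.
final = dict(concat_values(k, v) for k, v in group_by_key(combined).items())
```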

You may run into a problem if you want to sort and dedup the output list, because the combiner / reducer logic would need to tokenize the Text object back into words, sort and dedup the list, and then rebuild the list of words.
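The tokenize, dedup, sort and rebuild steps could look roughly like this in plain Python. The function name `sorted_unique` is made up, and the sketch assumes that individual values never contain whitespace, since a space is used as the separator.

```python
def sorted_unique(values):
    """Split each value back into words, drop duplicates, and rebuild a sorted list."""
    words = set()
    for value in values:
        # A value may be a single word (from a mapper) or an already
        # concatenated list (from a combiner); split() handles both.
        words.update(value.split())
    return " ".join(sorted(words))


# Two simulated combiner runs over raw mapper values for the key "fruit".
combiner_out_1 = sorted_unique(["orange", "apple", "orange"])
combiner_out_2 = sorted_unique(["banana", "apple"])

# The reducer has to tokenize the combined strings again before merging.
reducer_out = sorted_unique([combiner_out_1, combiner_out_2])
```

The extra cost is visible in the last line: the reducer cannot simply append the strings it receives, it has to take them apart first.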

To directly answer your question of when it would be appropriate, I can think of some examples:

  • If you want to find the lexicographically smallest or largest value associated with each key
  • If you have millions of values for each key and you want to 'randomly' sample a small set of the values
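The first of these cases can be sketched in a few lines of plain Python. The function name `smallest_value` and the sample values are hypothetical, and Python's built-in string comparison stands in for whatever ordering the real job would use.

```python
def smallest_value(values):
    """Return the lexicographically smallest string among the values."""
    return min(values)


# Simulated combiner runs on two mappers for the same key.
combiner_out_1 = smallest_value(["orange", "banana", "cherry"])
combiner_out_2 = smallest_value(["apple", "pear"])

# The reducer applies the same function to the combiners' outputs.
reducer_out = smallest_value([combiner_out_1, combiner_out_2])
```

Each combiner forwards a single Text value per key, so the reducer sees one candidate per mapper instead of every value.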

A combiner class is used when the operation being applied is commutative and associative. Commutative example:

a*b*c = c*b*a. During the combine task, perform (a*b = d), c and then send the values d, c to the reducer. Without a combiner the reducer has to perform two operations (a*b = d, then d*c) to get the final answer. With a combiner, the reducer only needs to perform one: d*c.

Similarly for associative: (a+b)+c = a+(b+c). With associative (grouping) and commutative (moving around) operations, the result does not depend on the order or grouping in which you multiply or add. A combiner is mainly used for structured data whose aggregation obeys the associative and commutative laws.
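A small plain-Python check makes the point concrete. The values and the split into two partial groups are arbitrary. The mean is included only as a contrast: it is an operation that is not safe to combine in this naive way.

```python
from functools import reduce
from operator import add


def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)


values = [1, 2, 3, 4, 10]
# Pretend the values were produced by two different mappers.
split_a, split_b = values[:2], values[2:]

# Addition is associative and commutative: combining partial sums is safe.
total_direct = reduce(add, values)
total_combined = reduce(add, [reduce(add, split_a), reduce(add, split_b)])

# The mean is not: a mean of partial means differs from the true mean.
mean_direct = mean(values)
mean_of_means = mean([mean(split_a), mean(split_b)])
```

Here `total_direct` and `total_combined` agree, while `mean_of_means` does not match `mean_direct`, because the two groups have different sizes.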

Advantages of a combiner:

  • It reduces network I/O between the mappers and the reducer
  • It reduces disk I/O in the reducer, as part of the execution happens in the combiner
