
How to specify uberization of a Hive query in Hadoop2?

There is a new feature in Hadoop 2 called uberization. For example, this reference says:

Uberization is the possibility to run all tasks of a MapReduce job in the ApplicationMaster's JVM if the job is small enough. This way, you avoid the overhead of requesting containers from the ResourceManager and asking the NodeManagers to start (supposedly small) tasks.

What I can't tell is whether this just happens magically behind the scenes or whether you need to do something to make it happen. For example, when running a Hive query, is there a setting (or hint) to get this to happen? Can you specify the threshold for what is "small enough"?

Also, I'm having trouble finding much about this concept - does it go by another name?

I found details about "uber jobs" in the YARN book by Arun Murthy:

An Uber Job occurs when multiple mappers and reducers are combined to use a single container. There are four core settings around the configuration of Uber Jobs, found in the mapred-site.xml options presented in Table 9.3.

Here is table 9.3:

|-----------------------------------+------------------------------------------------------------|
| Property                          | Description                                                |
|-----------------------------------+------------------------------------------------------------|
| mapreduce.job.ubertask.enable     | Whether to enable the small-jobs "ubertask" optimization,  |
|                                   | which runs "sufficiently small" jobs sequentially within a |
|                                   | single JVM. "Small" is defined by the maxmaps, maxreduces, |
|                                   | and maxbytes settings. Users may override this value.      |
|                                   | Default = false.                                           |
|-----------------------------------+------------------------------------------------------------|
| mapreduce.job.ubertask.maxmaps    | Threshold for the number of maps beyond which the job is   |
|                                   | considered too big for the ubertasking optimization.       |
|                                   | Users may override this value, but only downward.          |
|                                   | Default = 9.                                               |
|-----------------------------------+------------------------------------------------------------|
| mapreduce.job.ubertask.maxreduces | Threshold for the number of reduces beyond which           |
|                                   | the job is considered too big for the ubertasking          |
|                                   | optimization. Currently the code cannot support more       |
|                                   | than one reduce and will ignore larger values. (Zero is    |
|                                   | a valid maximum, however.) Users may override this         |
|                                   | value, but only downward.                                  |
|                                   | Default = 1.                                               |
|-----------------------------------+------------------------------------------------------------|
| mapreduce.job.ubertask.maxbytes   | Threshold for the number of input bytes beyond             |
|                                   | which the job is considered too big for the uber-          |
|                                   | tasking optimization. If no value is specified,            |
|                                   | `dfs.block.size` is used as a default. Be sure to          |
|                                   | specify a default value in `mapred-site.xml` if the        |
|                                   | underlying file system is not HDFS. Users may override     |
|                                   | this value, but only downward.                             |
|                                   | Default = HDFS block size.                                 |
|-----------------------------------+------------------------------------------------------------|

I don't know yet if there is a Hive-specific way to set this or if you just use the above with Hive.
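
My working assumption (I haven't confirmed any Hive-specific mechanism) is that, since Hive on MapReduce submits ordinary MapReduce jobs, these properties can be set straight from the Hive session and are passed into each job's configuration. A sketch, with illustrative threshold values:

-- Enable the small-jobs "ubertask" optimization (default is false).
SET mapreduce.job.ubertask.enable=true;

-- Optionally lower the "small enough" thresholds; per Table 9.3 these
-- can only be overridden downward. The values here are illustrative.
SET mapreduce.job.ubertask.maxmaps=4;
SET mapreduce.job.ubertask.maxreduces=1;

-- e.g. 64 MB; if unset, this defaults to the DFS block size.
SET mapreduce.job.ubertask.maxbytes=67108864;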

An uber job occurs when the mappers and reducers of a job are combined to execute inside the ApplicationMaster. So, assuming the job to be executed has at most 9 mappers and at most 1 reducer (the default thresholds above), the ResourceManager (RM) creates an ApplicationMaster, and the job runs entirely within the ApplicationMaster's own JVM.

SET mapreduce.job.ubertask.enable=true;

So the advantage of an uberized job is that the round-trip overhead of the ApplicationMaster requesting containers from the ResourceManager (RM), and the RM allocating those containers back to the ApplicationMaster, is eliminated.
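
As a concrete sketch of what this looks like from Hive (untested; some_small_table is a hypothetical table small enough to fit the thresholds):

SET mapreduce.job.ubertask.enable=true;

-- If the input stays under the maxmaps/maxreduces/maxbytes thresholds,
-- the resulting MR job should run as an uber job inside the AM's JVM.
SELECT COUNT(*) FROM some_small_table;

Afterward, the job overview in the MapReduce JobHistory UI should show whether the job actually ran in uber mode (look for the "Uberized" field).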
