
How to specify uberization of a Hive query in Hadoop 2?

There is a new feature in Hadoop 2 called uberization. For example, this reference says:

Uberization is the possibility to run all tasks of a MapReduce job in the ApplicationMaster's JVM if the job is small enough. This way, you avoid the overhead of requesting containers from the ResourceManager and asking the NodeManagers to start (supposedly small) tasks.

What I can't tell is whether this just happens automatically behind the scenes, or whether you need to do something to make it happen. For example, when running a Hive query, is there a setting (or hint) to enable it? Can you specify the threshold for what counts as "small enough"?

Also, I'm having trouble finding much about this concept - does it go by another name?

I found details about "uber jobs" in the YARN book by Arun Murthy:

An uber job occurs when multiple mappers and reducers are combined to use a single container. There are four core settings for configuring uber jobs, found among the mapred-site.xml options presented in Table 9.3.

Here is Table 9.3:

|-----------------------------------+------------------------------------------------------------|
| Property                          | Description                                                |
|-----------------------------------+------------------------------------------------------------|
| mapreduce.job.ubertask.enable     | Whether to enable the small-jobs "ubertask" optimization,  |
|                                   | which runs "sufficiently small" jobs sequentially within a |
|                                   | single JVM. "Small" is defined by the maxmaps, maxreduces, |
|                                   | and maxbytes settings. Users may override this value.      |
|                                   | Default = false.                                           |
|-----------------------------------+------------------------------------------------------------|
| mapreduce.job.ubertask.maxmaps    | Threshold for the number of maps beyond which the job is   |
|                                   | considered too big for the ubertasking optimization.       |
|                                   | Users may override this value, but only downward.          |
|                                   | Default = 9.                                               |
|-----------------------------------+------------------------------------------------------------|
| mapreduce.job.ubertask.maxreduces | Threshold for the number of reduces beyond which           |
|                                   | the job is considered too big for the ubertasking          |
|                                   | optimization. Currently the code cannot support more       |
|                                   | than one reduce and will ignore larger values. (Zero is    |
|                                   | a valid maximum, however.) Users may override this         |
|                                   | value, but only downward.                                  |
|                                   | Default = 1.                                               |
|-----------------------------------+------------------------------------------------------------|
| mapreduce.job.ubertask.maxbytes   | Threshold for the number of input bytes beyond             |
|                                   | which the job is considered too big for the uber-          |
|                                   | tasking optimization. If no value is specified,            |
|                                   | `dfs.block.size` is used as a default. Be sure to          |
|                                   | specify a default value in `mapred-site.xml` if the        |
|                                   | underlying file system is not HDFS. Users may override     |
|                                   | this value, but only downward.                             |
|                                   | Default = HDFS block size.                                 |
|-----------------------------------+------------------------------------------------------------|

I don't know yet whether there is a Hive-specific way to set this, or whether you just use the settings above with Hive.
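Since these are plain MapReduce job properties, one plausible approach (assuming Hive is running on the MapReduce execution engine, which passes such properties through to the jobs it submits) is to set them per-session with Hive's SET command. This is a sketch; the table name is hypothetical:

```sql
-- Enable the small-jobs "ubertask" optimization for jobs launched by this session.
SET mapreduce.job.ubertask.enable=true;

-- Optionally tighten the "small enough" thresholds (you can only lower them, not raise them):
SET mapreduce.job.ubertask.maxmaps=4;          -- default 9
SET mapreduce.job.ubertask.maxreduces=1;       -- default 1 (the code supports at most 1)
SET mapreduce.job.ubertask.maxbytes=16000000;  -- default = HDFS block size

-- A query small enough to fall under all thresholds may now run uberized:
SELECT count(*) FROM my_small_table;           -- hypothetical table
```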

An uber job occurs when the mappers and reducers of a job are combined to execute inside the ApplicationMaster. So, assuming the job to be executed has at most 9 mappers and at most 1 reducer (the default thresholds), the ResourceManager (RM) creates an ApplicationMaster and the job executes entirely within the ApplicationMaster, using its own JVM.

SET mapreduce.job.ubertask.enable=true;

So, the advantage of an uberized job is that it eliminates the round-trip overhead of the ApplicationMaster requesting containers for the job from the ResourceManager (RM), and of the RM allocating those containers back to the ApplicationMaster.
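The decision described above can be sketched roughly as follows. This is a simplified illustration of how the four settings in Table 9.3 combine, not the actual Hadoop source; the 128 MB default for maxbytes assumes a typical HDFS block size:

```python
def is_uber_eligible(num_maps, num_reduces, input_bytes,
                     enable=False, maxmaps=9, maxreduces=1,
                     maxbytes=128 * 1024 * 1024):
    """Simplified sketch of the ubertask decision: a job runs inside the
    ApplicationMaster's JVM only if the optimization is enabled and the
    job falls under every threshold. Defaults mirror Table 9.3."""
    if not enable:
        return False
    # Per Table 9.3, the code cannot support more than one reduce,
    # so larger configured values are ignored.
    effective_maxreduces = min(maxreduces, 1)
    return (num_maps <= maxmaps
            and num_reduces <= effective_maxreduces
            and input_bytes <= maxbytes)

# A 4-map, 1-reduce job over 10 MB of input qualifies once enabled:
print(is_uber_eligible(4, 1, 10_000_000, enable=True))   # True
print(is_uber_eligible(4, 1, 10_000_000))                # False (disabled by default)
print(is_uber_eligible(20, 1, 10_000_000, enable=True))  # False (too many maps)
```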
