

How to increase hive concurrent mappers to more than 4?

Summary

When I run a simple select count(*) from table query in Hive, only two nodes in my large cluster are being used for mapping. I would like to use the whole cluster.

Details

I am using a somewhat large cluster (tens of nodes, each with more than 200 GB of RAM) running HDFS and Hive 1.2.1 (IBM-12).

I have a table of several billion rows. When I perform a simple

select count(*) from mytable;

Hive creates hundreds of map tasks, but only 4 run simultaneously.

This means that my cluster is mostly idle during the query, which seems wasteful. I have tried ssh'ing into the nodes in use, and they are not fully utilizing CPU or memory. Our cluster is backed by Infiniband networking and Isilon file storage, neither of which seems heavily loaded.

We are using MapReduce as the engine. I have tried removing any limits on resources that I could find, but it does not change the fact that only two nodes are being used (4 concurrent mappers).

The memory settings are as follows:

yarn.nodemanager.resource.memory-mb     188928  MB
yarn.scheduler.minimum-allocation-mb    20992   MB
yarn.scheduler.maximum-allocation-mb    188928  MB
yarn.app.mapreduce.am.resource.mb       20992   MB
mapreduce.map.memory.mb                 20992   MB
mapreduce.reduce.memory.mb              20992   MB

We are running on 41 nodes. By my calculation I should be able to get 41 * 188928 / 20992 = 369 map/reduce tasks. Instead I get 4.
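
For reference, the per-task values that feed this calculation can be echoed from the Hive CLI itself; a SET statement with no value prints the current setting (a minimal check, assuming the client picks up the cluster's mapred-site.xml and no session-level overrides are in place):

-- a SET with no value echoes the current setting, e.g. mapreduce.map.memory.mb=20992
set mapreduce.map.memory.mb;
set mapreduce.reduce.memory.mb;
set yarn.app.mapreduce.am.resource.mb;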

Vcore settings:

yarn.nodemanager.resource.cpu-vcores       24
yarn.scheduler.minimum-allocation-vcores   1
yarn.scheduler.maximum-allocation-vcores   24
yarn.app.mapreduce.am.resource.cpu-vcores  1
mapreduce.map.cpu.vcores                   1
mapreduce.reduce.cpu.vcores                1
  • Is there a way to get Hive/MapReduce to use more of my cluster?
  • How would I go about figuring out the bottleneck?
  • Could it be that YARN is not assigning tasks fast enough?

I guess that using Tez would improve performance, but I am still interested in why resource utilization is so limited (and we do not have it installed at the moment).

Running parallel tasks depends on your memory settings in YARN. For example, if you have 4 data nodes and your YARN memory properties are defined as below:

yarn.nodemanager.resource.memory-mb     1 GB
yarn.scheduler.minimum-allocation-mb    1 GB
yarn.scheduler.maximum-allocation-mb    1 GB
yarn.app.mapreduce.am.resource.mb       1 GB
mapreduce.map.memory.mb                 1 GB
mapreduce.reduce.memory.mb              1 GB

According to this setting you have 4 data nodes, so the total yarn.nodemanager.resource.memory-mb across the cluster is 4 GB that you can use to launch containers. Since each container takes 1 GB of memory, at any given point in time you can launch 4 containers. One of them will be used by the application master, so at most 3 mapper or reducer tasks can run at any given point in time, since the application master, mapper, and reducer each use 1 GB of memory.
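
Spelled out, the arithmetic for this example is:

total cluster memory    = 4 nodes x 1 GB            = 4 GB
containers that fit     = 4 GB / 1 GB per container = 4 containers
application master      = 1 container
concurrent map/reduce   = 4 - 1                     = 3 tasks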

So you need to increase yarn.nodemanager.resource.memory-mb to increase the number of map/reduce tasks.
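
yarn.nodemanager.resource.memory-mb itself is a NodeManager setting, so it is changed in yarn-site.xml and picked up after a NodeManager restart. A complementary per-job approach is to shrink the per-container request from the Hive session, so more containers fit into the memory each node already offers. The sketch below uses hypothetical sizes and assumes the cluster allows session-level overrides and that the smaller containers still fit the job's heap:

-- hypothetical sizes, not a recommendation; keep the JVM heap below the container size
set mapreduce.map.memory.mb=4096;
set mapreduce.map.java.opts=-Xmx3276m;
set mapreduce.reduce.memory.mb=4096;
set mapreduce.reduce.java.opts=-Xmx3276m;

select count(*) from mytable;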

PS - Here we are talking about the maximum number of tasks that can be launched; the actual number may be somewhat less than that.
