
How to run MapReduce tasks in parallel with Hadoop 2.x?

I would like my map and reduce tasks to run in parallel. However, despite trying every trick in the book, they still run sequentially. I read in How to set the precise max number of concurrently running tasks per node in Hadoop 2.4.0 on Elastic MapReduce that the number of tasks running in parallel per node is given by the following formula:

min (yarn.nodemanager.resource.memory-mb / mapreduce.[map|reduce].memory.mb, 
 yarn.nodemanager.resource.cpu-vcores / mapreduce.[map|reduce].cpu.vcores)
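
With my settings below, that formula gives min(131072 / 16384, 64 / 8) = min(8, 8) = 8, so I would expect eight map (or reduce) tasks to run concurrently per node.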

However, I did exactly that, as you can see from the yarn-site.xml and mapred-site.xml I am using below, yet the tasks still run sequentially. Note that I am using open-source Apache Hadoop, not Cloudera. Would switching to Cloudera solve the problem? Also note that my input files are big enough that dfs.block.size should not be an issue either.

yarn-site.xml

    <configuration>
      <property>
        <name>yarn.nodemanager.resource.memory-mb</name>
        <value>131072</value>
      </property>
      <property>
        <name>yarn.nodemanager.resource.cpu-vcores</name>
        <value>64</value>
      </property>
    </configuration>

mapred-site.xml

    <configuration>
      <property>
        <name>mapred.job.tracker</name>
        <value>localhost:9001</value>
      </property>

      <property>
        <name>mapreduce.map.memory.mb</name>
        <value>16384</value>
      </property>

      <property>
        <name>mapreduce.reduce.memory.mb</name>
        <value>16384</value>
      </property>

      <property>
        <name>mapreduce.map.cpu.vcores</name>
        <value>8</value>
      </property>

      <property>
        <name>mapreduce.reduce.cpu.vcores</name>
        <value>8</value>
      </property>
    </configuration>

A container is the logical execution unit reserved for running map/reduce tasks on each node of the cluster.

The yarn.nodemanager.resource.memory-mb property tells YARN how much RAM on the node may be reserved, in total, for the containers dispatched to execute map/reduce tasks. It is the upper bound on the combined memory of all containers running on that node.

But in your case, the free memory on the node is only about 11 GB, while you have configured yarn.nodemanager.resource.memory-mb to almost 128 GB (131072 MB), and mapreduce.map.memory.mb and mapreduce.reduce.memory.mb to 16 GB each. The required size for a map/reduce container is therefore 16 GB, which is more than the 11 GB of memory actually free. This could be why only one container was allocated on the node for execution.
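
In numbers: each container requests mapreduce.[map|reduce].memory.mb = 16384 MB, while only about 11 GB ≈ 11264 MB is free; since 16384 MB > 11264 MB, a single running container already more than accounts for the node's free RAM, leaving no headroom for a second one.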

You should reduce the mapreduce.map.memory.mb and mapreduce.reduce.memory.mb properties to well below the amount of free memory, so that more than one container can run in parallel.
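
For example (illustrative values only, assuming the node really has about 11 GB of memory to spare), shrinking each container to 2 GB in mapred-site.xml would leave room for several containers at once:

    <!-- Hypothetical values: 2 GB containers so several fit into ~11 GB of free RAM -->
    <property>
      <name>mapreduce.map.memory.mb</name>
      <value>2048</value>
    </property>
    <property>
      <name>mapreduce.reduce.memory.mb</name>
      <value>2048</value>
    </property>

With 2048 MB per container, roughly floor(11264 / 2048) = 5 containers fit into the free memory, so about five map tasks could run side by side.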

Also look into ways to increase the free memory on the node, since more than 90% of it is already in use.
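
On Linux, you can confirm how much memory is actually free on the node (standard OS tooling, nothing Hadoop-specific) with:

    free -m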

Hope this helps :)
