
Error on running multiple Workflow jobs in OOZIE-4.1.0

I installed Oozie 4.1.0 on a Linux machine by following the steps at http://gauravkohli.com/2014/08/26/apache-oozie-installation-on-hadoop-2-4-1/

hadoop version - 2.6.0 
maven - 3.0.4 
pig - 0.12.0

Cluster Setup -

MASTER NODE running - Namenode, Resourcemanager, Proxyserver.

SLAVE NODE running - Datanode, Nodemanager.

When I run a single workflow job, it succeeds. But when I try to run more than one workflow job, both jobs get stuck in the ACCEPTED state.

Inspecting the error log, I drilled the problem down to:

2014-12-24 21:00:36,758 [JobControl] INFO  org.apache.hadoop.ipc.Client  - Retrying connect to server: 172.16.***.***/172.16.***.***:8032. Already tried 9 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2014-12-25 09:30:39,145 [communication thread] INFO  org.apache.hadoop.ipc.Client  - Retrying connect to server: 172.16.***.***/172.16.***.***:52406. Already tried 9 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2014-12-25 09:30:39,199 [communication thread] INFO  org.apache.hadoop.mapred.Task  - Communication exception: java.io.IOException: Failed on local exception: java.net.SocketException: Network is unreachable: no further information; Host Details : local host is: "SystemName/127.0.0.1"; destination host is: "172.16.***.***":52406; 
 at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:764)
 at org.apache.hadoop.ipc.Client.call(Client.java:1415)
 at org.apache.hadoop.ipc.Client.call(Client.java:1364)
 at org.apache.hadoop.ipc.WritableRpcEngine$Invoker.invoke(WritableRpcEngine.java:231)
 at $Proxy9.ping(Unknown Source)
 at org.apache.hadoop.mapred.Task$TaskReporter.run(Task.java:742)
 at java.lang.Thread.run(Thread.java:722)
Caused by: java.net.SocketException: Network is unreachable: no further information
 at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
 at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:701)
 at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
 at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:529)
 at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:493)
 at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:606)
 at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:700)
 at org.apache.hadoop.ipc.Client$Connection.access$2800(Client.java:367)
 at org.apache.hadoop.ipc.Client.getConnection(Client.java:1463)
 at org.apache.hadoop.ipc.Client.call(Client.java:1382)
 ... 5 more

Heart beat
Heart beat
.
.

With the above jobs running, if I manually kill one of the launcher jobs (hadoop job -kill <launcher-job-id>), all the jobs succeed. So I think the problem is that when more than one launcher job runs simultaneously, the jobs hit a deadlock.

If anyone knows the reason and solution for the above problem, please help me out as soon as possible.

I tried the solution below and it works perfectly for me.

1) Change the Hadoop scheduler type from the capacity scheduler to the fair scheduler. On a small cluster, each queue is assigned a fixed memory size (2048 MB) to complete a single MapReduce job, so if more than one MapReduce job runs in the same queue, it hits a deadlock.

Solution: add the properties below to yarn-site.xml

  <property>
    <name>yarn.resourcemanager.scheduler.class</name>
    <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler</value>
  </property>
  <property>
    <name>yarn.scheduler.fair.allocation.file</name>
    <value>file:/%HADOOP_HOME%/etc/hadoop/fair-scheduler.xml</value>
  </property>
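
For reference, the allocation file that yarn.scheduler.fair.allocation.file points to could look roughly like the sketch below. The queue names and resource figures are only assumptions for illustration (launcher2 matches the queue used in the workflow configuration further down; launcher1 is an assumed second queue), not values from the original setup:

  <?xml version="1.0"?>
  <allocations>
    <!-- example queues only: launcher1 is an assumed name, launcher2 matches
         the queue referenced in the pig action configuration below -->
    <queue name="launcher1">
      <minResources>2048 mb,1 vcores</minResources>
      <maxRunningApps>5</maxRunningApps>
    </queue>
    <queue name="launcher2">
      <minResources>2048 mb,1 vcores</minResources>
      <maxRunningApps>5</maxRunningApps>
    </queue>
    <defaultQueueSchedulingPolicy>fair</defaultQueueSchedulingPolicy>
  </allocations>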

2) By default, Hadoop's total memory size is allotted as 8 GB.

So if we run two MapReduce programs, the memory used by Hadoop exceeds 8 GB and it hits a deadlock.

Solution: increase the total memory of the NodeManager using the following properties in yarn-site.xml

  <property>
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>20960</value>
  </property>
  <property>
    <name>yarn.scheduler.minimum-allocation-mb</name>
    <value>1024</value>
  </property>
  <property>
    <name>yarn.scheduler.maximum-allocation-mb</name>
    <value>2048</value>
  </property>

So if a user tries to run more than two MapReduce programs, he needs to add NodeManagers or increase Hadoop's total memory size (note: increasing the size reduces the memory left for the rest of the system). With 20960 MB of NodeManager memory and a 2048 MB maximum allocation per container, the properties above can run about 10 MapReduce programs concurrently.

The problem is with the queue: when we run the jobs in the SAME QUEUE (default) with the above cluster setup, the ResourceManager is responsible for running the MapReduce jobs on the slave node. Due to the lack of resources on the slave node, the jobs running in that queue hit a deadlock situation.

In order to overcome this issue, we need to split the MapReduce jobs by triggering them in different queues.


You can do this by setting this part in the pig action inside your Oozie workflow.xml:

<configuration>
  <property>
    <name>mapreduce.job.queuename</name>
    <value>launcher2</value>
  </property>
</configuration>
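
For context, here is a sketch of how that configuration block sits inside a complete pig action in workflow.xml; the action name, script name and the ${jobTracker}/${nameNode} parameters are placeholders, not taken from the original workflow:

<action name="pig-node-2">  <!-- placeholder action name -->
    <pig>
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <configuration>
            <property>
                <name>mapreduce.job.queuename</name>
                <value>launcher2</value>
            </property>
        </configuration>
        <script>script2.pig</script>  <!-- placeholder script name -->
    </pig>
    <ok to="end"/>
    <error to="fail"/>
</action>

The second workflow's pig action would point mapreduce.job.queuename at a different queue (for example launcher1), so the two jobs no longer compete for the same queue's resources. If the launcher job itself should also land in a separate queue, Oozie forwards any property prefixed with oozie.launcher. to the launcher (e.g. oozie.launcher.mapred.job.queue.name), though that is not part of the original answer.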

NOTE: This solution is only for a SMALL CLUSTER SETUP.
