
Hadoop streaming jobs SUCCEEDED but killed by ApplicationMaster


I just finished setting up a small Hadoop cluster (three Ubuntu machines running Apache Hadoop 2.2.0) and am now trying to run Python streaming jobs.
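
For context, the jobs are submitted with the standard streaming invocation, roughly along these lines (the jar path, script names, and HDFS paths below are illustrative placeholders rather than my exact command):

# ship mapper.py/reducer.py to the nodes with -files and run them as the map/reduce steps
$ hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-2.2.0.jar \
    -files mapper.py,reducer.py \
    -mapper mapper.py \
    -reducer reducer.py \
    -input /user/hduser/input \
    -output /user/hduser/output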

Running a test job, I encounter the following problem:
Almost all map tasks are marked as successful, but with a note saying Container killed.

On the web interface, the log for the map tasks says:
Progress 100.00
State SUCCEEDED

but under Note, for almost every attempt (~200), it says
Container killed by the ApplicationMaster.
or
Container killed by the ApplicationMaster. Container killed on request. Exit code is 143

In the log file associated with the attempt, I can see a line saying Task 'attempt_xxxxxxxxx_0' done.

I also get 3 attempts with the same log output; only those 3 have
State KILLED
and are listed under killed jobs.

stderr output is empty for all jobs/attempts.

Looking at the ApplicationMaster log and following one of the successful (but killed) attempts, I find the following log entries:

  • Transitioned from NEW to UNASSIGNED
  • Transitioned from UNASSIGNED to ASSIGNED
  • several progress updates, including: 1.0
  • Done acknowledgement
  • RUNNING to SUCCESS_CONTAINER_CLEANUP
  • CONTAINER_REMOTE_CLEANUP
  • KILLING attempt_xxxx
  • Transitioned from SUCCESS_CONTAINER_CLEANUP to SUCCEEDED
  • Task Transitioned from RUNNING to SUCCEEDED

All the attempts are numbered xxxx_0, so I assume they are not killed as a result of speculative execution.

Should I be worried about this? And what causes the containers to be killed? Any suggestions would be greatly appreciated!

Yes, I agree with @joshua. It seems to be a bug related to a task/container not dying gracefully after successfully finishing the map/reduce task. After the grace period, the ApplicationMaster has to kill it instead, which matches the Exit code is 143 (128 + SIGTERM) reported in the notes.

I am running Hadoop 2.5.0-cdh5.3.0 (as reported by 'yarn version').

I picked one of the tasks and grepped for its history in the logs generated for my MR application:

$ yarn logs -applicationId application_1422894000163_0003 | grep attempt_1422894000163_0003_r_000008_0

You will see that "attempt_1422894000163_0003_r_000008_0" goes through "TaskAttempt Transitioned from NEW to UNASSIGNED" ... "RUNNING to SUCCESS_CONTAINER_CLEANUP".

In the "SUCCESS_CONTAINER_CLEANUP" step, you will see messages about this container being killed. After this container is killed, the attempt goes into the "TaskAttempt Transitioned from SUCCESS_CONTAINER_CLEANUP to SUCCEEDED" step.
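
To get a sense of how many attempts ended this way, you can also grep the same aggregated logs for the exit code reported in the question, for example:

# count log lines reporting "Exit code is 143" (143 = 128 + SIGTERM)
$ yarn logs -applicationId application_1422894000163_0003 | grep -c "Exit code is 143"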

As far as I know, the same task can be run speculatively on several nodes. As soon as one node returns a result, the attempts on the other nodes are killed. That's why the job SUCCEEDED but individual task attempts are in the KILLED state.
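
To rule that out, you can switch speculative execution off explicitly when submitting the streaming job, roughly like this (the jar path and script/HDFS paths are placeholders):

# the generic -D options must come before the streaming-specific options
$ hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
    -D mapreduce.map.speculative=false \
    -D mapreduce.reduce.speculative=false \
    -files mapper.py,reducer.py \
    -mapper mapper.py -reducer reducer.py \
    -input /user/hduser/input -output /user/hduser/output

If the containers are still reported as killed with speculation disabled, speculative execution is not the cause.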

What version are you using? You may have encountered YARN-903 ("DistributedShell throwing Errors in logs after successfull completion").

This is a logging bug only (the ApplicationMaster is trying to stop containers that have already finished).
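
You can check exactly which release you are on, e.g.:

$ yarn version

and compare it against the fix versions listed on the YARN-903 JIRA.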
