
Hadoop streaming jobs SUCCEEDED but killed by ApplicationMaster


I just finished setting up a small Hadoop cluster (three Ubuntu machines running Apache Hadoop 2.2.0) and am now trying to run Python streaming jobs.
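
For context, the jobs are submitted with the standard streaming invocation, roughly along these lines (the jar path, script names, and HDFS paths below are illustrative placeholders rather than my exact command):

# ship mapper.py/reducer.py to the nodes with -files and run them as the map/reduce steps
$ hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-2.2.0.jar \
    -files mapper.py,reducer.py \
    -mapper mapper.py \
    -reducer reducer.py \
    -input /user/hduser/input \
    -output /user/hduser/output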

Running a test job, I encounter the following problem:
Almost all map tasks are marked as successful, but with a note saying Container killed.

On the web interface, the log for the map tasks says:
Progress 100.00
State SUCCEEDED

but under Note, for almost every attempt (~200), it says
Container killed by the ApplicationMaster.
or
Container killed by the ApplicationMaster. Container killed on request. Exit code is 143

In the log file associated with the attempt, I can see a line saying Task 'attempt_xxxxxxxxx_0' done.

I also get 3 attempts with the same log output; only those 3 have
State KILLED
and are listed under killed jobs.

stderr output is empty for all jobs/attempts.

Looking at the ApplicationMaster log and following one of the successful (but killed) attempts, I find the following log entries:

  • Transitioned from NEW to UNASSIGNED
  • Transitioned from UNASSIGNED to ASSIGNED
  • several progress updates, including: 1.0
  • Done acknowledgement
  • RUNNING to SUCCESS_CONTAINER_CLEANUP
  • CONTAINER_REMOTE_CLEANUP
  • KILLING attempt_xxxx
  • Transitioned from SUCCESS_CONTAINER_CLEANUP to SUCCEEDED
  • Task Transitioned from RUNNING to SUCCEEDED

All the attempts are numbered xxxx_0, so I assume they are not killed as a result of speculative execution.

Should I be worried about this? And what causes the containers to be killed? Any suggestions would be greatly appreciated!

Yes, I agree with @joshua. It seems to be a bug related to a task/container not dying gracefully after successfully finishing the map/reduce task. After the grace period, the ApplicationMaster has to kill it instead, which matches the Exit code is 143 (128 + SIGTERM) reported in the notes.

I am running Hadoop 2.5.0-cdh5.3.0 (as reported by 'yarn version').

I picked one of the tasks and grepped for its history in the logs generated for my MR application:

$ yarn logs -applicationId application_1422894000163_0003 | grep attempt_1422894000163_0003_r_000008_0

You will see that "attempt_1422894000163_0003_r_000008_0" goes through "TaskAttempt Transitioned from NEW to UNASSIGNED" ... "RUNNING to SUCCESS_CONTAINER_CLEANUP".

In the "SUCCESS_CONTAINER_CLEANUP" step, you will see messages about this container being killed. After this container is killed, the attempt goes into the "TaskAttempt Transitioned from SUCCESS_CONTAINER_CLEANUP to SUCCEEDED" step.
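
To get a sense of how many attempts ended this way, you can also grep the same aggregated logs for the exit code reported in the question, for example:

# count log lines reporting "Exit code is 143" (143 = 128 + SIGTERM)
$ yarn logs -applicationId application_1422894000163_0003 | grep -c "Exit code is 143"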

As far as I know, the same task can be run speculatively on several nodes. As soon as one node returns a result, the attempts on the other nodes are killed. That's why the job SUCCEEDED but individual task attempts are in the KILLED state.
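
To rule that out, you can switch speculative execution off explicitly when submitting the streaming job, roughly like this (the jar path and script/HDFS paths are placeholders):

# the generic -D options must come before the streaming-specific options
$ hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
    -D mapreduce.map.speculative=false \
    -D mapreduce.reduce.speculative=false \
    -files mapper.py,reducer.py \
    -mapper mapper.py -reducer reducer.py \
    -input /user/hduser/input -output /user/hduser/output

If the containers are still reported as killed with speculation disabled, speculative execution is not the cause.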

What version are you using? You may have encountered YARN-903 ("DistributedShell throwing Errors in logs after successfull completion").

This is a logging bug only (the ApplicationMaster is trying to stop containers that have already finished).
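
You can check exactly which release you are on, e.g.:

$ yarn version

and compare it against the fix versions listed on the YARN-903 JIRA.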
