
How to know if a MapReduce job has restarted or is a fresh start?

I have a MapReduce job which I run using job.waitForCompletion(true). If one or more reducer tasks get killed or crash during the execution of the job, the entire MapReduce job is restarted and the mappers and reducers are executed again (documentation). Here are my questions:

1] Can we know at the start of the job whether it has started fresh or has restarted because of some failure in a previous run? (This led me to Q2.)

2] Can counters help? Are counter values carried over if some tasks fail, leading to a restart of the whole job?

3] Does Hadoop provide any built-in checkpointing mechanism that keeps track of previous computation and helps avoid redoing the same computation done by mappers and reducers before a failure/crash?

Sorry if the questions are phrased unclearly. Thanks for the help.

  1. A correction to the terminology: a job does not restart if one or more of its tasks fail; a task may get restarted. From a mapper/reducer context you can call https://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapreduce/TaskAttemptContext.html#getTaskAttemptID() , which returns an ID whose last token is the attempt number (see the first sketch after this list).

  2. Counter updates from failed task attempts are not aggregated into the job totals, so there is no fear of overcounting (the second sketch below shows reading the aggregated totals from the driver).

  3. Generally not. The output of a failed task is cleared by the framework. If you are afraid of losing something that is expensive to compute because of a task failure, I would recommend splitting your job into multiple map/reduce phases. You could also maintain your own mutable distributed cache, but that is not recommended either.
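To illustrate point 1, here is a minimal sketch, assuming the new API (org.apache.hadoop.mapreduce); the class name AttemptAwareMapper and the log message are my own, not from the question. The mapper inspects its own attempt number in setup(); an attempt number greater than 0 means this task was retried after a previous attempt failed or was killed:

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class AttemptAwareMapper extends Mapper<LongWritable, Text, Text, LongWritable> {

        @Override
        protected void setup(Context context) {
            // A TaskAttemptID renders as e.g. "attempt_1410000000000_0001_m_000000_0";
            // getId() returns the last token, the attempt number (0 on a fresh start).
            int attempt = context.getTaskAttemptID().getId();
            if (attempt > 0) {
                System.err.println("Task " + context.getTaskAttemptID().getTaskID()
                        + " was restarted; this is attempt #" + attempt);
            }
        }

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Trivial map body, just to keep the sketch self-contained.
            context.write(value, new LongWritable(1));
        }
    }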
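For point 2, a sketch of a driver that reads an aggregated counter after job.waitForCompletion(true) returns. The enum MyCounters.RECORDS_PROCESSED and the class names are hypothetical; a mapper would bump the counter with context.getCounter(MyCounters.RECORDS_PROCESSED).increment(1). Because updates from failed attempts are discarded, the total reflects successful attempts only:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Counter;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class CounterDriver {

        // Hypothetical counter key; any enum constant can serve as one.
        public enum MyCounters { RECORDS_PROCESSED }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "counter-demo");
            job.setJarByClass(CounterDriver.class);
            job.setMapperClass(AttemptAwareMapper.class); // mapper from the sketch above
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(LongWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));

            boolean ok = job.waitForCompletion(true);

            // Counter totals are aggregated from successful task attempts only;
            // updates made by failed or killed attempts do not survive.
            Counter processed = job.getCounters().findCounter(MyCounters.RECORDS_PROCESSED);
            System.out.println("Records processed: " + processed.getValue());
            System.exit(ok ? 0 : 1);
        }
    }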
