Pod time-limit in Kubernetes job — .spec.activeDeadlineSeconds per pod

As explained in the Kubernetes docs on the topic of jobs:

The activeDeadlineSeconds applies to the duration of the job, no matter how many Pods are created. Once a Job reaches activeDeadlineSeconds, all of its running Pods are terminated and the Job status will become type: Failed with reason: DeadlineExceeded.
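For reference, a minimal Job manifest showing where that field lives; it sits on the Job spec, not on the pod template, which is why it cannot act as a per-pod limit (the name and image below are placeholders):

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: item-processor        # placeholder name
spec:
  activeDeadlineSeconds: 600  # caps the entire Job, across all pods it creates
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: worker
        image: my-worker:latest   # placeholder image
```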

However, what I want to do is limit the time of each pod. If a pod takes too long, I want it to fail, but I want the other pods to continue, and for the job to create more pods if necessary.

I'll explain a bit about my task, just to make the problem crystal clear. The job consists of taking items from a Redis database, where the database serves as a sort of queue. Each pod processes one item (well, the number might vary). If a pod takes too long processing an item, I want it to fail. However, the other pods should continue, and the job should continue creating pods and retrieving more items from the database.

Your use case seems identical to this example from the Kubernetes docs.
As you said, activeDeadlineSeconds is not the parameter you should be using here.

I'm not sure why you want the pod to fail if it can't process an item in a given time frame. I see a few different approaches you could take here, but more info on the nature of your problem is needed to know which one fits. One approach would be to set the job's parallelism to the number of pods you'd like to run concurrently and implement the timeout behaviour in the code itself (see the sketch after the list below):

  • If the issue delaying the processing is transient, you would probably want to terminate the current transaction, keep the item in your queue, and restart handling the same item.
  • If the same item has failed x times, it should be removed from the queue and pushed to some kind of dead-letter queue to await troubleshooting at a later point in time.
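A minimal sketch of that approach, assuming the Job's spec.parallelism is set to the desired number of concurrent pods, a Redis list named work-queue holding JSON envelopes of the form {"payload": ..., "attempts": n}, and a hypothetical process_item() function; the deadline, retry budget, dead-letter key, and default Redis connection are all illustrative values:

```python
import json
import multiprocessing
import redis

QUEUE = "work-queue"            # assumed queue key
DEAD_LETTER = "work-queue-dlq"  # assumed dead-letter key
MAX_ATTEMPTS = 3                # illustrative retry budget per item
ITEM_TIMEOUT = 300              # illustrative per-item deadline, in seconds


def process_item(payload):
    """Hypothetical stand-in for the real per-item processing logic."""
    ...


def main():
    r = redis.Redis()  # assumes Redis reachable at the default host/port
    while True:
        # BLPOP blocks until an item is available and returns (key, value).
        _, raw = r.blpop(QUEUE)
        envelope = json.loads(raw)
        worker = multiprocessing.Process(target=process_item,
                                         args=(envelope["payload"],))
        worker.start()
        worker.join(ITEM_TIMEOUT)
        if worker.is_alive():
            # The item ran over its deadline: kill this attempt, then
            # either requeue the item or push it to the dead-letter queue.
            worker.terminate()
            worker.join()
            envelope["attempts"] = envelope.get("attempts", 0) + 1
            if envelope["attempts"] >= MAX_ATTEMPTS:
                r.rpush(DEAD_LETTER, json.dumps(envelope))
            else:
                r.rpush(QUEUE, json.dumps(envelope))


if __name__ == "__main__":
    main()
```

Because the deadline is enforced inside the worker, a slow item only costs one attempt; the pod itself keeps running and moves on to the next item.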

Another approach would be fanning out the messages in the queue in a way that spawns a worker pod for each message, as this example depicts.
Choosing this solution will indeed cause every pod that takes too long processing its item to fail, and if you set the restartPolicy of the pods you create to Never, you will end up with a list of failed pods that corresponds to the number of items that failed processing.
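A template sketch in the spirit of that expansions pattern, assuming one Job is stamped out per queue message; the $ITEM placeholder, the worker image, and the use of the coreutils timeout utility (which must exist in the image) to enforce the per-item deadline are all assumptions:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: process-item-$ITEM      # one Job instantiated per queue message
spec:
  backoffLimit: 0               # do not retry a pod that ran over its deadline
  template:
    spec:
      restartPolicy: Never      # a pod that exceeds the deadline stays Failed
      containers:
      - name: worker
        image: my-worker:latest           # placeholder image
        # `timeout 300` kills the process and exits non-zero if handling
        # one item takes longer than 5 minutes, failing just this pod.
        command: ["timeout", "300", "python", "worker.py", "$ITEM"]
```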

Having said all that, I don't think failing the pods is the right approach here; keeping track of failed processing events should be done using instrumentation, either through container logs or metrics.
