
Kubernetes pods disappear after failed jobs

I am running Kubernetes jobs via cron. In some cases the jobs may fail and I want them to restart. I'm scheduling the jobs like this:

kubectl run collector-60053 --schedule="30 10 * * * *" --image=gcr.io/myimage/collector --restart=OnFailure --command -- node collector.js

I'm having a problem where some of these jobs are running and failing, but the associated pods are disappearing, so I have no way to look at the logs and they are not restarting.

For example:

$ kubectl get jobs | grep 60053
collector-60053-1546943400     1         0            1h
$ kubectl get pods -a | grep 60053
$    # nothing returned

This is on Google Cloud Platform running 1.10.9-gke.5.

Any help would be much appreciated!

EDIT:

I discovered some more information. I have auto-scaling set up on my GCP cluster. I noticed that when nodes are removed, the pods are also removed (along with their metadata). Is that expected behavior? Unfortunately this gives me no easy way to look at the pod logs.

My theory is that as pods fail, CrashLoopBackOff kicks in and eventually the autoscaler decides that the node is no longer needed (it doesn't see the pod as an active workload). At that point the node goes away, and so do the pods. I don't think this is expected behavior with restart=OnFailure, but I basically witnessed it by watching closely.
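
One way to check whether the autoscaler is removing nodes underneath the failing pods is to look at recent cluster events. This is only a quick sketch; the grep pattern is an illustration and may need adjusting to match your event messages:

# List recent events sorted by time and look for node scale-down / removal activity
kubectl get events --all-namespaces --sort-by=.metadata.creationTimestamp | grep -iE 'scale|delet|remov'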

After digging much further into this issue, I have a better understanding of my situation. According to issue 54870 on the Kubernetes repository, there are some problems with jobs when set to Restart=OnFailure.

I have changed my configuration to use Restart=Never and to set a backoffLimit for the job. Even though restart is set to Never, in my testing Kubernetes will still re-create failed pods up to the backoffLimit and keep the failed pods around for inspection.
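
For reference, a minimal CronJob spec reflecting this setup might look like the following. This is only a sketch: the apiVersion matches what was available on Kubernetes 1.10 (batch/v1beta1), the backoffLimit value of 4 is an arbitrary example, and the schedule is shown in standard five-field cron form.

apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: collector-60053
spec:
  schedule: "30 10 * * *"
  jobTemplate:
    spec:
      backoffLimit: 4             # number of pod retries before the Job is marked failed
      template:
        spec:
          restartPolicy: Never    # failed pods are kept and replaced, not restarted in place
          containers:
          - name: collector
            image: gcr.io/myimage/collector
            command: ["node", "collector.js"]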

Get the name of the job:

kubectl get jobs --watch

Find the pod for the last scheduled job:

pods=$(kubectl get pods --selector=job-name=nameofjob-xxxxx --output=jsonpath='{.items..metadata.name}')

Get the pod logs:

echo $pods
kubectl logs $pods
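
If the pods have already been cleaned up, the Job object itself still records the outcome, so describing it can at least show the failure count and related events (standard kubectl; the job name below is a placeholder):

kubectl describe job nameofjob-xxxxx    # shows succeeded/failed counts and recent events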

