
How to find the reason of a pod crashing?

Is there a way to see why a Kubernetes pod is failing with the status "CrashLoopBackOff" under a heavy load?

I have a HorizontalPodAutoscaler which never kicks in. In its status it always shows low (under 50%) CPU and memory usage.

Tailing the application logs within the pods doesn't give any insights either.

Try looking at the Kubernetes events:

kubectl get events --sort-by='.lastTimestamp'
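
If you already know which pod is crashing, you can narrow the events down to that object (the pod name below is a placeholder):

kubectl get events --field-selector involvedObject.name=<pod-name> --sort-by='.lastTimestamp'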

If you don't get anything meaningful out of the events, go to the specific node and check the kubelet logs:

journalctl -u kubelet
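
For example, you can follow the kubelet logs live, or look only at the window around a crash and filter by the pod name (the time range and name here are just illustrations):

journalctl -u kubelet -f
journalctl -u kubelet --since '10 minutes ago' | grep -i <pod-name>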

To get logs from a pod you should use:

kubectl logs [podname] -p
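
The -p flag shows the logs of the previous (crashed) container instance. For a CrashLoopBackOff it is also worth checking the container's last state and exit code, since an exit code of 137 usually means the container was killed (often OOM-killed); <podname> is a placeholder:

kubectl describe pod <podname>
kubectl get pod <podname> -o jsonpath='{.status.containerStatuses[0].lastState.terminated}'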

You can also look at the kubelet logs, but those are mostly cluster-level logs.

If there are no logs, that means your application did not produce any logs before the crash. You would need to rewrite the app and, for example, add a memory dump on crash.
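
As an illustration only (the question does not say what runtime the app uses), a JVM-based service could be started with flags that write a heap dump when it runs out of memory; app.jar and the dump path are placeholders:

java -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/heapdump.hprof -jar app.jar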

You mentioned that the pod is dying under heavy load, but the stats show only 50% utilization. You should log in to the pod and check the load yourself, and maybe check how many files are open, because you might be hitting a limit.
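
A minimal way to check the open-file situation from inside the pod, assuming the container image has a shell and /proc is readable:

kubectl exec <podname> -- sh -c 'ulimit -n; ls /proc/1/fd | wc -l'

The first number is the per-process file-descriptor limit, the second is how many descriptors the main process (PID 1) currently has open.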

You can read the Kubernetes docs about Application Introspection and Debugging and go over Debugging CrashLoopBackoffs with Init-Containers.

You can also try running your image in Docker and checking the logs there. There is good documentation available about logs and troubleshooting.
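
For example (the image name and container name are placeholders), run the image locally, watch the logs, and check how it exited:

docker run --name crashtest <image:tag>
docker logs crashtest
docker inspect crashtest --format '{{.State.ExitCode}} {{.State.OOMKilled}}'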

If you provide more details, we might be able to help more.

Below are some obvious reasons for CrashLoopBackOff that I have observed:

  1. waiting for some condition to be fulfilled, e.g. some secrets are missing, a health check is failing, etc.
  2. the pod is running with Burstable or BestEffort QoS and is getting killed due to the non-availability of resources on the node (see the check sketched below)
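
A quick, sketched way to check which QoS class the pod actually got, and whether it was OOM-killed (<podname> is a placeholder):

kubectl get pod <podname> -o jsonpath='{.status.qosClass}'
kubectl describe pod <podname> | grep -A3 'Last State'

If you set the container's resource requests equal to its limits, the pod gets Guaranteed QoS, which makes it the last candidate for eviction under node resource pressure.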

You can run this script to find the possible issues for pods in a namespace: https://github.com/dguyhasnoname/k8s-day2-ops/blob/master/namespace_scripts/debug_app_namespace.sh
