How to determine the cause of an AKS Kubernetes cluster failure

I have a production AKS Kubernetes cluster hosted in UK South that has become unstable and unresponsive:

[Image 1]

From the image, you can see that I have several pods in varying unready states (i.e. Terminating/Unknown), and the ones that report as running are inaccessible.

I can see from the Insights grid that the issue started at around 9:50 pm last night:

[Image 2]

I've scoured the logs in the AKS service itself and the Kibana logs for the apps running on the cluster around the time of the failure, but I am struggling to see anything that looks to have caused this.

Luckily, I have two clusters serving production under a Traffic Manager, so I have routed all traffic to the healthy one. My worry is that I need to understand what caused this, especially if the same happens on the other one, as there will be production downtime while I spin up a new cluster.

My question is: am I missing any obvious places to look for information on what caused the issue? Are there any event logs that may point to what the problem is?

I would suggest examining the K8s event log around the time your nodes went "not ready".

Try opening the "Insights" Nodes tab and choose a timeframe at the top around the time when things went wrong. See what the node statuses are. Any pressures? You can see that in the property panel to the right of the node list. The property panel also contains a link to the event logs for that timeframe. Note, though, that the event-log link on the node's property panel constructs a complicated query to show only events tagged with that node.

You can get this information with simpler queries (and run more fun queries as well) in Logs. Open the "Logs" tab in the left menu on the cluster and execute a query similar to this one (change the time interval to the one you need):

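// Time window to inspect (UTC); adjust it to bracket the failure.
let startDateTime = datetime('2019-01-01T13:45:00.000Z');
let endDateTime = datetime('2019-01-02T13:45:00.000Z');
// Kubernetes events, newest first. On newer Container insights workspaces
// this table may be named KubeEvents rather than KubeEvents_CL.
KubeEvents_CL
| where TimeGenerated >= startDateTime and TimeGenerated < endDateTime
| order by TimeGenerated desc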
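
If the event list is long, it can help to export the rows and tally them by reason to see which kind of event dominates around the failure. Below is a minimal Python sketch of that idea, not part of the original answer. It assumes the rows were exported as dictionaries with `TimeGenerated` and `Reason` keys; those column names, the timestamp format, and the sample rows are assumptions, so adjust them to whatever your table actually returns.

```python
from collections import Counter
from datetime import datetime, timezone


def parse_utc(value):
    """Parse a timestamp like '2019-01-01T21:50:00Z' as an aware UTC datetime."""
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def count_event_reasons(rows, start, end):
    """Tally rows per Reason where start <= TimeGenerated < end, most frequent first."""
    counts = Counter(
        row["Reason"]
        for row in rows
        if start <= parse_utc(row["TimeGenerated"]) < end
    )
    return counts.most_common()


# Hypothetical sample rows for illustration only, not real cluster output.
sample_rows = [
    {"TimeGenerated": "2019-01-01T21:49:10Z", "Reason": "NodeNotReady"},
    {"TimeGenerated": "2019-01-01T21:50:30Z", "Reason": "NodeNotReady"},
    {"TimeGenerated": "2019-01-01T21:51:00Z", "Reason": "FailedScheduling"},
    {"TimeGenerated": "2019-01-03T08:00:00Z", "Reason": "Pulled"},  # outside the window
]

window_start = parse_utc("2019-01-01T13:45:00Z")
window_end = parse_utc("2019-01-02T13:45:00Z")
for reason, count in count_event_reasons(sample_rows, window_start, window_end):
    print(reason, count)
```

The window uses the same half-open interval as the query (`>=` start, `<` end), so the counts should line up with what the Logs blade shows for the same period.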

See if you have events indicating what went wrong. It may also be of interest to look at the node inventory on your cluster. Nodes report their K8s status. It was "Ready" prior to the problem... then something went wrong. What is the status? Out of disk, by chance?

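// Same time window as for the events query; adjust as needed.
let startDateTime = datetime('2019-01-01T13:45:00.000Z');
let endDateTime = datetime('2019-01-02T13:45:00.000Z');
// Node inventory records (including reported node status), newest first.
KubeNodeInventory
| where TimeGenerated >= startDateTime and TimeGenerated < endDateTime
| order by TimeGenerated desc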
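
To pin down exactly when a node stopped reporting "Ready", one option is to export the inventory rows and look only at the points where a node's status changed. This is a rough Python sketch under assumptions, not something from the original answer: it assumes each exported row has `TimeGenerated`, `Computer` (the node name) and `Status` keys, and that all timestamps share one ISO-8601 format. The node names and the "NotReady" value in the sample are hypothetical.

```python
def status_changes(rows):
    """Return (time, node, old_status, new_status) for every status change, oldest first.

    Timestamps are compared as strings, which is only valid when they all use
    the same ISO-8601 format (an assumption about the export).
    """
    last_status = {}
    changes = []
    for row in sorted(rows, key=lambda r: r["TimeGenerated"]):
        node, status = row["Computer"], row["Status"]
        previous = last_status.get(node)
        if previous is not None and previous != status:
            changes.append((row["TimeGenerated"], node, previous, status))
        last_status[node] = status
    return changes


# Hypothetical sample rows for illustration only, not real cluster output.
sample_inventory = [
    {"TimeGenerated": "2019-01-01T21:50:00Z", "Computer": "node-b", "Status": "Ready"},
    {"TimeGenerated": "2019-01-01T21:50:00Z", "Computer": "node-a", "Status": "NotReady"},
    {"TimeGenerated": "2019-01-01T21:45:00Z", "Computer": "node-b", "Status": "Ready"},
    {"TimeGenerated": "2019-01-01T21:45:00Z", "Computer": "node-a", "Status": "Ready"},
]

for change in status_changes(sample_inventory):
    print(change)
```

The status is treated as an opaque string on purpose, so the sketch reports any change without needing to know which values your cluster actually emits. The first change listed for a node gives a timestamp to narrow the events query around.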

Just a hunch, but check https://github.com/Azure/AKS/issues/305, which has some steps to identify and correct this issue.
