How to determine the cause of an AKS Kubernetes cluster failure
I have a production AKS Kubernetes cluster hosted in UK South that has become unstable and unresponsive:
From the image, you can see that I have several pods in varying states of unreadiness (i.e. Terminating/Unknown), and the ones that report as Running are inaccessible.
I can see from the Insights grid that the issue started at around 9.50pm last night.
I've scoured the logs in the AKS service itself, and the Kibana logs for the apps running on the cluster around the time of the failure, but I am struggling to see anything that looks like it caused this.
Luckily I have two clusters serving production behind a traffic manager, so I have routed all traffic to the healthy one. My worry is that I need to understand what caused this, especially in case the same thing happens to the other cluster, as there would be production downtime while I spin up a new one.
My question is: am I missing any obvious places to look for information on what caused the issue? Are there any event logs that may point to what the problem is?
I would suggest examining the K8s event log around the time your nodes went "Not Ready".
Try opening the "Nodes" tab in Insights and choosing a timeframe up top around the time when things went wrong. See what the node statuses are. Any pressures? You can see that in the property panel to the right of the node list. The property panel also contains a link to the event logs for that timeframe. Note, though, that the event-log link on the node's property panel constructs a complicated query to show only the events tagged with that node.
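A simplified version of that per-node query might look like the sketch below. The node name and the node-name column are assumptions for illustration; custom-log (`_CL`) column names carry type suffixes, so check the KubeEvents_CL schema in your own workspace first:

```kusto
// Events tagged with a single node (hypothetical node name; adjust the
// column name to whatever your workspace schema uses, e.g. Computer or Name_s).
KubeEvents_CL
| where TimeGenerated >= datetime('2019-01-01T13:45:00.000Z')
    and TimeGenerated < datetime('2019-01-02T13:45:00.000Z')
| where Computer == "aks-nodepool1-12345678-0"
| order by TimeGenerated desc
```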
You can get this information with simpler queries (and run more fun queries as well) in Logs. Open the "Logs" tab in the left menu on the cluster and execute a query similar to this one (change the time interval to the one you need):
let startDateTime = datetime('2019-01-01T13:45:00.000Z');
let endDateTime = datetime('2019-01-02T13:45:00.000Z');
KubeEvents_CL
| where TimeGenerated >= startDateTime and TimeGenerated < endDateTime
| order by TimeGenerated desc
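If the full event stream is too noisy, one option is to narrow the query to warnings. The `KubeEventType_s` column name here is an assumption; verify it against your workspace's KubeEvents_CL schema:

```kusto
let startDateTime = datetime('2019-01-01T13:45:00.000Z');
let endDateTime = datetime('2019-01-02T13:45:00.000Z');
KubeEvents_CL
| where TimeGenerated >= startDateTime and TimeGenerated < endDateTime
// Kubernetes events are typed "Normal" or "Warning"; the warnings are
// usually the interesting ones when diagnosing a failure.
| where KubeEventType_s == "Warning"
| order by TimeGenerated desc
```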
See if you have events indicating what went wrong. Also of interest, you can look at the node inventory on your cluster. Nodes report their K8s status. It was "Ready" prior to the problem... then something went wrong. What is the status now? Out of disk, by chance?
let startDateTime = datetime('2019-01-01T13:45:00.000Z');
let endDateTime = datetime('2019-01-02T13:45:00.000Z');
KubeNodeInventory
| where TimeGenerated >= startDateTime and TimeGenerated < endDateTime
| order by TimeGenerated desc
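To surface only the unhealthy records, a sketch like the following may help. The `Status` and `Computer` column names are taken from the commonly documented KubeNodeInventory schema; confirm them in your workspace:

```kusto
let startDateTime = datetime('2019-01-01T13:45:00.000Z');
let endDateTime = datetime('2019-01-02T13:45:00.000Z');
KubeNodeInventory
| where TimeGenerated >= startDateTime and TimeGenerated < endDateTime
// Keep only the records where a node was not reporting Ready,
// and count them per node and per reported status.
| where Status != "Ready"
| summarize count() by Computer, Status
```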
Just a hunch, but check https://github.com/Azure/AKS/issues/305, which contains some steps to identify and remediate this issue.