
involuntary disruptions / SIGKILL handling in microservice following saga pattern

Should I engineer my microservice to handle involuntary disruptions like hardware failure? Are these disruptions frequent enough that they need to be handled in a service running on an AWS managed EKS cluster?
Should I consider some design change in the service to handle an unexpected SIGKILL, for example by persisting the data at each step, or would that be considered over-engineering?

What standard way would you suggest for handling these involuntary disruptions if the service is:
a) a RESTful service that typically responds in 1 s (and follows the saga pattern), or
b) a service that processes a large 1 GB file over the course of 1 hour?
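
To make the "persisting the data at each step" idea concrete for case (b), this is roughly what I have in mind: write a progress checkpoint to durable storage after every chunk, so a replacement Pod can resume instead of starting over. The paths and helper names (checkpointPath, processChunk) are placeholders, not existing code:

```go
// Hypothetical sketch: resumable processing of a large file.
// After each chunk, the current offset is persisted so a restarted
// Pod can resume instead of reprocessing the whole file.
package main

import (
	"io"
	"log"
	"os"
	"strconv"
)

const checkpointPath = "/data/progress.offset" // must live on durable storage (e.g. a PVC)

func loadOffset() int64 {
	b, err := os.ReadFile(checkpointPath)
	if err != nil {
		return 0 // no checkpoint yet: start from the beginning
	}
	off, _ := strconv.ParseInt(string(b), 10, 64)
	return off
}

func saveOffset(off int64) error {
	return os.WriteFile(checkpointPath, []byte(strconv.FormatInt(off, 10)), 0o644)
}

// processChunk stands in for the real per-chunk work; it must be idempotent,
// because a SIGKILL can land after the work but before the checkpoint is written.
func processChunk(chunk []byte) error { return nil }

func main() {
	f, err := os.Open("/data/input.bin")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	// Resume from the last checkpoint, if any.
	offset := loadOffset()
	if _, err := f.Seek(offset, io.SeekStart); err != nil {
		log.Fatal(err)
	}

	buf := make([]byte, 4<<20) // 4 MiB chunks
	for {
		n, err := f.Read(buf)
		if n > 0 {
			if perr := processChunk(buf[:n]); perr != nil {
				log.Fatal(perr)
			}
			offset += int64(n)
			if cerr := saveOffset(offset); cerr != nil {
				log.Fatal(cerr)
			}
		}
		if err == io.EOF {
			break
		}
		if err != nil {
			log.Fatal(err)
		}
	}
	log.Printf("done, processed %d bytes", offset)
}
```

For case (a) the equivalent would be persisting each saga step's outcome (with an idempotency key) before acknowledging it, so a retried request does not repeat completed steps.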

There are a couple of ways to handle those disruptions. As mentioned here:

Here are some ways to mitigate involuntary disruptions:

  • Ensure your pod requests the resources it needs.
  • Replicate your application if you need higher availability. (Learn about running replicated stateless and stateful applications.)
  • For even higher availability when running replicated applications, spread applications across racks (using anti-affinity) or across zones (if using a multi-zone cluster.)

The frequency of voluntary disruptions varies.

So:

  • if your budget allows it, spread your app across zones or racks; you can use Node affinity to schedule Pods on certain nodes,
  • make sure to configure replicas; this ensures that when one Pod receives SIGKILL, the load is automatically directed to another Pod. You can read more about this here.
  • consider using DaemonSets, which ensure each Node runs a copy of a Pod.
  • use Deployments for stateless apps and StatefulSets for stateful apps.
  • the last thing you can do is to write your app to be disruption tolerant (see the sketch after this list).
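
As a rough illustration of that last point: Kubernetes normally sends SIGTERM first and only escalates to SIGKILL after the termination grace period (30 seconds by default), so a disruption-tolerant service should at least stop accepting new requests and drain in-flight work when SIGTERM arrives. A minimal Go sketch, where the port and drain timeout are assumptions:

```go
// Minimal sketch of a disruption-tolerant HTTP service:
// on SIGTERM (sent by Kubernetes before SIGKILL) it stops accepting
// new requests and drains in-flight ones before exiting.
package main

import (
	"context"
	"log"
	"net/http"
	"os/signal"
	"syscall"
	"time"
)

func main() {
	mux := http.NewServeMux()
	mux.HandleFunc("/work", func(w http.ResponseWriter, r *http.Request) {
		// the "typically 1 s" saga step would run here
		w.Write([]byte("ok"))
	})

	srv := &http.Server{Addr: ":8080", Handler: mux}

	// ctx is cancelled when SIGTERM or SIGINT arrives.
	ctx, stop := signal.NotifyContext(context.Background(), syscall.SIGTERM, syscall.SIGINT)
	defer stop()

	go func() {
		if err := srv.ListenAndServe(); err != nil && err != http.ErrServerClosed {
			log.Fatal(err)
		}
	}()

	<-ctx.Done() // termination requested by Kubernetes

	// Drain within the grace period; SIGKILL cannot be caught, so finish before it.
	shutdownCtx, cancel := context.WithTimeout(context.Background(), 20*time.Second)
	defer cancel()
	if err := srv.Shutdown(shutdownCtx); err != nil {
		log.Printf("forced shutdown: %v", err)
	}
}
```

SIGKILL itself cannot be caught, so anything that must survive it (saga state, file-processing progress) still has to be persisted outside the Pod, as discussed in the question.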

I hope this clears things up a little bit for you; feel free to ask more questions.
