
Managing guest executable dependencies - on-premise Service Fabric

We have recently decided to start using on-premise Service Fabric and have encountered a 'dependency' problem.

We have several guest executables with dependencies between them, and they can't recover from a restart of the service they depend on without being restarted themselves.

An example to make it clear:

In the chart below, service B depends on service A. If service A encounters an unexpected error and gets restarted, service B will go into an 'error' state (which won't be reported to the fabric). This means service B will report an OK health state even though it is actually in an error state.

We were thinking of a solution along these lines:

Run an independent service which monitors the health-state events of all replicas/partitions/applications in the cluster and contains the entire dependency tree.

When the health state of a service changes, it restarts that service's direct dependents, which causes a domino effect of events → restarts until the entire subtree has been reset (as shown in the Event → Action flow chart below).
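The restart cascade described above could be sketched roughly as follows. This is an illustrative Python sketch only: the dependency map and the `restart_service` call are hypothetical placeholders for the real cluster APIs.

```python
from collections import deque

# Hypothetical map from each service to the services that depend
# directly on it (B depends on A, C and D depend on B).
DEPENDENTS = {
    "A": ["B"],
    "B": ["C", "D"],
}

def restart_service(name):
    # Placeholder: in practice this would call the cluster's restart API.
    print(f"restarting {name}")

def on_unhealthy(service):
    """Restart every service in the dependency subtree below `service`,
    breadth-first, restarting each dependent at most once."""
    queue = deque(DEPENDENTS.get(service, []))
    restarted = []
    while queue:
        svc = queue.popleft()
        if svc in restarted:
            continue
        restart_service(svc)
        restarted.append(svc)
        # The dependents of the restarted service go down next.
        queue.extend(DEPENDENTS.get(svc, []))
    return restarted
```

So an unhealthy A would trigger restarts of B, then C and D, matching the domino effect in the flow chart.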

[Event → Action flow chart]

The problem is that the healthReport events don't get sent within short intervals of time (meaning my entire system could be down and I wouldn't know for a few minutes). I could monitor the health state directly, but I also need the history (even if the state is healthy now, it doesn't mean it wasn't in an error state earlier).

Another problem is that the events can pop up at any service level (replica/partition), which would require me to aggregate all the events.

I would really appreciate any help on the matter. I am also completely open to other suggestions for this problem, even if they go in a completely different direction.

Cascading failures in services can generally be avoided by introducing fault tolerance at the communication boundaries between services. A few strategies to achieve this:

  • Introduce retries for failed operations, with a delay in between that may grow exponentially. This is an easy option to implement if you are currently doing a lot of remote procedure call (RPC) style communication between services, and it can be very effective if your dependent services don't take too long to restart. Polly is a well-known library for implementing retries.
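A minimal sketch of this retry pattern, in Python for illustration (Polly itself is a .NET library; the helper below only demonstrates exponential backoff, not Polly's API):

```python
import time

def retry(operation, attempts=5, base_delay=0.1):
    """Call `operation`, retrying failed calls with exponentially
    growing delays: base_delay, 2*base_delay, 4*base_delay, ..."""
    for attempt in range(attempts):
        try:
            return operation()
        except Exception:
            if attempt == attempts - 1:
                # Out of attempts: let the failure propagate.
                raise
            time.sleep(base_delay * (2 ** attempt))
```

If service A restarts quickly, B's calls simply succeed on a later attempt instead of leaving B permanently broken.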

  • Use circuit breakers to close down communications with failing services. In this metaphor, a closed circuit is formed between two services communicating normally. The circuit breaker monitors the communications. If it detects some number of failed communications, it 'opens' the circuit, causing any further communications to fail immediately. The circuit breaker then sends periodic queries to the failing service to check its health, and closes the circuit once the failing service becomes operational again. This is a little more involved than retry policies, since you are responsible for preventing an open circuit from crashing your service, and also for deciding what constitutes a healthy service. Polly also supports circuit breakers.
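A simplified sketch of that state machine (thresholds and timeouts are illustrative; a production implementation such as Polly's also handles concurrency and a richer half-open state):

```python
import time

class CircuitBreaker:
    """Opens after `threshold` consecutive failures, fails fast while
    open, and lets one probe call through after `reset_timeout` seconds."""

    def __init__(self, threshold=3, reset_timeout=30.0):
        self.threshold = threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, operation):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                # Open circuit: fail immediately without calling out.
                raise RuntimeError("circuit open: failing fast")
            # Half-open: fall through and allow one probe call.
        try:
            result = operation()
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = time.monotonic()
            raise
        else:
            # Success closes the circuit and clears the failure count.
            self.failures = 0
            self.opened_at = None
            return result
```

The caller must still decide what to do when `call` fails fast, which is the extra responsibility the answer mentions.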

  • Use queues to form fully asynchronous communication between services. Instead of communicating directly from service B to A, queue outbound operations to A inside service B. Process the queue on its own thread - do not allow communication failures to escape the queue processor. You may also add an inbound queue to service A to receive messages from service B's outbound queue, completely isolating message processing from the network. This is probably the most durable option but also the most complex, as it requires a very different architecture from RPC, and you must also decide how to deal with messages which fail repeatedly. You might retry failed messages immediately, send them to the back of the queue after a delay, send them to a dead-letter collection for manual processing, or drop the message altogether. Since you're using guest executables, you don't have the luxury of reliable collections to help with this process, so a third-party solution like RabbitMQ might be useful if you decide to go this way.
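A minimal sketch of the outbound-queue idea, using an in-process queue purely for illustration (a real deployment would back this with a broker such as RabbitMQ, and `send` stands in for whatever transport delivers the message):

```python
import queue
import threading
import time

def start_outbound_worker(send, retry_delay=1.0):
    """Start a worker thread that drains an outbound queue. Send
    failures never escape the processor; the failed message is simply
    re-queued at the back after a delay."""
    outbox = queue.Queue()

    def worker():
        while True:
            msg = outbox.get()
            if msg is None:  # sentinel to stop the worker
                outbox.task_done()
                break
            try:
                send(msg)
            except Exception:
                # One of the strategies from the text: delay, then
                # send the failed message to the back of the queue.
                time.sleep(retry_delay)
                outbox.put(msg)
            finally:
                outbox.task_done()

    threading.Thread(target=worker, daemon=True).start()
    return outbox

# Service B would then enqueue work instead of calling A directly:
# outbox = start_outbound_worker(send_to_service_a)
# outbox.put({"op": "do-something"})
```

Because B only ever touches the queue, a restart of A shows up as a temporarily growing backlog rather than an error state B cannot recover from.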
