How do you design the architecture of an Erlang/OTP-based distributed fault-tolerant multicore system?

Question

I would like to build an Erlang/OTP-based system which solves an "embarrassingly parallel" problem.

I have already read/skimmed through:

Learn You Some Erlang;
Programming Erlang (Armstrong);
Erlang Programming (Cesarini);
Erlang/OTP in Action.

I have got the gist of processes, messaging, supervisors, gen_servers, logging, etc.

I do understand that certain architecture choices depend on the application concerned, but I would still like to know some general principles of Erlang/OTP system design.

Should I just start with a few gen_servers with a supervisor and incrementally build on that?

How many supervisors should I have? How do I decide which parts of the system should be process-based? How should I avoid bottlenecks?

Should I add logging later?

What is the general approach to the architecture of distributed, fault-tolerant, multiprocessor Erlang/OTP systems?

Answer 1

Should I just start with a few gen_servers with a supervisor and incrementally build on that?

You're missing one key component of Erlang architectures here: applications! (That is, the concept of OTP applications, not software applications.)

Think of applications as components. A component in your system solves a particular problem, is responsible for a coherent set of resources, or abstracts something important or complex away from the rest of the system.

The first step when designing an Erlang system is to decide which applications are needed. Some can be pulled from the web as they are; these we can refer to as libraries. Others you'll need to write yourself (otherwise you wouldn't need this particular system). These applications we usually refer to as the business logic. Often you need to write some libraries yourself as well, but it is useful to keep the distinction between the libraries and the core business applications that tie everything together.
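
As a rough illustration of what "an application" means in code, here is a minimal sketch of an application callback module. The names `crawler_app` and `crawler_sup` are hypothetical, and for brevity the same module also acts as the top-level supervisor; a real project would normally split the two and list its children in `init/1`.

```erlang
%% Hypothetical component named `crawler`; not taken from the original answer.
-module(crawler_app).
-behaviour(application).
-behaviour(supervisor).

-export([start/2, stop/1, init/1]).

%% application callback: must return the pid of the top-level supervisor.
start(_StartType, _StartArgs) ->
    supervisor:start_link({local, crawler_sup}, ?MODULE, []).

%% application callback: called after the application has stopped.
stop(_State) ->
    ok.

%% supervisor callback: no children yet, they are added as the component grows.
init([]) ->
    SupFlags = #{strategy => one_for_one, intensity => 1, period => 5},
    {ok, {SupFlags, []}}.
```

The application resource file (`crawler.app`) would point at this module through its `mod` entry, which is how the runtime knows what to call when the application starts.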

How many supervisors should I have?

You should have one supervisor for each kind of process you want to monitor.

A bunch of identical temporary workers? One supervisor to rule them all.

Different processes with different responsibilities and restart strategies? A supervisor for each type of process, arranged in a correct hierarchy (depending on when things should restart and which other processes need to go down with them).
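
The "identical temporary workers" case could be sketched with a `simple_one_for_one` supervisor, as below. The module name `job_sup` and its `start_job/1` function are made up for this example, and the worker is just a fun run in a `proc_lib` process; treat it as a sketch under those assumptions rather than a recommended layout.

```erlang
%% Hypothetical supervisor for a pool of identical, short-lived workers.
-module(job_sup).
-behaviour(supervisor).

-export([start_link/0, start_job/1, start_worker/1, init/1]).

start_link() ->
    supervisor:start_link({local, ?MODULE}, ?MODULE, []).

%% Start one more worker; Fun is a zero-arity fun holding the work to do.
start_job(Fun) when is_function(Fun, 0) ->
    supervisor:start_child(?MODULE, [Fun]).

%% Called inside the supervisor process, so the worker is linked to it.
start_worker(Fun) ->
    {ok, proc_lib:spawn_link(Fun)}.

init([]) ->
    SupFlags = #{strategy => simple_one_for_one, intensity => 10, period => 10},
    Worker = #{id => job,
               start => {?MODULE, start_worker, []},
               restart => temporary,      % never restarted, just cleaned up
               shutdown => brutal_kill,
               type => worker},
    {ok, {SupFlags, [Worker]}}.
```

Calling `job_sup:start_job(fun() -> do_work() end)` (where `do_work/0` stands for your own code) then adds one worker under the single supervisor.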

Sometimes it is okay to put a bunch of different process types under the same supervisor. This is usually the case when you have a few singleton processes (e.g. one HTTP server supervisor, one ETS table owner process, one statistics collector) that will always run. In that case, it might be too much cruft to have one supervisor for each, so it is common to add them under one supervisor. Just be aware of the implications of using a particular restart strategy when doing this, so that you don't, for example, take down your statistics process when your web server crashes (one_for_one is the most common strategy to use in cases like this).

Be careful not to have any dependencies between processes in a one_for_one supervisor. If a process depends on another process that has crashed, it can crash as well, which uses up the supervisor's restart intensity too quickly and crashes the supervisor itself too soon. This can be avoided by having two different supervisors, each of which controls restarts entirely through its own configured intensity and period (longer explanation).
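
One way to picture the "two different supervisors" advice is a small tree in which each group of processes gets its own supervisor and therefore its own restart budget. Everything named here (`site_sup`, `web_group`, `stats_group`) is hypothetical, and the groups are left empty so the sketch stays self-contained; the intensity and period values are arbitrary.

```erlang
%% Hypothetical supervision tree: one top supervisor, two group supervisors.
-module(site_sup).
-behaviour(supervisor).

-export([start_link/0, init/1]).

start_link() ->
    supervisor:start_link({local, ?MODULE}, ?MODULE, top).

init(top) ->
    %% one_for_one: a crash in one group does not restart the other group.
    SupFlags = #{strategy => one_for_one, intensity => 1, period => 5},
    {ok, {SupFlags, [group_spec(web_group), group_spec(stats_group)]}};
init(Group) when Group =:= web_group; Group =:= stats_group ->
    %% Each group counts its own restarts (here: at most 5 in 10 seconds).
    %% Real workers (HTTP listener, statistics collector, ...) would go here.
    SupFlags = #{strategy => one_for_one, intensity => 5, period => 10},
    {ok, {SupFlags, []}}.

%% Child spec for a nested supervisor implemented by this same module.
group_spec(Name) ->
    #{id => Name,
      start => {supervisor, start_link, [{local, Name}, ?MODULE, Name]},
      restart => permanent,
      type => supervisor,
      modules => [?MODULE]}.
```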

How do I decide which parts of the system should be process-based?

Every concurrent activity in your system should be in its own process. Having the wrong abstraction of concurrency is the most common mistake made by Erlang system designers in the beginning.

Some people are not used to dealing with concurrency; their systems tend to have too little of it. One process, or a few gigantic ones, runs everything in sequence. These systems are usually full of code smells, and the code is very rigid and hard to refactor. They are also slower, because they may not use all the cores available to Erlang.

Other people immediately grasp the concurrency concepts but fail to apply them optimally; their systems tend to overuse the process concept, leaving many processes idle while they wait for others that are doing work. These systems tend to be unnecessarily complex and hard to debug.

In essence, both variants have the same problem: you don't use all the concurrency available to you, and you don't get the maximum performance out of the system.

If you stick to the single responsibility principle and abide by the rule of having one process for every truly concurrent activity in your system, you should be okay.

There are valid reasons to have idle processes. Sometimes they keep important state, sometimes you want to keep some data temporarily and later discard the process, and sometimes they wait on external events. The bigger pitfall is passing important messages through a long chain of largely inactive processes, as it slows down your system with lots of copying and uses more memory.

How should I avoid bottlenecks?

Hard to say; it depends very much on your system and what it's doing. Generally though, if you have a good division of responsibility between applications, you should be able to scale the application that appears to be the bottleneck separately from the rest of the system.

The golden rule here is to measure, measure, measure! Don't assume you have something to improve until you've measured it.
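
For a first, rough measurement, the standard library already offers `timer:tc/1` and `erlang:process_info/2`. The helper module below (`measure`, a name invented for this example) is only a sketch of how they might be wrapped; serious profiling would use dedicated tools.

```erlang
%% Hypothetical helpers for quick, ad-hoc measurements.
-module(measure).
-export([time_ms/1, backlog/1]).

%% Run Fun and return {ElapsedMilliseconds, Result}.
time_ms(Fun) when is_function(Fun, 0) ->
    {Micros, Result} = timer:tc(Fun),
    {Micros / 1000, Result}.

%% Number of unread messages in Pid's mailbox. A number that keeps growing
%% suggests that this process is a bottleneck.
backlog(Pid) when is_pid(Pid) ->
    case erlang:process_info(Pid, message_queue_len) of
        {message_queue_len, N} -> N;
        undefined -> undefined   % the process is no longer alive
    end.
```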

Erlang is great in that it allows you to hide concurrency behind interfaces (known as implicit concurrency). For example, you use a functional module API, a normal module:function(Arguments) interface, which could in turn spawn thousands of processes without the caller having to know about it. If you get your abstractions and your API right, you can always parallelize or optimize a library after you've started using it.
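
A small sketch of implicit concurrency: the function below looks like `lists:map/2` to its caller, yet evaluates every element in its own process. The module name `pwork` is invented for the example, and there is deliberately no limit on the number of processes or any timeout handling.

```erlang
%% Hypothetical module: a parallel map hidden behind an ordinary function call.
-module(pwork).
-export([map/2]).

map(Fun, List) when is_function(Fun, 1), is_list(List) ->
    Parent = self(),
    %% One linked process per element; each result is tagged with a unique ref.
    Refs = [begin
                Ref = make_ref(),
                spawn_link(fun() -> Parent ! {Ref, Fun(X)} end),
                Ref
            end || X <- List],
    %% Collect in the original order by waiting for each ref in turn.
    [receive {Ref, Result} -> Result end || Ref <- Refs].
```

A caller simply writes `pwork:map(fun(X) -> X * 2 end, [1, 2, 3])` and gets `[2, 4, 6]` back, so the sequential implementation can be swapped for this one without touching the calling code.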

That being said, here are some general guidelines:

Try to send messages to the recipient directly; avoid channeling or routing messages through intermediary processes. Otherwise the system just spends time moving messages (data) around without doing any real work.
Don't overuse the OTP design patterns, such as gen_servers. In many cases, you only need to start a process, run some piece of code, and then exit. For this, a gen_server is overkill.

And one bonus piece of advice: don't reuse processes. Spawning a process in Erlang is so cheap and quick that it doesn't make sense to reuse a process once its lifetime is over. In some cases it might make sense to reuse state (e.g. the result of complex parsing of a file), but that state is better stored canonically somewhere else (in an ETS table, a database, etc.).
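
To make the "start a process, run some code, exit" idea concrete, here is a minimal sketch that uses `spawn_monitor/1` instead of a gen_server. The module name `oneshot` and the `{done, Result}` exit convention are assumptions made for this example.

```erlang
%% Hypothetical helper: run a fun in a throwaway process and wait for it.
-module(oneshot).
-export([run/2]).

%% Returns {ok, Result}, {error, Reason} if the fun crashed,
%% or {error, timeout} after Timeout milliseconds.
run(Fun, Timeout) when is_function(Fun, 0) ->
    %% The worker hands back its result through its exit reason.
    {Pid, MRef} = spawn_monitor(fun() -> exit({done, Fun()}) end),
    receive
        {'DOWN', MRef, process, Pid, {done, Result}} -> {ok, Result};
        {'DOWN', MRef, process, Pid, Reason}         -> {error, Reason}
    after Timeout ->
        exit(Pid, kill),                   % discard the process, never reuse it
        erlang:demonitor(MRef, [flush]),   % drop any late 'DOWN' message
        {error, timeout}
    end.
```

Each call gets a fresh process that is thrown away afterwards, which is the pattern the advice above favors over keeping a long-lived worker around.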

Should I add logging later?

You should add logging now! There's a great built-in API called Logger that has shipped with Erlang/OTP since version 21:

logger:error("The file does not exist: ~ts",[Filename]),
logger:notice("Something strange happened!"),
logger:debug(#{got => connection_request, id => Id, state => State},
             #{report_cb => fun(R) -> {"~p",[R]} end}),

This new API has several advanced features and should cover most cases where you need logging. There's also the older but still widely used third-party library Lager.
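
As a small sketch of how Logger might be set up early in a project, the module below sets the primary log level, attaches metadata to the calling process, and logs through the macros from `logger.hrl` (which add module, function and line information). The module name `log_setup` and the `component` metadata key are made up for this example.

```erlang
%% Hypothetical setup module; assumes Erlang/OTP 21 or later.
-module(log_setup).
-export([init/0]).

-include_lib("kernel/include/logger.hrl").

init() ->
    %% Events below `info` are dropped before they reach any handler.
    ok = logger:set_primary_config(level, info),
    %% Metadata set here is attached to every event logged by this process.
    ok = logger:update_process_metadata(#{component => log_setup}),
    ?LOG_INFO("Process ~p ready", [self()]),
    ?LOG_DEBUG("This line is filtered out at level info"),
    ok.
```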

What is the general approach to the architecture of distributed, fault-tolerant, multiprocessor Erlang/OTP systems?

To summarize what's been said above:

Divide your system into applications
Put your processes in the correct supervision hierarchy, depending on their needs and dependencies
Have a process for every truly concurrent activity in your system
Maintain a functional API towards the other components in the system. This lets you:
- Refactor your code without changing the code that's using it
- Optimize code afterwards
- Distribute your system when needed (just make a call to another node behind the API! The caller won't notice!)
- Test the code more easily (less work setting up test harnesses, easier to understand how to use it)
Start by using the libraries available to you in OTP until you need something different (you'll know when the time comes)
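
The point about distributing behind a functional API could look roughly like the sketch below. The module `catalog`, the `remote_node` configuration key and the placeholder lookup are all hypothetical; the only real library calls are `application:get_env/2` and `rpc:call/4`.

```erlang
%% Hypothetical component API that hides where the work actually runs.
-module(catalog).
-export([lookup/1, lookup_local/1]).

%% Public API: callers cannot tell whether the lookup is local or remote.
lookup(Key) ->
    case application:get_env(catalog, remote_node) of
        {ok, Node} when Node =/= node() ->
            %% Note: rpc:call/4 returns {badrpc, Reason} if the call fails.
            rpc:call(Node, ?MODULE, lookup_local, [Key]);
        _ ->
            lookup_local(Key)
    end.

%% Placeholder for the real work; exported so that it can be called via rpc.
lookup_local(Key) ->
    {ok, {Key, erlang:phash2(Key)}}.
```

With no `remote_node` configured, everything runs locally; setting it later moves the work to another node without any change to the callers of `catalog:lookup/1`.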

Common pitfalls:

Too many processes
Too few processes
Too much routing (forwarded messages, chained processes)
Too few applications (I've never seen the opposite case, actually)
Not enough abstraction (makes the code hard to refactor and reason about, and also hard to test!)

Answer 1: 104, accepted, 2011-09-05 12:38:52
