
Load-balancing in parallel processing application

I'm building a network-distributed parallel processing application that uses a combination of CPU and GPU resources across many machines.

The app has to perform some very computationally expensive operations on a very large dataset over thousands of iterations:

// one simulation frame per step: apply G*f() to every cell of the 3-D matrix
for (int step = 0; step < requestedIterations; step++)
    for (int i = 0; i < width; i++)
        for (int j = 0; j < height; j++)
            for (int k = 0; k < depth; k++)
                matrix[i, j, k] = G * f(matrix[i, j, k]);

Also, the matrix operations have to be executed synchronously: that is, each iteration depends on the results of the frame that came immediately before it.

The hardware available in this ad-hoc grid, comprising both dedicated servers and idle desktop machines, varies greatly in performance from machine to machine. I'm wondering about the best way to balance the workload across the entire system.

Some idiosyncrasies:

  1. The grid should be as robust as possible. Some simulations require weeks to run, and it would be nice not to have to cancel a run if one out of 100 machines goes offline.

  2. Some of the lower-end machines (desktops that are idle but have to wake up when someone logs in) may join and leave the grid at any time.

  3. The dedicated servers may also join and leave the grid, but this is predictable.

So far, the best idea I've been able to come up with is:

  1. Have each node track the time it takes to process a group of n cells in the matrix (cells processed per unit time) and report this to a central repository.
  2. Weight this time against the total time for a frame of the simulation (across the entire grid) and the total size of the problem domain. So, each node would get a score expressed in work units (matrix cells) per unit time, and a scalar rating expressing its performance versus the rest of the grid.
  3. On each frame, distribute the workload based on those scores so that each machine finishes as close to the same time as possible (see the sketch after this list). If machine A is 100x faster than machine B, it will receive 100x as many matrix cells to process in a given frame (assuming that the matrix size is large enough to warrant including the extra machines).
  4. Nodes that leave the grid (desktops that are logged into, etc.) will have their workload redistributed among the remaining nodes.
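Roughly, step 3 could look like the following minimal sketch. NodeStats, FrameBalancer, and the proportional-split logic are my own illustrative names and assumptions, not anything decided yet:

using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical per-node measurement: cells processed per second on the previous frame.
record NodeStats(string NodeId, double CellsPerSecond);

static class FrameBalancer
{
    // Split totalCells in proportion to each node's measured throughput so that
    // every node should finish its slab at roughly the same time.
    public static Dictionary<string, (long Start, long Count)> Split(
        IReadOnlyList<NodeStats> nodes, long totalCells)
    {
        double totalRate = nodes.Sum(n => n.CellsPerSecond);
        var plan = new Dictionary<string, (long Start, long Count)>();
        long cursor = 0;
        for (int i = 0; i < nodes.Count; i++)
        {
            // The last node absorbs the rounding remainder so no cell is lost.
            long count = (i == nodes.Count - 1)
                ? totalCells - cursor
                : (long)Math.Round(totalCells * nodes[i].CellsPerSecond / totalRate);
            plan[nodes[i].NodeId] = (cursor, count);
            cursor += count;
        }
        return plan;
    }
}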

Or,

Arrange the nodes in a tree structure, where each node has a "weight" assigned. Nodes that are higher in the tree have a weight based on their own ability combined with that of their children. This weight is adjusted per frame. When a node loses communication with a child, it uses a cached tree graph to contact the orphaned children and re-balance its branch.
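A rough sketch of that tree idea, where a node's weight is its own measured throughput plus that of everything below it (all names here are mine, purely illustrative):

using System.Collections.Generic;
using System.Linq;

// Hypothetical node in the load-balancing tree.
class GridNode
{
    public string Id { get; init; } = "";
    public double OwnCellsPerSecond { get; set; }   // measured locally each frame
    public List<GridNode> Children { get; } = new();

    // Effective weight = own throughput plus everything reachable below this node.
    public double Weight => OwnCellsPerSecond + Children.Sum(c => c.Weight);

    // Divide a slab of cells among this node and its subtree in proportion to weight.
    public IEnumerable<(string NodeId, long Cells)> Distribute(long cells)
    {
        double total = Weight;
        long given = 0;
        foreach (var child in Children)
        {
            long share = (long)(cells * child.Weight / total);
            foreach (var assignment in child.Distribute(share))
                yield return assignment;
            given += share;
        }
        yield return (Id, cells - given);   // this node keeps the remainder
    }
}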

If it makes a difference, the app is a combination of C# and OpenCL.

Links to papers, example apps, and especially tutorials are welcome.

Edit

This isn't homework. I'm turning a simulator I wrote as part of my thesis into a more useful product. Right now the work is distributed uniformly, with no accounting for the performance of each machine and no facility to recover from machines joining or leaving the grid.

Thanks for the excellent, detailed responses.

For heterogeneous clusters, I like to let each processor request a new job as the processor becomes available. Implementation involves a lightweight server that can handle many requests at a time (but usually only returns a job number). It might go something like this (sketched in code after the list):

  • Break the job down into its smallest components (we know there are 1000 tasks now)
  • Start a network server (preferably UDP with timeouts to avoid network congestion) which counts upwards
  • Start your cluster processes.
  • Each process asks, "What job number should I perform?" and the server replies with a number
  • As the process finishes, it asks for the next job number. When all tasks are complete, the server returns a -1 to the processes, so they shut down.
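As a very rough sketch of such a counting server, where the port, the plain-text wire format, and the class name are my assumptions rather than part of the suggestion above:

using System;
using System.Net;
using System.Net.Sockets;
using System.Text;

// Minimal job-number server: any datagram from a worker means "give me a job";
// the reply is the next job number, or -1 once all jobs have been handed out.
class JobServer
{
    const int Port = 9000;        // hypothetical port
    const int TotalJobs = 1000;   // "we know there are 1000 tasks now"

    static void Main()
    {
        using var udp = new UdpClient(Port);
        int next = 0;
        var remote = new IPEndPoint(IPAddress.Any, 0);

        while (true)
        {
            udp.Receive(ref remote);                    // any datagram = a job request
            int job = next < TotalJobs ? next++ : -1;   // -1 tells the worker to shut down
            byte[] reply = Encoding.ASCII.GetBytes(job.ToString());
            udp.Send(reply, reply.Length, remote);
        }
    }
}

A worker would send any small datagram to ask for work, set a receive timeout, and retry if no reply arrives; that is where the "UDP with timeouts" part comes in.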

This is a lighter-weight alternative to what you suggest above. Your fast processors still do more work than your slower machines, but you don't have to calculate how long the tasks take. If a processor drops out for whatever reason, it simply stops asking for tasks. Your server could choose to recycle task numbers after a certain amount of time.

This is pretty much what a cluster scheduler would do on its own, except the processors don't have startup and shutdown costs, so your individual tasks can be smaller without penalty.

I would go for a decentralized solution.

Every node picks (rather than is given) the same amount of work from the center. After a few runs, every node can work out its own average computing power and communicate it to the others.

Eventually, every node will have a table of every other node's average computing power. Having this information (it could even be persisted, why not?), each node can decide to "ask" some other node with more power to take on a piece of work, delegating it by signing a contract.

Before starting each piece of work, every node has to broadcast: "I am starting X". Once it finishes, it always broadcasts: "I finished X".

Well, it's not so easy, because there will be cases where you begin a job, your hard disk fails, and you never finish it. The others, especially those waiting on your result, should detect this, pick your job back out of the basket, and start it again from the beginning. This is where a "ping" technique with a timer comes in.
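A minimal sketch of that ping-with-a-timer bookkeeping (the class name, the message handlers, and the idea of a fixed timeout are my assumptions, not the answer's):

using System;
using System.Collections.Concurrent;
using System.Linq;

// Tracks "I am starting X" broadcasts and reclaims jobs whose owner has gone silent.
class InFlightTracker
{
    private readonly TimeSpan _timeout;
    // jobId -> (owner node, time of the last ping heard from it)
    private readonly ConcurrentDictionary<int, (string Owner, DateTime LastPing)> _inFlight = new();

    public InFlightTracker(TimeSpan timeout) => _timeout = timeout;

    public void OnStarted(int jobId, string owner) => _inFlight[jobId] = (owner, DateTime.UtcNow);

    public void OnPing(int jobId, string owner) => _inFlight[jobId] = (owner, DateTime.UtcNow);

    public void OnFinished(int jobId) => _inFlight.TryRemove(jobId, out _);

    // Jobs whose owner has not pinged within the timeout go back into the basket.
    public int[] ReclaimStale() =>
        _inFlight.Where(kv => DateTime.UtcNow - kv.Value.LastPing > _timeout)
                 .Select(kv => { _inFlight.TryRemove(kv.Key, out _); return kv.Key; })
                 .ToArray();
}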

Bad: the initial tuning phase can take a non-trivial amount of time.

Good: you will have an almost fault-tolerant solution. Leave the nodes alone for a week, and even if some of them fail, your grid stays alive and does its work.

Many years ago I did something like this with pretty good results, though definitely not on the scale you describe. And scale, actually, makes a difference.

So the choice is up to you.

Hope this helps.

I wouldn't bother tracking those stats too much at the server level. You're going to introduce a fair amount of overhead.

Instead, the control server should just maintain a list of work units. As a client becomes available, let it grab the next unit in line and process it. Rinse, repeat.

Once the list of work units for a given matrix is exhausted, allow currently incomplete work units to be reassigned.
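A sketch of that bookkeeping on the control server might look like this; the class and method names are illustrative, not from the answer:

using System;
using System.Collections.Generic;
using System.Linq;

// Hands out work units in order, then re-hands-out incomplete ones once the list is exhausted.
class WorkUnitDispatcher
{
    private readonly Queue<int> _pending;
    private readonly Dictionary<int, DateTime> _inProgress = new();  // unit -> time handed out
    private readonly HashSet<int> _done = new();

    public WorkUnitDispatcher(int unitCount) =>
        _pending = new Queue<int>(Enumerable.Range(0, unitCount));

    // Called when a client checks in asking for work. Returns null once the matrix is finished.
    public int? NextUnit()
    {
        if (_pending.Count > 0)
        {
            int unit = _pending.Dequeue();
            _inProgress[unit] = DateTime.UtcNow;
            return unit;
        }
        // Pending list exhausted: re-hand out the longest-outstanding incomplete unit.
        var stale = _inProgress.Keys.OrderBy(u => _inProgress[u]).Cast<int?>().FirstOrDefault();
        if (stale is int u) _inProgress[u] = DateTime.UtcNow;   // move it to the back of the line
        return stale;
    }

    // Called when a client reports a finished unit; duplicate results are simply ignored.
    public void Complete(int unit)
    {
        _done.Add(unit);
        _inProgress.Remove(unit);
    }
}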

Examples are based on a matrix containing 10 work units and 5 servers.

Equally fast, all available:

Server 1 checks in and grabs unit 1. This proceeds for the next 4 machines (i.e. Server 2 gets unit 2, and so on). When unit 1 is done, Server 1 then grabs unit 6. The others grab the rest. Once the last server checks in, the matrix is done.

Low Disparate Performance, all available:
You start the round robin again and the first 5 units are acquired by the servers. However, Server 1 takes 30% longer than the others. This means Server 2 will grab unit 6, and so on. At some point Server 1 will check in with unit 1; meanwhile units 2 through 5 will have been completed and units 6 through 10 will have been assigned. Server 1 is assigned unit 6, as it's not done yet. However, Server 2 will check in its completed work before Server 1 finishes. No big deal, just throw away that last result.

Huge Disparate Performance, all available:
You start the round robin again and the first 5 units are acquired by the servers. Let's say Server 1 takes 400% more time than the others. This means Server 2 will grab unit 6, and so on. After Server 2 checks in with unit 6, it will see that unit #1 is still in process. Go ahead and assign it to Server 2, which will complete it before Server 1 returns.

In this case you should probably monitor for those machines that are consistently reporting work late and drop them from further consideration. Of course, you will have to make some allowances for those that go offline due to shutdown or personal usage. Probably some type of weighted rating where, once it drops below a certain threshold, you simply deny the machine further work; perhaps the rating is reset every so often to allow rebalancing once things settle into a steady state.
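One way such a weighted rating could be kept per machine (the decay factor and threshold here are illustrative assumptions):

using System;

// Exponentially weighted "lateness" rating per machine.
class MachineRating
{
    private double _score = 1.0;          // 1.0 = on time, lower = habitually late
    private const double Decay = 0.8;     // how much history to keep per frame
    private const double Threshold = 0.5; // below this, stop handing the machine work

    // onTimeFraction: 1.0 if the unit came back within the expected window, 0.0 if very late.
    public void Report(double onTimeFraction) =>
        _score = Decay * _score + (1 - Decay) * onTimeFraction;

    public bool EligibleForWork => _score >= Threshold;

    // Reset every so often so a machine that was merely busy can rejoin.
    public void Reset() => _score = 1.0;
}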

Machine disappears:
This has the exact same plan as the "Huge Disparate Performance" case listed above. The only difference is that the machine will either never report in, or will do so after some unknown amount of time.

If for some reason you have more machines than units, then an interesting thing happens: multiple servers will be assigned the same work unit right off the bat. You can either stop this by putting some type of delay in place (like requiring a unit to be in process for x minutes before allowing it to be reassigned) or simply allow it to happen. This should be thought through.


What have we done? First, we alleviated the need to track individual performance. Second, we've allowed machines to just disappear while making sure the work still gets completed. Third, we've ensured that the work will be completed in the least amount of time possible.

It's a little more chatty than simply assigning blocks of multiple units to machines based on performance; however, this allows even the fast machines to be unplugged from the network while ensuring total recoverability. Heck, you could kill all of the machines and later turn some of them back on to pick up where you left off.
