
Job scheduling algorithm for cluster

I'm searching for an algorithm suitable for the problem below:

There are multiple computers (the exact number is unknown). Each computer pulls a job from a central queue, completes it, then pulls the next one. Jobs are produced by a group of users. Some users submit many jobs, some only a few. Jobs consume roughly equal CPU time (not exactly, just an approximation).

The central queue should be fair when scheduling jobs. Also, users who submit lots of jobs should still be guaranteed some minimal share of resources.

I'm searching for a good algorithm for this scheduling.

I have considered two candidates:

  1. A Hadoop-like fair scheduler. The problem here: where do I take the minimal shares from when my cluster size is unknown?
  2. Associate a penalty with each user. Increment the penalty whenever one of the user's jobs is scheduled. Schedule a job to a user with probability 1 - (normalized penalty). This is something like stride scheduling, but I could not find a good explanation of it.
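A minimal sketch of the second candidate, assuming the "1 - (normalized penalty)" weighting described above (the class name and data shapes are illustrative, not from the question):

```python
import random
from collections import defaultdict

class PenaltyScheduler:
    """Pick the next user to serve with probability proportional to
    1 - (normalized penalty); charge the chosen user one penalty point."""

    def __init__(self):
        self.penalty = defaultdict(int)  # user -> accumulated penalty

    def pick_user(self, users):
        total = sum(self.penalty[u] for u in users)
        if total > 0:
            # users with a larger share of the total penalty get less weight
            weights = [1.0 - self.penalty[u] / total for u in users]
        else:
            weights = [1.0] * len(users)
        if sum(weights) == 0:
            # degenerate case (e.g. a single user): fall back to uniform
            weights = [1.0] * len(users)
        user = random.choices(users, weights=weights, k=1)[0]
        self.penalty[user] += 1
        return user
```

Because each pick increases the winner's penalty (and so decreases its weight), the allocation self-corrects toward an even split across the users competing at any moment.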

When I implemented a very similar job runner (for a production system), I ended up having each server choose jobtypes at random. This was my reasoning:

  • a glut of jobs from one user should not impact the chance of other users having their jobs run (user-user fairness)

  • a glut of one jobtype should not impact the chance of other jobtypes being run (user-job and job-job fairness)

  • if there is only one jobtype from one user waiting to run, all servers should be running those jobs (no wasted capacity)

  • the system should run the jobs "fairly", ie proportionate to the number of waiting users and jobtypes and not the total waiting jobs (a large volume of one jobtype should not cause scheduling to favor it) (jobtype fairness)

  • the number of servers can vary, and is not known beforehand

  • the waiting jobs, jobtypes and users metadata is known to the scheduler, but not the job data (ie, the usernames, jobnames and counts, but not the payloads)

  • I also wanted each server to be standalone, to schedule its own work autonomously without having to know about the other servers

The solution I settled on was to track the waiting jobs by their {user, jobtype} attribute tuple, and have each scheduling step randomly select 5 tuples and, from each tuple, up to 10 jobs to run next. The selected jobs were shortlisted to be run by the next available runner. Whenever capacity freed up to run more jobs (either because jobs finished or because secondary restrictions prevented them from running), another scheduling step ran to fetch more work.
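That scheduling step can be sketched roughly like this (the constants 5 and 10 come from the description above; the function signature and the dict-of-lists data shape are my assumptions):

```python
import random

TUPLE_SAMPLE = 5      # {user, jobtype} tuples sampled per scheduling step
JOBS_PER_TUPLE = 10   # max jobs shortlisted from each sampled tuple

def schedule_step(waiting_jobs):
    """waiting_jobs: dict mapping (user, jobtype) -> list of waiting job ids.
    Returns a shortlist of jobs for the next available runner."""
    tuples = [t for t, jobs in waiting_jobs.items() if jobs]
    chosen = random.sample(tuples, min(TUPLE_SAMPLE, len(tuples)))
    shortlist = []
    for t in chosen:
        shortlist.extend(waiting_jobs[t][:JOBS_PER_TUPLE])
    return shortlist
```

Sampling tuples (rather than raw jobs) is what gives the fairness property: a tuple with 10,000 waiting jobs has the same chance of being picked as one with a single job.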

Jobs were locked atomically as part of being fetched; the locks prevented them from being fetched again or from participating in further scheduling decisions. If they failed to run they were unlocked, effectively returning them to the pool. The locks timed out, so the server running the jobs was responsible for keeping its locks refreshed (if a server crashed, the others would time out its locks and would pick up and run the jobs it had started but not completed).
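The lock-with-timeout scheme might look like this in simplified, single-process form (in a real deployment this state would live in a shared store such as a database or Redis, and the TTL value is illustrative):

```python
import time

LOCK_TTL = 30.0  # seconds before an unrefreshed lock times out (illustrative)

class JobLocks:
    """Time-stamped locks: fetching a job locks it; the runner must refresh
    the lock periodically, or other servers may reclaim the job."""

    def __init__(self):
        self._locks = {}  # job_id -> expiry timestamp

    def try_lock(self, job_id, now=None):
        now = time.monotonic() if now is None else now
        expiry = self._locks.get(job_id)
        if expiry is not None and expiry > now:
            return False  # held by a live server
        self._locks[job_id] = now + LOCK_TTL  # acquire, or reclaim a stale lock
        return True

    def refresh(self, job_id, now=None):
        now = time.monotonic() if now is None else now
        self._locks[job_id] = now + LOCK_TTL

    def unlock(self, job_id):
        self._locks.pop(job_id, None)  # return the job to the pool
```

In a shared store the acquire-if-free check would need to be a single atomic operation (e.g. a conditional write), not the two-step read-then-write shown here.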

For my use case I wanted users A and B, with jobs A.1, A.2, A.3 and B.1, to have each job get 25% of the resources (even though that means user A gets 75% to user B's 25%). Choosing randomly among the four tuples probabilistically converges to that 25%.

If you want users A and B to each have a 50-50 split of resources, and have A's A.1, A.2 and A.3 get an equal share to B's B.1, you can run a two-level scheduler: randomly choose users, and from those users choose jobs. That will distribute the resources equally among users, and within each user's jobs equally among the jobtypes.
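A sketch of that two-level selection, uniform at each level (the function name and nested-dict shape are my assumptions):

```python
import random

def pick_job(waiting):
    """waiting: dict user -> dict jobtype -> list of waiting jobs.
    Level 1: pick a user uniformly; level 2: pick one of that user's
    jobtypes uniformly; then take the oldest job of that jobtype."""
    users = [u for u, types in waiting.items() if any(types.values())]
    if not users:
        return None
    user = random.choice(users)
    jobtypes = [t for t, jobs in waiting[user].items() if jobs]
    jobtype = random.choice(jobtypes)
    return waiting[user][jobtype].pop(0)
```

With users A (jobtypes A.1, A.2, A.3) and B (jobtype B.1), this gives each user 50% of picks, so B.1 gets 50% while each of A's jobtypes gets about 16.7%.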

A huge number of jobs of a particular jobtype will take a long time to all complete, but that's always going to be the case. By picking across users and then jobtypes, the responsiveness of the job processing will not be adversely impacted.

There are lots of secondary restrictions that can be added (eg, no more than 5 calls per second to linkedin), but the above is the heart of the system.

You could try the Torque resource manager and the Maui batch job scheduler from Adaptive Computing. Maui's policies are flexible enough to fit your needs: it supports backfill, configurable job and user priorities, and resource reservations.
