
Condor job using DAG with some jobs needing to run on the same host

I have a computation task which is split into several individual program executions, with dependencies. I'm using Condor 7 as the task scheduler (with the Vanilla Universe, due to constraints on the programs beyond my reach, so no checkpointing is involved), so DAG looks like a natural solution. However, some of the programs need to run on the same host. I could not find a reference on how to do this in the Condor manuals.

Example DAG file:

JOB  A  A.condor 
JOB  B  B.condor 
JOB  C  C.condor    
JOB  D  D.condor
PARENT A CHILD B C
PARENT B C CHILD D

I need to express that B and D need to be run on the same computer node, without breaking the parallel execution of B and C.

Thanks for your help.

Condor doesn't have any simple solutions, but there is at least one kludge that should work:

Have B leave some state behind on the execute node, probably in the form of a file, that says something like MyJobRanHere="UniqueIdentifier". Use the STARTD_CRON support to detect this and advertise it in the machine ClassAd. Have D use Requirements = (MyJobRanHere == "UniqueIdentifier"). As part of D's final cleanup, or perhaps in a new node E, remove the state. If you're running large numbers of jobs through, you'll probably need to clean out left-over state occasionally.
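A minimal sketch of the two pieces involved, assuming a marker script at /usr/local/libexec/advertise_marker that prints the attribute to stdout when B's marker file exists (the hook name, path, and period are illustrative, and the knob names should be checked against your version's manual):

# condor_config on the execute nodes: a Startd Cron hook that advertises the marker.
# The script prints a line such as:  MyJobRanHere = "UniqueIdentifier"
# whenever B's marker file is present, and nothing otherwise.
STARTD_CRON_JOBLIST = $(STARTD_CRON_JOBLIST) MARKER
STARTD_CRON_MARKER_EXECUTABLE = /usr/local/libexec/advertise_marker
STARTD_CRON_MARKER_PERIOD = 60s

# D.condor: only match the machine that advertises B's marker.
Requirements = (MyJobRanHere == "UniqueIdentifier") && ...your other requirements...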

I don't know the answer, but you should ask this question on the Condor Users mailing list. The folks who support the DAG functionality in Condor monitor it and will respond. See this page for subscription information. It's fairly low traffic.

It's generally fairly difficult to keep two jobs together on the same host in Condor without locking them to a specific host in advance, DAG or no DAG. I actually can't think of a really viable way to do this that would let B start before C or C start before B. If you were willing to enforce that B must always start before C, you could make part of the work that Job B does when it starts running be to modify the Requirements portion of Job C's ClassAd so that it contains a Machine == "<hostname>" clause, where <hostname> is the name of the machine B landed on. This would also require that Job C be submitted held, or not submitted at all until B was running, and B would also have to release it as part of its start-up work.
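A rough sketch of what B's start-up wrapper could run, assuming Job C was submitted held as cluster 123 and that the Condor command-line tools on the execute node can reach the schedd (the cluster id, the -name placeholder, and the hostname lookup are all illustrative):

# Pin the held Job C to this machine, then release it
# (add -name <your schedd> to both commands if needed).
condor_qedit 123.0 Requirements "(Machine == \"$(hostname -f)\")"
condor_release 123.0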

That's pretty complicated...

So I just had a thought: you could use Condor's dynamic startd/slots features and collapse your DAG to achieve what you want. In your DAG, where you currently have two separate nodes, B and C, you would collapse this down into one node B' that would run both B and C in parallel when it starts on a machine. As part of the job requirements you note that it needs 2 CPUs on a machine. Switch your startds to use the dynamic slot configuration so machines advertise all of their resources and not just statically allocated slots. Now you have B and C always running concurrently on one machine. There are some starvation issues with dynamic slots when you have a few multi-CPU jobs in a queue with lots of single-CPU jobs, but it's at least a more readily solved problem.
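For reference, a sketch of the two pieces this implies, assuming a 7.x startd with partitionable-slot support (knob names as I recall them, so double-check against your manual):

# condor_config on the execute nodes: one partitionable slot owning the whole machine.
SLOT_TYPE_1 = 100%
SLOT_TYPE_1_PARTITIONABLE = TRUE
NUM_SLOTS_TYPE_1 = 1

# Submit file for the collapsed node B': ask for two cores so B and C can run side by side.
request_cpus = 2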

Another option is to tag B' with a special job attribute:

+MultiCPUJob = True

And target it just at slot 1 on machines:

Requirements = SlotID == 1 &&  ...your other requirements...

And have a static slot startd policy that says, "If a job with MultiCPUJob = True tries to run on slot 1 on me, preempt any job that happens to be in slot 2 on this machine, because I know this job will need 2 cores/CPUs".
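A hedged sketch of such a policy, along the lines of the whole-machine-job recipes that circulate on the users list; the knob names may differ between versions, so treat it as a starting point rather than a drop-in config:

# Publish the job attribute into the slot ad and make it visible across slots.
STARTD_JOB_EXPRS = $(STARTD_JOB_EXPRS) MultiCPUJob
STARTD_SLOT_EXPRS = MultiCPUJob
# OR this into the existing PREEMPT policy: kick whatever is on slot 2
# when slot 1 is running a MultiCPUJob (assumes PREEMPT is already defined).
PREEMPT = ($(PREEMPT)) || (SlotID == 2 && Slot1_MultiCPUJob =?= True)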

This is inefficient, but can be done with any version of Condor past 6.8.x. I actually use this type of setup in my own statically partitioned farms, so if a job needs a machine all to itself for benchmarking, it can happen without reconfiguring machines.

If you're interested in knowing more about that preemption option, let me know and I can point you to some further configuration reading in the condor-user list archives.

The solution here is to use the fact that you can modify submit descriptions even while DAGMan is running, as long as DAGMan has not yet submitted the node. Assume a simple DAG of A -> B -> C. If you want all nodes to run on the same host, you can do the following:

  1. Define a POST script on node A.

  2. The POST script searches condor_history for the ClusterId of the completed node A. Something like condor_history -l -attribute LastRemoteHost -m1 $JOB_ID ... You'll need to clean up the output and whatnot, but you'll be left with the host that ran node A.

  3. The POST script then searches for and modifies the dependent job submit files, inserting a job requirement at the top of each submit file (see the sketch after this list). Just make sure you build your job requirements incrementally so that they pick up this new requirement if it is present.

  4. When the POST script completes, DAGMan will then look to submit ready nodes, of which in this example we have one: B. The submission of B will now be done with the new requirement you added in step 3, so that it will run on the same execute host as A.
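For illustration, a minimal sketch of the two pieces the script touches; the script name and hostname are made up, and the incremental requirement-building is whatever your submit files already do:

# In the DAG file: run the script after node A completes.
SCRIPT POST A pin_to_host.sh

# Line the script would insert at the top of B.condor before DAGMan submits it;
# the rest of B.condor then extends this Requirements expression incrementally.
Requirements = (Machine == "exec03.example.com")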

I do this currently with numerous jobs. It works great.
