HTCondor - Partitionable slot not working

I am following the Center for High Throughput Computing tutorial and the Introduction to Configuration page on the HTCondor website to set up a partitionable slot. Before changing any configuration, I run

condor_status

and get the following output.

I update the file 00-minicondor in /etc/condor/config.d by adding the following lines at the end of the file:

NUM_SLOTS = 1 
NUM_SLOTS_TYPE_1 = 1
SLOT_TYPE_1 = cpus=4
SLOT_TYPE_1_PARTITIONABLE = TRUE

and reconfigure

 sudo condor_reconfig

Now with

condor_status

I get this output, as expected. Next, I run the following command to check that everything is fine

condor_status -af Name Slotype Cpus

and get

slot1@ip-172-31-54-214.ec2.internal undefined 1

instead of the output I would expect:

slot1@ip-172-31-54-214.ec2.internal Partitionable 4 61295

Moreover, when I try to submit a job that asks for more than 1 CPU, no resources are allocated for it as they should be; the job waits forever.

I don't know whether I made a mistake during the installation process or what else could be happening. I would really appreciate any help!

EXTRA INFO: In case it helps, I installed HTCondor with the command

curl -fsSL https://get.htcondor.org | sudo /bin/bash -s -- --no-dry-run

on Ubuntu 18.04, running on an old p2.xlarge instance (it has 4 cores).

UPDATE: After rebooting the whole thing, it seems to be working. I can now submit jobs with different CPU requests and they start properly.
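A likely reason the reboot helped, stated as an assumption rather than something verified on this machine: the startd reads its slot type definitions when it starts, so condor_reconfig alone may not apply changes to NUM_SLOTS_TYPE_1 or SLOT_TYPE_1. A minimal sketch of restarting only the HTCondor daemons instead of the whole instance (the guard and the restarted variable are illustrative additions):

```shell
# Restart the HTCondor daemons so the startd re-reads its slot definitions.
# Guarded so the snippet does nothing on a machine without HTCondor.
if command -v condor_restart >/dev/null 2>&1; then
    sudo condor_restart
    restarted=yes
else
    restarted=no
fi
echo "restart attempted: $restarted"
```

After the daemons come back, condor_status should list the slot again.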

The only issue that persists is that the memory allocation is not displayed properly, for example:

[image: in this case]

But in reality it is allocating enough memory for the job (in this case around 12 GB).

If I run condor_status -af Name Slotype Cpus again, I still get something I should not:

[image: undefined issue]

But at least it is showing the correct number of CPUs (even if it still says undefined).
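A possible explanation for the undefined column, offered as a guess rather than a confirmed diagnosis: the slot ClassAd attribute is SlotType (with two t's), so Slotype names no attribute and is printed as undefined. The expected output quoted earlier has four fields, which suggests the tutorial command also prints Memory; that part is an assumption. A minimal sketch (the ATTRS variable and the guard are illustrative):

```shell
# Attributes to print; SlotType is spelled with two t's.
# Memory is assumed to be the fourth column of the expected output.
ATTRS="Name SlotType Cpus Memory"

if command -v condor_status >/dev/null 2>&1; then
    # $ATTRS is left unquoted on purpose so each name becomes its own argument.
    condor_status -af $ATTRS
else
    echo "condor_status not found on PATH" >&2
fi
```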

What is the output of condor_q -better when the job is idle?
