ClearML multiple tasks in single script changes logged value names
I trained multiple models with different configurations for a custom hyperparameter search. I use pytorch_lightning and its logging (TensorboardLogger). When running my training script after Task.init(), ClearML auto-creates a Task and connects the logger output to the server.
For each training stage (train, val and test) I log the following scalars at each epoch: loss, acc and iou.
When I have multiple configurations, e.g. networkA and networkB, the first training logs its values to loss, acc and iou, but the second to networkB:loss, networkB:acc and networkB:iou. This makes the values incomparable.
My training loop with Task initialization looks like this:
names = ['networkA', 'networkB']
for name in names:
    task = Task.init(project_name="NetworkProject", task_name=name)
    pl_train(name)
    task.close()
The method pl_train is a wrapper for the whole training with PyTorch Lightning. No ClearML code is inside this method.
Do you have any hint on how to properly use a loop in a single script with completely separated tasks?
Edit: The ClearML version was 0.17.4. The issue is fixed in the main branch.
Disclaimer: I'm part of the ClearML (formerly Trains) team.
pytorch_lightning is creating a new Tensorboard for each experiment. When ClearML logs the TB scalars and captures the same scalar being re-sent, it adds a prefix so that reporting the same metric does not overwrite the previous one. A good example would be reporting the loss scalar in the training phase vs the validation phase (producing "loss" and "validation:loss").

It might be that the task.close() call does not clear the previous logs, so it "thinks" this is the same experiment, hence adding the prefix networkB to the loss. As long as you are closing the Task after training is completed, you should have all experiments log with the same metric/variant (title/series). I suggest opening a GitHub issue; this should probably be considered a bug.
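To make the naming collision concrete, here is a minimal pure-Python sketch of the prefixing behaviour described above. This is a hypothetical illustration, not ClearML's actual implementation: it only models the idea that a scalar title which was already seen (because the previous run's logs were not cleared) gets a task-name prefix instead of overwriting the old series.

```python
def report_scalar(seen, title, prefix):
    """Return the series name a scalar would be stored under.

    If `title` was already reported in a session that was not fully
    cleared, a prefix is added so the earlier series is not overwritten.
    (Hypothetical sketch of the behaviour, not ClearML internals.)
    """
    name = title if title not in seen else f"{prefix}:{title}"
    seen.add(name)
    return name

seen = set()
# First run (networkA): scalars keep their plain names.
first = [report_scalar(seen, t, "networkA") for t in ("loss", "acc", "iou")]
# Second run (networkB): the previous names were never cleared, so the
# titles collide and get prefixed, producing incomparable series.
second = [report_scalar(seen, t, "networkB") for t in ("loss", "acc", "iou")]
print(first)   # ['loss', 'acc', 'iou']
print(second)  # ['networkB:loss', 'networkB:acc', 'networkB:iou']
```

If task.close() fully cleared the previous logs (as it does after the fix in the main branch), the second run would start from an empty `seen` set and both runs would log to the same plain series names.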