
Tracking separate train/test processes with Trains

In my setup, I run a script that trains a model and starts generating checkpoints. Another script watches for new checkpoints and evaluates them. The scripts run in parallel, so evaluation is just a step behind training.

What's the right Trains configuration to support this scenario?

Disclaimer: I'm part of the allegro.ai Trains team.

Do you have two experiments, one for testing and one for training?

If you do have two experiments, then I would make sure the models are logged in both of them (which is automatic if they are stored on the same shared folder / S3 bucket / etc.). Then you can quickly see the performance of each one.
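For reference, here is a minimal sketch of that two-experiment setup, assuming the checkpoints are written to shared storage (the project name, task names and paths are placeholders):

from trains import Task

# train.py - its own experiment; output_uri points at the shared storage,
# so checkpoints saved there should be picked up as output models of this task
train_task = Task.init(project_name='my_project', task_name='train',
                       output_uri='/mnt/shared/checkpoints')

# evaluate.py - a second experiment in the same project; models it loads from
# the shared folder should then be logged against this task as well
eval_task = Task.init(project_name='my_project', task_name='evaluate')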

Another option is sharing the same experiment: the second process adds its reports to the original experiment, which means you somehow have to pass the experiment ID to it. Then you can do:

from trains import Task

# Attach to the training experiment by its task ID and report into it
task = Task.get_task(task_id='training_task_id')
task.get_logger().report_scalar('title', 'loss', value=0.4, iteration=1)

EDIT: Are the two processes always launched together, or is the checkpoint test general-purpose code?

EDIT2:

Let's assume you have a main script training a model. This experiment has a unique task ID:

my_uid = Task.current_task().id

Let's also assume you have a way to pass it to your second process (if this is an actual sub-process, it inherits the OS environment variables, so you could do os.environ['MY_TASK_ID'] = my_uid).
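As a rough sketch of that hand-off (the child script name and the environment variable name are just placeholders), the training script could export its task ID before spawning the evaluation process:

import os
import subprocess
from trains import Task

# Expose this training task's ID to the child process through the environment
env = os.environ.copy()
env['MY_TASK_ID'] = Task.current_task().id

# Launch the evaluation watcher as a sub-process; it inherits MY_TASK_ID
subprocess.Popen(['python', 'evaluate.py'], env=env)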

Then in the evaluation script you could report directly into the main training Task like so:

import os
from trains import Task

train_task = Task.get_task(task_id=os.environ['MY_TASK_ID'])
train_task.get_logger().report_scalar('title', 'loss', value=0.4, iteration=1)
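Putting the pieces together, the evaluation side could look roughly like this; it assumes checkpoints land in a shared folder and evaluate_checkpoint() is a stand-in for your own evaluation code:

import os
import time
import glob
from trains import Task

# Attach to the training task whose ID was passed through the environment
train_task = Task.get_task(task_id=os.environ['MY_TASK_ID'])
logger = train_task.get_logger()

seen = set()
while True:
    # Look for checkpoints that have not been evaluated yet
    for ckpt in sorted(glob.glob('/mnt/shared/checkpoints/*.ckpt')):
        if ckpt in seen:
            continue
        seen.add(ckpt)
        loss, iteration = evaluate_checkpoint(ckpt)  # your own evaluation code
        # Report into the *training* task so train and eval curves live side by side
        logger.report_scalar('eval', 'loss', value=loss, iteration=iteration)
    time.sleep(30)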

@MichaelLitvin, we had the same issue, and we also had the same names for everything we logged in train and test, since it comes from the same code (obviously). In order to avoid the train/test mess in Trains' plots, we modified tensorflow_bind.py to add a different prefix for the "train" and "validation" streams. Trains' bugfix was adding a logdir name (which was not that clear for us).

*This was done 1-2 years ago, so it might be redundant now.

Cheers, Dagan
