How resilient is reporting to Trains server?

How would Trains go about sending any missing data to the server in the following scenarios?

  • The Internet connection breaks temporarily while running an experiment
  • The Internet connection breaks and doesn't come back before the experiment ends (is there any manual way to send all the data that was missed?)
  • The machine running Trains server resets in the middle of an experiment

Disclaimer: I'm part of the allegro.ai Trains team

  • Trains will automatically retry sending logs, basically forever. The logs/metrics are sent in a background thread, so this should not interfere with execution. You can control the retry frequency by adjusting the sdk.network.iteration.retry_backoff_factor_sec backoff parameter in your ~/trains.conf file.
  • The experiment will try to flush all metrics to the backend when it ends, i.e. the process will wait at exit until all metrics are sent. This means that if the connection was dropped, it will keep retrying until the connection is up again. If the experiment was aborted manually, there is no way to capture/resend the lost metric reports. That said, offline mode was introduced in the new 0.16 version. This way one can run the entire experiment offline, then report all logs/metrics/artifacts later.
  • The Trains-Server machine is fully stateless (the states themselves are stored in the databases on the machine). This means that, from the experiment's perspective, the connection was dropped for a few minutes and then became available again. To your question: if the Trains-Server restarts, it is transparent to all experiments. They continue as usual and no reports are lost.
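
The behaviour described above (reports queued and sent from a background thread, retried until the server is reachable, and flushed before the process ends) can be pictured with a small standard-library sketch. This is not Trains's actual implementation: `BackgroundReporter`, `make_flaky_send`, and the doubling backoff are all hypothetical names and choices made for illustration only.

```python
import queue
import threading
import time


class BackgroundReporter:
    """Hypothetical sketch: queue reports and send them from a background
    thread, retrying each one until it succeeds."""

    def __init__(self, send, backoff_sec=1.0, max_backoff_sec=60.0):
        self._send = send  # callable that raises ConnectionError on failure
        self._backoff_sec = backoff_sec
        self._max_backoff_sec = max_backoff_sec
        self._queue = queue.Queue()
        # Daemon thread: it never blocks interpreter shutdown by itself.
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def report(self, item):
        """Enqueue a report and return immediately (caller is not blocked)."""
        self._queue.put(item)

    def flush(self):
        """Block until every queued report has been sent."""
        self._queue.join()

    def _run(self):
        while True:
            item = self._queue.get()
            delay = self._backoff_sec
            while True:  # retry the same item "basically forever"
                try:
                    self._send(item)
                    break
                except ConnectionError:
                    time.sleep(delay)
                    # Assumed policy: double the wait, up to a cap.
                    delay = min(delay * 2, self._max_backoff_sec)
            self._queue.task_done()


def make_flaky_send(failures, sink):
    """Return a send() that fails `failures` times (a simulated outage),
    then appends every item to `sink`."""
    remaining = [failures]

    def send(item):
        if remaining[0] > 0:
            remaining[0] -= 1
            raise ConnectionError("simulated network outage")
        sink.append(item)

    return send
```

In this sketch, calling `reporter.flush()` at the end of a run (or registering it with `atexit.register`) approximates the "wait at exit until all metrics are sent" behaviour. A manually killed process never reaches that call, which is consistent with the answer's point that reports lost to a manual abort cannot be recovered.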

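For the retry setting mentioned in the answer, the fragment below shows where sdk.network.iteration.retry_backoff_factor_sec would sit in ~/trains.conf. The nesting follows directly from the dotted parameter name. The value 10 is an assumed placeholder, not a recommendation, and the rest of the file is omitted.

```
sdk {
    network {
        iteration {
            # Backoff factor (in seconds) used between retries; value is illustrative
            retry_backoff_factor_sec: 10
        }
    }
}
```
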
