
How resilient is reporting to the Trains server?

How would Trains go about sending any missing data to the server in the following scenarios?

  • Internet connection breaks temporarily while running an experiment
  • Internet connection breaks and doesn't come back before the experiment ends (any manual way to send all the data that was missed?)
  • The machine running Trains server resets in the middle of an experiment

Disclaimer: I'm part of the allegro.ai Trains team

  • Trains retries sending logs automatically, essentially forever. Logs and metrics are sent from a background thread, so the retries should not interfere with execution. To control the retry frequency, adjust the backoff through the sdk.network.iteration.retry_backoff_factor_sec parameter in your ~/trains.conf file (see example here).
  • When the experiment ends, Trains tries to flush all metrics to the backend, i.e. the process waits at exit until all metrics are sent. If the connection was dropped, it keeps retrying until the connection is back. If the experiment was aborted manually, there is no way to capture or resend the lost metric reports. That said, version 0.16 introduced an offline mode: you can run the entire experiment offline and report all logs, metrics, and artifacts later.
  • The Trains-Server itself is stateless (all state is stored in the databases on the machine). From the experiment's perspective, a server restart looks like the connection dropping for a few minutes and then becoming available again. So if the Trains-Server restarts, it is transparent to all experiments: they continue as usual and no reports are lost.
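
The behavior described above (reports queued in a background thread, retried with a backoff until the server is reachable, and flushed before the process exits) can be illustrated with a small standalone sketch. This is not Trains' actual implementation: `BackgroundReporter`, `FlakyServer`, and the `backoff_factor_sec` argument are hypothetical names made up for the illustration, and the doubling backoff is an assumption about how such a factor might be applied.

```python
import queue
import threading
import time


class FlakyServer:
    """Hypothetical stand-in for a server that is unreachable for a while."""

    def __init__(self, failures):
        self.failures_left = failures  # number of calls that will fail first
        self.received = []

    def receive(self, item):
        if self.failures_left > 0:
            self.failures_left -= 1
            raise ConnectionError("server unreachable")
        self.received.append(item)


class BackgroundReporter:
    """Toy reporter: queues items and sends them from a worker thread,
    retrying each item until it is delivered (an illustration only)."""

    def __init__(self, send, backoff_factor_sec=0.01, max_backoff_sec=0.1):
        self._send = send  # callable that raises ConnectionError on failure
        self._backoff = backoff_factor_sec
        self._max_backoff = max_backoff_sec
        self._queue = queue.Queue()
        worker = threading.Thread(target=self._run, daemon=True)
        worker.start()

    def report(self, item):
        # Returns immediately; the caller is never blocked by the network.
        self._queue.put(item)

    def flush(self):
        # Blocks until every queued item has been delivered, similar in
        # spirit to waiting at exit for all metrics to be sent.
        self._queue.join()

    def _run(self):
        while True:
            item = self._queue.get()
            delay = self._backoff
            while True:
                try:
                    self._send(item)
                    break
                except ConnectionError:
                    time.sleep(delay)
                    # Assumed policy: double the wait, up to a cap.
                    delay = min(delay * 2, self._max_backoff)
            self._queue.task_done()


if __name__ == "__main__":
    server = FlakyServer(failures=3)
    reporter = BackgroundReporter(server.receive)
    for step in range(5):
        reporter.report(step)
    reporter.flush()  # waits until the "connection" is back and all is sent
    print(server.received)
```

Because a single worker retries each item before moving on to the next, reports arrive in the order they were queued, and a temporary outage only delays delivery. A process killed before `flush()` returns would lose whatever is still in the queue, which mirrors the manual-abort case described above.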
