
TensorFlow MirroredStrategy and Horovod Distribution Strategy

I am trying to understand the basic differences between TensorFlow's MirroredStrategy and Horovod's distribution strategy.

From the documentation and from investigating the source code, I found that Horovod ( https://github.com/horovod/horovod ) uses the Message Passing Interface (MPI) to communicate between multiple nodes. Specifically, it uses the allreduce and allgather operations of MPI.
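For reference, here is a minimal sketch (my own, not from the linked repository) of how Horovod hooks its MPI/Gloo allreduce into an ordinary Keras training setup; hvd.init(), hvd.local_rank(), hvd.size() and hvd.DistributedOptimizer are documented Horovod APIs, while the toy model and learning rate are placeholders:

import tensorflow as tf
import horovod.tensorflow.keras as hvd

hvd.init()  # one process per GPU, usually launched with horovodrun/mpirun

# Pin each process to a single local GPU.
gpus = tf.config.list_physical_devices('GPU')
if gpus:
    tf.config.set_visible_devices(gpus[hvd.local_rank()], 'GPU')

model = tf.keras.Sequential([tf.keras.layers.Dense(10)])
opt = tf.keras.optimizers.SGD(0.01 * hvd.size())  # scale LR with worker count

# DistributedOptimizer averages gradients across all processes with
# allreduce before they are applied, i.e. synchronous data parallelism.
opt = hvd.DistributedOptimizer(opt)
model.compile(loss='mse', optimizer=opt)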

From my observation (I may be wrong), MirroredStrategy also uses an all-reduce algorithm ( https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/distribute ).
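As a point of comparison, a minimal sketch of the equivalent single-machine setup with MirroredStrategy (assuming the current tf.distribute API, which replaced the contrib module linked above); the toy model is again a placeholder:

import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()  # one replica per visible GPU
print('Number of replicas in sync:', strategy.num_replicas_in_sync)

with strategy.scope():
    model = tf.keras.Sequential([tf.keras.layers.Dense(10)])
    model.compile(loss='mse', optimizer=tf.keras.optimizers.SGD(0.01))

# model.fit(...) now runs one model copy per GPU and all-reduces the
# gradients each step, the same synchronous data-parallel pattern.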

Both of them use a data-parallel, synchronous training approach, so I am a bit confused about how they differ. Is the difference only in the implementation, or are there other (theoretical) differences?

And how does the performance of MirroredStrategy compare to Horovod?

MirroredStrategy has its own all-reduce algorithm, which uses remote procedure calls (gRPC) under the hood.
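To illustrate (this sketch is mine, not part of the original answer): in the current tf.distribute API, the all-reduce implementation behind MirroredStrategy is pluggable via the cross_device_ops argument, with NcclAllReduce and HierarchicalCopyAllReduce as documented options:

import tensorflow as tf

# NCCL ring all-reduce between the local GPUs (the usual choice on Linux).
nccl_strategy = tf.distribute.MirroredStrategy(
    cross_device_ops=tf.distribute.NcclAllReduce())

# Copy-based alternative for setups where NCCL is unavailable.
copy_strategy = tf.distribute.MirroredStrategy(
    cross_device_ops=tf.distribute.HierarchicalCopyAllReduce())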

As you mentioned, Horovod uses MPI/Gloo to communicate between multiple processes.

Regarding performance, one of my colleagues performed experiments using 4 Tesla V100 GPUs with the code from here. The results suggested that three settings work best: replicated with all_reduce_spec=nccl, collective_all_reduce with a properly tuned allreduce_merge_scope (e.g. 32), and horovod. I did not see significant differences among these three.
