
How do I use distributed DNN training in TensorFlow?

Google released TensorFlow today.

I have been poking around, and I don't see anything in the code or API about training across a cluster of GPU servers.

Does it have distributed training functionality yet?

Updated:

The release occurred on 2/26/2016 and was announced by coauthor Derek Murray in the original issue here. It uses gRPC for inter-process communication.

Previous:

Before the update above, a distributed implementation of TensorFlow had not been released yet. Support for a distributed implementation was the topic of this issue, where coauthor Vijay Vasudevan wrote:

we are working on making a distributed implementation available, it's currently not in the initial release

and Jeff Dean later provided an update:

Our current internal distributed extensions are somewhat entangled with Google internal infrastructure, which is why we released the single-machine version first. The code is not yet in GitHub, because it has dependencies on other parts of the Google code base at the moment, most of which have been trimmed, but there are some remaining ones.

We realize that distributed support is really important, and it's one of the top features we're prioritizing at the moment.

It took us a few months, but today marks the release of the initial distributed TensorFlow runtime. This includes support for multiple machines, each with multiple GPUs, with communication provided by gRPC.

The current version includes the necessary backend components so that you can assemble a cluster manually and connect to it from a client program. More details are available in the readme.
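As an illustration, here is a minimal sketch of assembling such a cluster with the tf.train.ClusterSpec and tf.train.Server API from the TF 1.x line; the hostnames, ports, and the toy computation are hypothetical placeholders, not part of the release notes:

```python
import tensorflow as tf

# Cluster definition shared by every process (hostnames are hypothetical).
cluster = tf.train.ClusterSpec({
    "ps": ["ps0.example.com:2222"],
    "worker": ["worker0.example.com:2222", "worker1.example.com:2222"],
})

# On each machine, start the server for that machine's own task,
# e.g. worker 0:
server = tf.train.Server(cluster, job_name="worker", task_index=0)

# A client program pins ops to jobs in the cluster...
with tf.device("/job:ps/task:0"):
    w = tf.Variable(tf.zeros([10, 1]))
with tf.device("/job:worker/task:0"):
    x = tf.placeholder(tf.float32, shape=[None, 10])
    y = tf.matmul(x, w)

# ...and connects to a worker's gRPC endpoint to run them.
with tf.Session("grpc://worker0.example.com:2222") as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run(y, feed_dict={x: [[1.0] * 10]}))
```

Every process shares the same cluster definition but starts a server for its own job_name/task_index; the client can then connect to any worker and run ops that are pinned to other tasks.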

Update

As you may have noticed, TensorFlow has supported distributed DNN training for quite some time now. Please refer to its official website for details.
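For reference, here is a hedged sketch of what that looks like with the modern TF 2.x tf.distribute.MultiWorkerMirroredStrategy API; the model, the synthetic data, and the TF_CONFIG contents are my own placeholders, not from the original answer:

```python
import numpy as np
import tensorflow as tf

# Each worker learns its role from the TF_CONFIG environment variable, e.g.
# {"cluster": {"worker": ["host1:12345", "host2:12345"]},
#  "task": {"type": "worker", "index": 0}}
strategy = tf.distribute.MultiWorkerMirroredStrategy()

with strategy.scope():
    # Variables created inside the scope are mirrored across all workers,
    # and gradients are all-reduced between them at each step.
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu", input_shape=(10,)),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")

# Synthetic stand-in for real training data.
xs = np.random.rand(128, 10).astype("float32")
ys = np.random.rand(128, 1).astype("float32")
model.fit(xs, ys, batch_size=32, epochs=2)
```

Running the same script on every machine listed in TF_CONFIG is enough; the strategy handles gradient synchronization between the workers.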

=========================================================================

Previous

No, it doesn't support distributed training yet, which is a little disappointing. But I don't think it would be difficult to extend from a single machine to multiple machines. Compared to other open-source libraries, like Caffe, TF's dataflow graph structure is more suitable for cross-machine tasks.
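To make that claim concrete, here is a small TF 1.x-style sketch (my own example, not from the original answer) of how the dataflow graph carries explicit device placements; extending these placement strings from local devices to remote jobs is essentially what the later distributed runtime did:

```python
import tensorflow as tf

# Every op in the graph carries an explicit device placement, so the
# runtime can partition the graph across devices.
with tf.device("/gpu:0"):
    a = tf.constant([[1.0, 2.0]])
with tf.device("/gpu:1"):
    b = tf.constant([[3.0], [4.0]])

# The matmul may live on yet another device; TensorFlow inserts the
# required transfers between devices automatically.
with tf.device("/cpu:0"):
    c = tf.matmul(a, b)

# allow_soft_placement falls back gracefully on machines without two GPUs.
with tf.Session(config=tf.ConfigProto(allow_soft_placement=True)) as sess:
    print(sess.run(c))
```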
