
How do I use distributed DNN training in TensorFlow?

Google released TensorFlow today.

I have been poking around, and I don't see anything in the code or API about training across a cluster of GPU servers.

Does it have distributed training functionality yet?

Updated:

The release occurred on 2/26/2016 and was announced by coauthor Derek Murray in the original issue here. It uses gRPC for inter-process communication.

Previous:

Before the update above, a distributed implementation of TensorFlow had not been released yet. Support for a distributed implementation was the topic of this issue, where coauthor Vijay Vasudevan wrote:

we are working on making a distributed implementation available, it's currently not in the initial release

and Jeff Dean later provided an update:

Our current internal distributed extensions are somewhat entangled with Google internal infrastructure, which is why we released the single-machine version first. The code is not yet in GitHub, because it has dependencies on other parts of the Google code base at the moment, most of which have been trimmed, but there are some remaining ones.

We realize that distributed support is really important, and it's one of the top features we're prioritizing at the moment.

It took us a few months, but today marks the release of the initial distributed TensorFlow runtime. This includes support for multiple machines, each with multiple GPUs, with communication provided by gRPC.

The current version includes the necessary backend components so that you can assemble a cluster manually and connect to it from a client program. More details are available in the readme.
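As an illustration, here is a minimal sketch of assembling such a cluster with the tf.train.ClusterSpec and tf.train.Server API from the TF 1.x line; the hostnames, ports, and the toy computation are hypothetical placeholders, not part of the release notes:

```python
import tensorflow as tf

# Cluster definition shared by every process (hostnames are hypothetical).
cluster = tf.train.ClusterSpec({
    "ps": ["ps0.example.com:2222"],
    "worker": ["worker0.example.com:2222", "worker1.example.com:2222"],
})

# On each machine, start the server for that machine's own task,
# e.g. worker 0:
server = tf.train.Server(cluster, job_name="worker", task_index=0)

# A client program pins ops to jobs in the cluster...
with tf.device("/job:ps/task:0"):
    w = tf.Variable(tf.zeros([10, 1]))
with tf.device("/job:worker/task:0"):
    x = tf.placeholder(tf.float32, shape=[None, 10])
    y = tf.matmul(x, w)

# ...and connects to a worker's gRPC endpoint to run them.
with tf.Session("grpc://worker0.example.com:2222") as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run(y, feed_dict={x: [[1.0] * 10]}))
```

Every process shares the same cluster definition but starts a server for its own job_name/task_index; the client can then connect to any worker and run ops that are pinned to other tasks.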

Update

As you may have noticed, TensorFlow has supported distributed DNN training for quite some time now. Please refer to its official website for details.
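For reference, here is a hedged sketch of what that looks like with the modern TF 2.x tf.distribute.MultiWorkerMirroredStrategy API; the model, the synthetic data, and the TF_CONFIG contents are my own placeholders, not from the original answer:

```python
import numpy as np
import tensorflow as tf

# Each worker learns its role from the TF_CONFIG environment variable, e.g.
# {"cluster": {"worker": ["host1:12345", "host2:12345"]},
#  "task": {"type": "worker", "index": 0}}
strategy = tf.distribute.MultiWorkerMirroredStrategy()

with strategy.scope():
    # Variables created inside the scope are mirrored across all workers,
    # and gradients are all-reduced between them at each step.
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu", input_shape=(10,)),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")

# Synthetic stand-in for real training data.
xs = np.random.rand(128, 10).astype("float32")
ys = np.random.rand(128, 1).astype("float32")
model.fit(xs, ys, batch_size=32, epochs=2)
```

Running the same script on every machine listed in TF_CONFIG is enough; the strategy handles gradient synchronization between the workers.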

=========================================================================

Previous

No, it doesn't support distributed training yet, which is a little disappointing. But I don't think it would be difficult to extend from a single machine to multiple machines. Compared to other open-source libraries, like Caffe, TF's dataflow graph structure is more suitable for cross-machine tasks.
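To make that claim concrete, here is a small TF 1.x-style sketch (my own example, not from the original answer) of how the dataflow graph carries explicit device placements; extending these placement strings from local devices to remote jobs is essentially what the later distributed runtime did:

```python
import tensorflow as tf

# Every op in the graph carries an explicit device placement, so the
# runtime can partition the graph across devices.
with tf.device("/gpu:0"):
    a = tf.constant([[1.0, 2.0]])
with tf.device("/gpu:1"):
    b = tf.constant([[3.0], [4.0]])

# The matmul may live on yet another device; TensorFlow inserts the
# required transfers between devices automatically.
with tf.device("/cpu:0"):
    c = tf.matmul(a, b)

# allow_soft_placement falls back gracefully on machines without two GPUs.
with tf.Session(config=tf.ConfigProto(allow_soft_placement=True)) as sess:
    print(sess.run(c))
```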
