
How do I use distributed DNN training in TensorFlow?

Google released TensorFlow today.

I have been poking around in the code, and I don't see anything in the code or API about training across a cluster of GPU servers.

Does it have distributed training functionality yet?

Updated:

The release occurred on 2/26/2016, was announced by coauthor Derek Murray in the original issue here, and uses gRPC for inter-process communication.

Previous:

Before the update above, a distributed implementation of TensorFlow had not been released yet. Support for a distributed implementation was the topic of this issue, where coauthor Vijay Vasudevan wrote:

we are working on making a distributed implementation available, it's currently not in the initial release

and Jeff Dean later provided an update:

Our current internal distributed extensions are somewhat entangled with Google internal infrastructure, which is why we released the single-machine version first. The code is not yet in GitHub, because it has dependencies on other parts of the Google code base at the moment, most of which have been trimmed, but there are some remaining ones.

We realize that distributed support is really important, and it's one of the top features we're prioritizing at the moment.

It took us a few months, but today marks the release of the initial distributed TensorFlow runtime. This includes support for multiple machines, each with multiple GPUs, with communication provided by gRPC.

The current version includes the necessary backend components so that you can assemble a cluster manually and connect to it from a client program. More details are available in the readme.
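To illustrate the manual assembly described above: a cluster is specified as a map from job names to lists of gRPC endpoints, and every process in the cluster is started with that same map plus its own job name and task index (in the TensorFlow API, this pair is what `tf.train.ClusterSpec` and `tf.train.Server` are built from). A minimal stdlib-only sketch, with hypothetical hostnames and ports:

```python
# Hypothetical cluster layout: one parameter-server job and one worker job.
# Each entry is a host:port address that the gRPC-based runtime listens on.
cluster = {
    "ps": ["ps0.example.com:2222"],
    "worker": ["worker0.example.com:2222", "worker1.example.com:2222"],
}

# Every process receives the same map plus its own role. In TensorFlow this
# corresponds roughly to:
#   server = tf.train.Server(cluster, job_name="worker", task_index=0)
def grpc_target(cluster, job_name, task_index):
    """Return the gRPC address a given (job, task) pair serves on."""
    return "grpc://" + cluster[job_name][task_index]

print(grpc_target(cluster, "worker", 1))  # grpc://worker1.example.com:2222
```

A client program would then open a session against one of these `grpc://` targets to drive the computation.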

Update

As you may have noticed, TensorFlow has supported distributed DNN training for quite some time now. Please refer to its official website for details.
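In current TensorFlow, multi-worker training typically reads the cluster layout from a `TF_CONFIG` environment variable consumed by `tf.distribute.MultiWorkerMirroredStrategy`. A sketch of constructing it with the standard library only (the worker addresses here are hypothetical):

```python
import json
import os

# Hypothetical two-worker cluster. Each worker process sets the same
# "cluster" map and its own "task" entry before creating the strategy.
tf_config = {
    "cluster": {"worker": ["10.0.0.1:12345", "10.0.0.2:12345"]},
    "task": {"type": "worker", "index": 0},  # this process is worker 0
}
os.environ["TF_CONFIG"] = json.dumps(tf_config)

# A worker would then build its model under the strategy's scope:
#   strategy = tf.distribute.MultiWorkerMirroredStrategy()
#   with strategy.scope():
#       model = build_model()
print(os.environ["TF_CONFIG"])
```

The same script is launched on every machine, with only the `task` index differing per worker.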

=========================================================================

Previous

No, it doesn't support distributed training yet, which is a little disappointing. But I don't think it is difficult to extend from a single machine to multiple machines. Compared to other open source libraries, like Caffe, TF's data graph structure is more suitable for cross-machine tasks.

