Why does a distributed TensorFlow toy example take too long?

Question

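I am trying to run a toy example that does some matrix multiplication and addition using distributed TensorFlow.

My goal is to compute (A^n + B^n), where A[,] and B[,] are LxL matrices.

I use 2 machines on a public cloud: the first machine computes A^n, the second computes B^n, and the addition is then done back on the first machine.

When the machines have only CPUs, my script works great.
When both have a GPU, it fails to run in a reasonable time! It has a huge delay...

My question: what am I doing wrong in my script?

Note that on machine 2 ( task:1 ) I use server.join(), and I use machine 1 ( task:0 ) as the client in this in-graph setup.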

from __future__ import print_function   # must precede every other import
#------------------------------------------------------------------
from zmq import Stopwatch; aClk_E2E = Stopwatch(); aClk_E2E.start()
#------------------------------------------------------------------
import numpy as np
import tensorflow as tf
import datetime

IP_1 = '10.132.0.2';     port_1 = '2222'
IP_2 = '10.132.0.3';     port_2 = '2222'

cluster = tf.train.ClusterSpec( { "local": [ IP_1 + ":" + port_1,
                                             IP_2 + ":" + port_2
                                             ],
                                   }
                                )
server = tf.train.Server( cluster,
                          job_name   = "local",
                          task_index = 0
                          )
# server.join() # @machine2 ( task:1 )

n =    5
L = 1000  

def matpow( M, n ):
    if n < 1:                 # Abstract cases where n < 1
        return M
    else:
        return tf.matmul( M, matpow( M, n - 1 ) )

G = tf.Graph()

with G.as_default():
     with tf.device( "/job:local/task:1/cpu:0" ):
          c1 = []
          tB = tf.placeholder( tf.float32, [L, L] )     # tensor B placeholder
          with tf.device( "/job:local/task:1/gpu:0" ):
               c1.append( matpow( tB, n ) )

     with tf.device( "/job:local/task:0/cpu:0" ):
          c2 = []
          tA = tf.placeholder( tf.float32, [L, L] )     # tensor A placeholder
          with tf.device( "/job:local/task:0/gpu:0" ):
               c2.append( matpow( tA, n ) )
          sum2 = tf.add_n( c1 + c2 )
#---------------------------------------------------------<SECTION-UNDER-TEST>
t1_2 = datetime.datetime.now()
with tf.Session( "grpc://" + IP_1 + ":" + port_1, graph = G ) as sess:
     A = np.random.rand( L, L ).astype( 'float32' )
     B = np.random.rand( L, L ).astype( 'float32' )
     sess.run( sum2, { tA: A, tB: B, } )
t2_2 = datetime.datetime.now()
#---------------------------------------------------------<SECTION-UNDER-TEST>

#------------------------------------------------------------------
_ = aClk_E2E.stop()
#------------------------------------------------------------------
print( "Distributed Computation time: " + str(t2_2 - t1_2))
print( "Distributed Experiment  took: {0: > 16d} [us] End-2-End.".format( _ ) )

Answer 1

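Distributed computing is our new universe, or rather a set of parallel ones.

The first steps into this domain are always challenging. Certainties taken for granted from prior single-computer experience are lost, and there are many new challenges that have no counterpart in single-node process coordination: many new surprises from new orders of magnitude in distributed execution timing, and distributed coordination issues, if not outright deadlock and/or livelock blocking.

Thanks for adding some quantitative facts: ~15 seconds is "too long" for A[1000,1000]; B[1000,1000]; n=5. So far so good.

Would you mind adding the code changes proposed above and re-running the experiment on the same real infrastructure?

This will help the rest of this work get started ( WIP here ).

- Thanks in advance for running + posting the updated facts.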
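When re-running, it may help to report one-off costs (cluster and session setup, the first run) separately from repeated runs, since a single end-to-end number mixes them. The following is only a minimal sketch under stated assumptions: `time_phases`, `fake_setup` and `fake_run` are hypothetical names, and the stand-in functions would have to be replaced by the real session creation and `sess.run( ... )` call. It uses Python 3's `time.perf_counter`.

```python
import time

def time_phases(setup, run_once, repeats=3):
    """Time setup, the first run, and `repeats` later runs separately.

    `setup` and `run_once` are hypothetical callables standing in for
    session creation and a single sess.run( ... ) call.
    """
    t0 = time.perf_counter()
    state = setup()                      # one-off cost (e.g. session setup)
    t1 = time.perf_counter()
    run_once(state)                      # first run: may include warm-up cost
    t2 = time.perf_counter()
    later = []
    for _ in range(repeats):             # subsequent runs, timed one by one
        start = time.perf_counter()
        run_once(state)
        later.append(time.perf_counter() - start)
    return t1 - t0, t2 - t1, later

# Stand-ins so the sketch runs on its own; replace with the real workload.
def fake_setup():
    return {"calls": 0}

def fake_run(state):
    state["calls"] += 1
    return sum(i * i for i in range(10000))

setup_s, first_s, later_s = time_phases(fake_setup, fake_run)
print("setup: {:.6f} s, first run: {:.6f} s".format(setup_s, first_s))
print("later runs:", ["{:.6f} s".format(t) for t in later_s])
```

If the first run is far slower than the later ones, the delay is more likely a start-up cost than a per-multiplication cost.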

It is hard to continue at the moment with quantitatively supported statements; however, my gut suspicion is this one:

 def matpow( M, n ):
     return  M if ( n < 1 ) else tf.matmul( M, matpow( M, n - 1 ) )

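It uses recursion, which may go too deep for the GPU cross-compiler/assembler analysers. Given the scale of the tensors, the GPU SMXs, which are "fast" for math-dense kernel microcode, cannot keep the data in SMX-local SM_registers (latency of about 22 GPU_CLKs, or perhaps just 8 if smartly optimised to fetch from an LRU-aligned SM_L1 cache line). The data will have to spill into global_MEMORY, because each SM can hold far less than 1 KB in the latency-friendliest SM_registers for repetitive memory access, and matmul() never re-uses any cell of the matrix, so latency hiding will never cost less than one global_MEMORY access per cell, plus the [PSPACE] scaling. There you suddenly start to experience latency penalties of about 600+ GPU_CLKs.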
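One way to test this suspicion is to compare the recursive definition against a loop-based one. The sketch below is an assumption-laden stand-in: it uses plain Python lists instead of TensorFlow tensors so that it runs anywhere, and `matmul`, `matpow_recursive` and `matpow_loop` are illustrative names. It also shows that, with the `n < 1` base case as written, `n` multiplications are performed, so the result is M^(n+1) rather than M^n.

```python
def matmul(a, b):
    """Naive product of two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def matpow_recursive(m, n):
    """Same shape as the matpow() in the question: base case is n < 1."""
    return m if n < 1 else matmul(m, matpow_recursive(m, n - 1))

def matpow_loop(m, n):
    """Loop-based equivalent: the same chain of n multiplications."""
    result = m
    for _ in range(max(n, 0)):
        result = matmul(m, result)
    return result

# For M = [[1, 1], [0, 1]], the k-th power is [[1, k], [0, 1]].
M = [[1, 1], [0, 1]]
print(matpow_recursive(M, 5))   # [[1, 6], [0, 1]], i.e. M^6, not M^5
print(matpow_loop(M, 5))        # [[1, 6], [0, 1]]
```

In graph-mode TensorFlow the Python recursion runs while the graph is being built, so both forms should add the same `tf.matmul` ops; timing both on the real cluster would confirm or rule out the recursion as the cause.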

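Although this narrated HPC matmul() animation refers to a typical CPU / Lx-cache / memory hierarchy, the message is easy to see: any O(N^3) processing has to get awfully slow on a GPU once N is larger than what fits into SM_registers (and, as you can see, imagine all cache-friendliness being lost in a recursive matpow()).

GPU kernels get their best results for small-scale, static, SM-local convolutions, where this SMX locality permits good [Data : SMX-local SM_REG] alignment and no cross-SMX communication is needed. That is not the case in matmul() processing for anything above roughly 7 x 7 matrices, which may still fit onto the SM_REG silicon. Anything larger has to start super-smart stencil-alignment gymnastics if it strives to pay only the minimum necessary sum of GPU-local memIO latencies. The story goes on if ugly host2dev / dev2host IO also takes place, and there are many more places where execution performance gets uncontrollably poor.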
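The size argument can be put into rough numbers. The sketch below takes the "less than 1 KB" figure from above as an assumption, not a measured value, and further assumes float32 data and three resident matrices (two operands plus the result); `matrix_bytes` and `largest_square_dim` are hypothetical helper names.

```python
import math

FLOAT32_BYTES = 4

def matrix_bytes(dim, itemsize=FLOAT32_BYTES):
    """Bytes needed to hold one dim x dim matrix."""
    return dim * dim * itemsize

def largest_square_dim(budget_bytes, matrices=3, itemsize=FLOAT32_BYTES):
    """Largest dim such that `matrices` dim x dim matrices fit in the budget."""
    return math.isqrt(budget_bytes // (matrices * itemsize))

print(matrix_bytes(1000))          # 4000000 bytes for one 1000 x 1000 matrix
print(3 * matrix_bytes(7))         # 588 bytes for three 7 x 7 matrices
print(largest_square_dim(1024))    # 9: upper bound under a full 1 KB budget
```

Under these assumptions a single 1000 x 1000 operand is several thousand times larger than the budget, which is the gap the paragraph above is pointing at.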

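Compared with typical image-specialised kernels, even the latency costs of a single matmul( A, A ) suddenly become very demotivating. (Sure, there are advanced techniques for getting around this specialised-silicon limitation. Yet even if matmul() were a master of top-notch HPC block-matrix operations, it becomes a naive matmul() once a recursive call gets onto the stage. That kills even the smart tricks, as there is no [SPACE] for "stack"-ing the intermediate values, and the auto-generated kernel code will pay immense [TIME] penalties for it, even at scales as small as 1000x1000.)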
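For readers unfamiliar with the block-matrix operations mentioned above, here is a minimal sketch of the idea in plain Python. It is an illustration only, not what TensorFlow or any GPU library actually executes; `matmul_blocked` and the `block` parameter are hypothetical names. The point is that each small tile of the operands is reused many times while it is "hot", instead of streaming whole rows and columns for every output cell.

```python
def matmul_blocked(a, b, block=2):
    """Tiled matrix product of a (n x k) and b (k x m), lists of rows."""
    n, k, m = len(a), len(b), len(b[0])
    c = [[0] * m for _ in range(n)]
    for i0 in range(0, n, block):              # tile of output rows
        for j0 in range(0, m, block):          # tile of output columns
            for k0 in range(0, k, block):      # tile of the shared dimension
                for i in range(i0, min(i0 + block, n)):
                    for j in range(j0, min(j0 + block, m)):
                        acc = c[i][j]
                        for kk in range(k0, min(k0 + block, k)):
                            acc += a[i][kk] * b[kk][j]
                        c[i][j] = acc          # partial sum for this tile
    return c

A = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
print(matmul_blocked(A, A))
# [[30, 36, 42], [66, 81, 96], [102, 126, 150]]
```

The result is identical to the naive triple loop; only the order of memory accesses changes.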

   Category                     GPU
   |                            Hardware
   |                            Unit
   |                            |            Throughput
   |                            |            |               Execution
   |                            |            |               Latency
   |                            |            |               |                  PTX instructions                                                      Note 
   |____________________________|____________|_______________|__________________|_____________________________________________________________________|________________________________________________________________________________________________________________________
   Load_shared                  LSU          2               +  30              ld, ldu                                                               Note, .ss = .shared ; .vec and .type determine the size of load. Note also that we omit .cop since no cacheable in Ocelot
   Load_global                  LSU          2               + 600              ld, ldu, prefetch, prefetchu                                          Note, .ss = .global; .vec and .type determine the size of load. Note, Ocelot may not generate prefetch since no caches
   Load_local                   LSU          2               + 600              ld, ldu, prefetch, prefetchu                                          Note, .ss = .local; .vec and .type determine the size of load. Note, Ocelot may not generate prefetch since no caches
   Load_const                   LSU          2               + 600              ld, ldu                                                               Note, .ss = .const; .vec and .type determine the size of load
   Load_param                   LSU          2               +  30              ld, ldu                                                               Note, .ss = .param; .vec and .type determine the size of load
   |                            |                              
   Store_shared                 LSU          2               +  30              st                                                                    Note, .ss = .shared; .vec and .type determine the size of store
   Store_global                 LSU          2               + 600              st                                                                    Note, .ss = .global; .vec and .type determine the size of store
   Store_local                  LSU          2               + 600              st                                                                    Note, .ss = .local; .vec and .type determine the size of store
   Read_modify_write_shared     LSU          2               + 600              atom, red                                                             Note, .space = shared; .type determine the size
   Read_modify_write_global     LSU          2               + 600              atom, red                                                             Note, .space = global; .type determine the size
   |                            |                              
   Texture                      LSU          2               + 600              tex, txq, suld, sust, sured, suq
   |                            |                              
   Integer                      ALU          2               +  24              add, sub, add.cc, addc, sub.cc, subc, mul, mad, mul24, mad24, sad, div, rem, abs, neg, min, max, popc, clz, bfind, brev, bfe, bfi, prmt, mov
   |                            |                                                                                                                     Note, these integer inst. with type = { .u16, .u32, .u64, .s16, .s32, .s64 };
   |                            |                              
   Float_single                 ALU          2               +  24              testp, copysign, add, sub, mul, fma, mad, div, abs, neg, min, max     Note, these Float-single inst. with type = { .f32 };
   Float_double                 ALU          1               +  48              testp, copysign, add, sub, mul, fma, mad, div, abs, neg, min, max     Note, these Float-double inst. with type = { .f64 };
   Special_single               SFU          8               +  48              rcp, sqrt, rsqrt, sin, cos, lg2, ex2                                  Note, these special-single with type = { .f32 };
   Special_double               SFU          8               +  72              rcp, sqrt, rsqrt, sin, cos, lg2, ex2                                  Note, these special-double with type = { .f64 };
   |                                                           
   Logical                      ALU          2               +  24              and, or, xor, not, cnot, shl, shr
   Control                      ALU          2               +  24              bra, call, ret, exit
   |                                                           
   Synchronization              ALU          2               +  24              bar, membar, vote
   Compare & Select             ALU          2               +  24              set, setp, selp, slct
   |                                                           
   Conversion                   ALU          2               +  24              isspacep, cvta, cvt
   Miscellaneous                ALU          2               +  24              brkpt, pmevent, trap
   Video                        ALU          2               +  24              vadd, vsub, vabsdiff, vmin, vmax, vshl, vshr, vmad, vset

Answer 1: -1, 2017-08-18 10:55:54
