
Why does a distributed TensorFlow toy-example take so long?

I tried to run a toy example of matrix multiplication and addition using distributed TensorFlow.

My goal was to compute (A^n + B^n) where A[,] and B[,] are LxL matrices.

I used 2 machines on the public cloud to compute A^n on one machine and B^n on the second machine, then the addition again on the first machine.

When the machines had CPUs only, my script worked great.
When both had a GPU, it failed to run in a reasonable time! It had a huge latency...

My question - what did I do wrong in my script?

Note that for machine 2 ( task:1 ) I used server.join(), and I used machine 1 ( task:0 ) as the client in this in-graph replication setup.
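For completeness, the machine-2 ( task:1 ) side is just the serving process ( a minimal sketch, mirroring the same ClusterSpec used in the script below ):

import tensorflow as tf

IP_1 = '10.132.0.2';     port_1 = '2222'
IP_2 = '10.132.0.3';     port_2 = '2222'

cluster = tf.train.ClusterSpec( { "local": [ IP_1 + ":" + port_1,
                                             IP_2 + ":" + port_2
                                             ],
                                   }
                                )
server  = tf.train.Server( cluster,
                           job_name   = "local",
                           task_index = 1        # task:1 @machine2
                           )
server.join()                                    # block here and serve the graph sent by the client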

#------------------------------------------------------------------
from zmq import Stopwatch; aClk_E2E = Stopwatch(); aClk_E2E.start()
#------------------------------------------------------------------
from __future__ import print_function
import numpy as np
import tensorflow as tf
import datetime

IP_1 = '10.132.0.2';     port_1 = '2222'
IP_2 = '10.132.0.3';     port_2 = '2222'

cluster = tf.train.ClusterSpec( { "local": [ IP_1 + ":" + port_1,
                                             IP_2 + ":" + port_2
                                             ],
                                   }
                                )
server = tf.train.Server( cluster,
                          job_name   = "local",
                          task_index = 0
                          )
# server.join() # @machine2 ( task:1 )

n =    5
L = 1000  

def matpow( M, n ):
    if n < 1:                 # base case: for n < 1 just return M as-is
        return M
    else:
        return tf.matmul( M, matpow( M, n - 1 ) )

G = tf.Graph()

with G.as_default():
     with tf.device( "/job:local/task:1/cpu:0" ):
          c1 = []
          tB = tf.placeholder( tf.float32, [L, L] )     # tensor B placeholder
          with tf.device( "/job:local/task:1/gpu:0" ):
               c1.append( matpow( tB, n ) )

     with tf.device( "/job:local/task:0/cpu:0" ):
          c2 = []
          tA = tf.placeholder( tf.float32, [L, L] )     # tensor A placeholder
          with tf.device( "/job:local/task:0/gpu:0" ):
               c2.append( matpow( tA, n ) )
          sum2 = tf.add_n( c1 + c2 )
#---------------------------------------------------------<SECTION-UNDER-TEST>
t1_2 = datetime.datetime.now()
with tf.Session( "grpc://" + IP_1 + ":" + port_1, graph = G ) as sess:
     A = np.random.rand( L, L ).astype( 'float32' )
     B = np.random.rand( L, L ).astype( 'float32' )
     sess.run( sum2, { tA: A, tB: B, } )
t2_2 = datetime.datetime.now()
#---------------------------------------------------------<SECTION-UNDER-TEST>

#------------------------------------------------------------------
_ = aClk_E2E.stop()
#------------------------------------------------------------------
print( "Distributed Computation time: " + str(t2_2 - t1_2))
print( "Distributed Experiment  took: {0: > 16d} [us] End-2-End.".format( _ ) )

Distributed computing is our new Universe, or a set of parallel ones

A first step into this domain is always challenging. Certainties that were taken for granted in previous monolithic-computing experience are lost; many new challenges appear that had no counterpart in single-node process coordination; and there are many new surprises from the new orders of magnitude of distributed execution timing and from distributed blocking issues ( coordination, if not outright dead-lock and/or live-lock ).

Thanks for adding some quantitative facts: ~ 15 seconds is "too-long" for A[1000,1000]; B[1000,1000]; n = 5 -- so far so good.
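For scale, a rough back-of-the-envelope ( my estimate, assuming plain dense fp32 matmuls of about 2·L^3 FLOP each and n chained matmuls per matrix, as the posted matpow() builds them ):

$$ 2 \times n \times 2 L^{3} \;=\; 2 \times 5 \times 2 \cdot 10^{9} \;=\; 2 \cdot 10^{10}\ \mathrm{FLOP} $$

so ~ 15 [s] end-to-end is only about 1 GFLOP/s of effective throughput -- far below the raw arithmetic capability of either node, which is exactly why the instrumented re-run requested below matters.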


Would you mind adding the above proposed code-changes and re-running the experiment on the same real infrastructure?

This will help the rest of this work-in-progress ( WIP ).

-- THANKS IN ADVANCE for running it + posting the updated facts.
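If it helps with collecting those facts: per-op timings and the actual device placements can be captured straight from the session run. A minimal sketch ( re-using A, B, tA, tB, sum2, G, IP_1, port_1 from the script above; the timeline helper path is the TF-1.x one ):

from tensorflow.python.client import timeline    # TF-1.x tracing helper

run_options  = tf.RunOptions( trace_level = tf.RunOptions.FULL_TRACE )
run_metadata = tf.RunMetadata()

with tf.Session( "grpc://" + IP_1 + ":" + port_1,
                 graph  = G,
                 config = tf.ConfigProto( log_device_placement = True )
                 ) as sess:
     sess.run( sum2, { tA: A, tB: B, },
               options      = run_options,
               run_metadata = run_metadata
               )

# dump a chrome://tracing -readable timeline: every op, its device and its duration
with open( 'timeline_distributed_run.json', 'w' ) as f:
     f.write( timeline.Timeline( run_metadata.step_stats )
                      .generate_chrome_trace_format() )

The per-op durations will show whether the wall-clock goes into the matmul kernels themselves, or into host2dev / dev2host copies and the cross-node gRPC traffic.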


It is hard to continue ATM with quantitatively supported statements; however, my gut-feeling suspect is this one:

 def matpow( M, n ):
     return  M if ( n < 1 ) else tf.matmul( M, matpow( M, n - 1 ) )

It uses a recursion, which may go too deep for the GPU cross-compiler / assembler analysers at the given scales of the tensors. The GPU SMX units are "fast" only for mathematically dense kernel microcode: the SMX-local SM_registers have latencies of about 22 GPU_CLKs ( well, maybe just 8, if smartly optimised to get fetched from a well LRU-aligned SM_L1 cache-line ), but each SM can hold way less than 1 KB in these latency-most-friendly SM_registers, and matmul() never re-uses any cell of a matrix, so latency-hiding can never pay less than each global_MEMORY access. Add the [PSPACE] scaling and the data has to spill over to global_MEMORY, where one suddenly experiences latency penalties of about 600+ GPU_CLKs.

While this narrated animation of HPC matmul() mentions the typical CPU / Lx-cache / Memory hierarchy, the message -- why any O(N^3) processing must get incredibly slow on GPUs for N larger than what fits into SM_registers -- is easily visible ( and, as you watch it, imagine all your cache-friendliness lost in the recursive matpow() ).

GPU kernels get their best results for small-scale, still-SM-local convolutions, where exactly this SMX-locality allows good [Data:SMX-local SM_REG] alignment and zero cross-SMX communication -- which is not the case for matmul() processing of anything larger than roughly 7 x 7 matrices, the size that could still fit on SM_REG silicon. Anything above this has to start super-smart, stencil-aligned gymnastics if it strives to pay only the minimum necessary sum of just-GPU-local memIO latencies ( and the story goes on: add poor host2dev / dev2host IO, and many more places where the execution performance is uncontrollably poor ).

The latency costs even of a single matmul( A, A ) suddenly become very long compared to typical image-specialised kernels. ( Sure, there are advanced techniques to by-pass these limits of the specialised silicon, but even if matmul() were a master of top-tier HPC block-matrix ops, it gets left as a naive matmul() once the recursive calls enter the stage -- recursion kills even the smartest tricks, as there is no [SPACE] for the "stack" of intermediate values, for which the auto-generated kernel code pays immense [TIME] penalties, even at scales as small as 1000 x 1000. ) The table below ( PTX instruction latencies, as modelled in the Ocelot emulator ) shows the orders of magnitude involved; a non-recursive construction of the same power is sketched right after it.

   Category                     GPU
   |                            Hardware
   |                            Unit
   |                            |            Throughput
   |                            |            |               Execution
   |                            |            |               Latency
   |                            |            |               |                  PTX instructions                                                      Note 
   |____________________________|____________|_______________|__________________|_____________________________________________________________________|________________________________________________________________________________________________________________________
   Load_shared                  LSU          2               +  30              ld, ldu                                                               Note, .ss = .shared ; .vec and .type determine the size of load. Note also that we omit .cop since no cacheable in Ocelot
   Load_global                  LSU          2               + 600              ld, ldu, prefetch, prefetchu                                          Note, .ss = .global; .vec and .type determine the size of load. Note, Ocelot may not generate prefetch since no caches
   Load_local                   LSU          2               + 600              ld, ldu, prefetch, prefetchu                                          Note, .ss = .local; .vec and .type determine the size of load. Note, Ocelot may not generate prefetch since no caches
   Load_const                   LSU          2               + 600              ld, ldu                                                               Note, .ss = .const; .vec and .type determine the size of load
   Load_param                   LSU          2               +  30              ld, ldu                                                               Note, .ss = .param; .vec and .type determine the size of load
   |                            |                              
   Store_shared                 LSU          2               +  30              st                                                                    Note, .ss = .shared; .vec and .type determine the size of store
   Store_global                 LSU          2               + 600              st                                                                    Note, .ss = .global; .vec and .type determine the size of store
   Store_local                  LSU          2               + 600              st                                                                    Note, .ss = .local; .vec and .type determine the size of store
   Read_modify_write_shared     LSU          2               + 600              atom, red                                                             Note, .space = shared; .type determine the size
   Read_modify_write_global     LSU          2               + 600              atom, red                                                             Note, .space = global; .type determine the size
   |                            |                              
   Texture                      LSU          2               + 600              tex, txq, suld, sust, sured, suq
   |                            |                              
   Integer                      ALU          2               +  24              add, sub, add.cc, addc, sub.cc, subc, mul, mad, mul24, mad24, sad, div, rem, abs, neg, min, max, popc, clz, bfind, brev, bfe, bfi, prmt, mov
   |                            |                                                                                                                     Note, these integer inst. with type = { .u16, .u32, .u64, .s16, .s32, .s64 };
   |                            |                              
   Float_single                 ALU          2               +  24              testp, copysign, add, sub, mul, fma, mad, div, abs, neg, min, max     Note, these Float-single inst. with type = { .f32 };
   Float_double                 ALU          1               +  48              testp, copysign, add, sub, mul, fma, mad, div, abs, neg, min, max     Note, these Float-double inst. with type = { .f64 };
   Special_single               SFU          8               +  48              rcp, sqrt, rsqrt, sin, cos, lg2, ex2                                  Note, these special-single with type = { .f32 };
   Special_double               SFU          8               +  72              rcp, sqrt, rsqrt, sin, cos, lg2, ex2                                  Note, these special-double with type = { .f64 };
   |                                                           
   Logical                      ALU          2               +  24              and, or, xor, not, cnot, shl, shr
   Control                      ALU          2               +  24              bra, call, ret, exit
   |                                                           
   Synchronization              ALU          2               +  24              bar, membar, vote
   Compare & Select             ALU          2               +  24              set, setp, selp, slct
   |                                                           
   Conversion                   ALU          2               +  24              isspacep, cvta, cvt
   Miscellanies                 ALU          2               +  24              brkpt, pmevent, trap
   Video                        ALU          2               +  24              vadd, vsub, vabsdiff, vmin, vmax, vshl, vshr, vmad, vset
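
If the recursion is indeed the suspect, one cheap experiment is to swap in an iterative construction of the same power and re-measure on the same infrastructure ( a minimal sketch; matpow_iter is just an illustrative name and keeps the original semantics of n chained matmul()-s ):

def matpow_iter( M, n ):
    # build the power as a flat, iterative chain of tf.matmul() ops,
    # i.e. without any Python-side recursion during the graph construction
    result = M
    for _ in range( n ):
        result = tf.matmul( M, result )
    return result

# drop-in replacement inside the graph construction above:
#      c1.append( matpow_iter( tB, n ) )
#      c2.append( matpow_iter( tA, n ) )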
