
Tensorflow Adam Multi-GPU Gradient

I am trying to implement a multi-GPU network in TensorFlow using the Adam optimizer.

I am copying the code from the Cifar10_multigpu example, but it looks like when the gradients are computed for the second tower they also include entries for the first tower's variables, and this generates an error when averaging over the two towers. The code for the two towers is this:

    i = 0             # tower index
    tower_grads = []  # per-tower lists of (gradient, variable) pairs
    for d in devs:
        with tf.device(d):
            with tf.name_scope('%s_%d' % (tf_model.TOWER_NAME, i)) as scope:
                loss = tower_loss(scope)
                # Reuse the variables for the next tower.
                tf.get_variable_scope().reuse_variables()
                summaries = tf.get_collection(tf.GraphKeys.SUMMARIES, scope)
                grads = opt.compute_gradients(loss)
                print('\n'.join('{}: {}'.format(*k) for k in enumerate(grads)))
                tower_grads.append(grads)
                i += 1

and this is the code that builds each tower:

    stream, target = placeholder_inputs(FLAGS.batch_size * tf_model.ANGLES / FLAGS.num_gpus)
    logits = tf_model.inference_noisy_simulate(stream)
    _ = tf_model.loss(logits, target)
    losses = tf.get_collection('losses', scope)
    total_loss = tf.add_n(losses, name='total_loss')

Looking at the gradients, the first tower generates this:

0: (None, <tensorflow.python.ops.variables.Variable object at 0x7fca11d0ae10>)
1: (<tf.Tensor 'tower_0/gradients/tower_0/conv1/Conv2D_grad/tuple/control_dependency_1:0' shape=(1, 1, 8, 16) dtype=float32>, <tensorflow.python.ops.variables.Variable object at 0x7fca0c351b10>)
2: (<tf.Tensor 'tower_0/gradients/tower_0/conv1/BiasAdd_grad/tuple/control_dependency_1:0' shape=(16,) dtype=float32>, <tensorflow.python.ops.variables.Variable object at 0x7fca0c380dd0>)
3: (<tf.Tensor 'tower_0/gradients/tower_0/conv2/Conv2D_grad/tuple/control_dependency_1:0' shape=(45, 4, 16, 16) dtype=float32>, <tensorflow.python.ops.variables.Variable object at 0x7fca0c351a10>)
4: (<tf.Tensor 'tower_0/gradients/tower_0/conv2/BiasAdd_grad/tuple/control_dependency_1:0' shape=(16,) dtype=float32>, <tensorflow.python.ops.variables.Variable object at 0x7fca0c3a6dd0>)
5: (<tf.Tensor 'tower_0/gradients/tower_0/conv3/Conv2D_grad/tuple/control_dependency_1:0' shape=(45, 4, 16, 32) dtype=float32>, <tensorflow.python.ops.variables.Variable object at 0x7fca0c3a6490>)
6: (<tf.Tensor 'tower_0/gradients/tower_0/conv3/BiasAdd_grad/tuple/control_dependency_1:0' shape=(32,) dtype=float32>, <tensorflow.python.ops.variables.Variable object at 0x7fca0c351990>)
7: (<tf.Tensor 'tower_0/gradients/tower_0/conv4/Conv2D_grad/tuple/control_dependency_1:0' shape=(45, 4, 32, 64) dtype=float32>, <tensorflow.python.ops.variables.Variable object at 0x7fca0c351890>)
8: (<tf.Tensor 'tower_0/gradients/tower_0/conv4/BiasAdd_grad/tuple/control_dependency_1:0' shape=(64,) dtype=float32>, <tensorflow.python.ops.variables.Variable object at 0x7fca0c3b7790>)
9: (<tf.Tensor 'tower_0/gradients/tower_0/conv5/Conv2D_grad/tuple/control_dependency_1:0' shape=(45, 4, 64, 128) dtype=float32>, <tensorflow.python.ops.variables.Variable object at 0x7fca0c2d9110>)
10: (<tf.Tensor 'tower_0/gradients/tower_0/conv5/BiasAdd_grad/tuple/control_dependency_1:0' shape=(128,) dtype=float32>, <tensorflow.python.ops.variables.Variable object at 0x7fca0c2849d0>)
11: (<tf.Tensor 'tower_0/gradients/tower_0/conv6/Conv2D_grad/tuple/control_dependency_1:0' shape=(45, 4, 128, 256) dtype=float32>, <tensorflow.python.ops.variables.Variable object at 0x7fca0c2e6f10>)
12: (<tf.Tensor 'tower_0/gradients/tower_0/conv6/BiasAdd_grad/tuple/control_dependency_1:0' shape=(256,) dtype=float32>, <tensorflow.python.ops.variables.Variable object at 0x7fca0c2afed0>)
13: (<tf.Tensor 'tower_0/gradients/tower_0/fc1/MatMul_grad/tuple/control_dependency_1:0' shape=(18944, 4096) dtype=float32>, <tensorflow.python.ops.variables.Variable object at 0x7fca0c1f9550>)
14: (<tf.Tensor 'tower_0/gradients/tower_0/fc1/add_grad/tuple/control_dependency_1:0' shape=(4096,) dtype=float32>, <tensorflow.python.ops.variables.Variable object at 0x7fca0c214a10>)
15: (<tf.Tensor 'tower_0/gradients/tower_0/fc1_1/MatMul_grad/tuple/control_dependency_1:0' shape=(4096, 1024) dtype=float32>, <tensorflow.python.ops.variables.Variable object at 0x7fca0c23dfd0>)
16: (<tf.Tensor 'tower_0/gradients/tower_0/fc1_1/add_grad/tuple/control_dependency_1:0' shape=(1024,) dtype=float32>, <tensorflow.python.ops.variables.Variable object at 0x7fca0c269bd0>)
17: (<tf.Tensor 'tower_0/gradients/tower_0/softmax_linear/MatMul_grad/tuple/control_dependency_1:0' shape=(1024, 360) dtype=float32>, <tensorflow.python.ops.variables.Variable object at 0x7fca0c1d1a50>)
18: (<tf.Tensor 'tower_0/gradients/tower_0/softmax_linear/softmax_linear_grad/tuple/control_dependency_1:0' shape=(360,) dtype=float32>, <tensorflow.python.ops.variables.Variable object at 0x7fca0c1def50>)

and the second generates this:

0: (None, <tensorflow.python.ops.variables.Variable object at 0x7fca11d0ae10>)
1: (None, <tensorflow.python.ops.variables.Variable object at 0x7fca0c351b10>)
2: (None, <tensorflow.python.ops.variables.Variable object at 0x7fca0c380dd0>)
3: (None, <tensorflow.python.ops.variables.Variable object at 0x7fca0c351a10>)
4: (None, <tensorflow.python.ops.variables.Variable object at 0x7fca0c3a6dd0>)
5: (None, <tensorflow.python.ops.variables.Variable object at 0x7fca0c3a6490>)
6: (None, <tensorflow.python.ops.variables.Variable object at 0x7fca0c351990>)
7: (None, <tensorflow.python.ops.variables.Variable object at 0x7fca0c351890>)
8: (None, <tensorflow.python.ops.variables.Variable object at 0x7fca0c3b7790>)
9: (None, <tensorflow.python.ops.variables.Variable object at 0x7fca0c2d9110>)
10: (None, <tensorflow.python.ops.variables.Variable object at 0x7fca0c2849d0>)
11: (None, <tensorflow.python.ops.variables.Variable object at 0x7fca0c2e6f10>)
12: (None, <tensorflow.python.ops.variables.Variable object at 0x7fca0c2afed0>)
13: (None, <tensorflow.python.ops.variables.Variable object at 0x7fca0c1f9550>)
14: (None, <tensorflow.python.ops.variables.Variable object at 0x7fca0c214a10>)
15: (None, <tensorflow.python.ops.variables.Variable object at 0x7fca0c23dfd0>)
16: (None, <tensorflow.python.ops.variables.Variable object at 0x7fca0c269bd0>)
17: (None, <tensorflow.python.ops.variables.Variable object at 0x7fca0c1d1a50>)
18: (None, <tensorflow.python.ops.variables.Variable object at 0x7fca0c1def50>)
19: (<tf.Tensor 'tower_1/gradients/tower_1/conv1/Conv2D_grad/tuple/control_dependency_1:0' shape=(1, 1, 8, 16) dtype=float32>, <tensorflow.python.ops.variables.Variable object at 0x7fca0c178c50>)
20: (<tf.Tensor 'tower_1/gradients/tower_1/conv1/BiasAdd_grad/tuple/control_dependency_1:0' shape=(16,) dtype=float32>, <tensorflow.python.ops.variables.Variable object at 0x7fca0bfbb490>)
21: (<tf.Tensor 'tower_1/gradients/tower_1/conv2/Conv2D_grad/tuple/control_dependency_1:0' shape=(45, 4, 16, 16) dtype=float32>, <tensorflow.python.ops.variables.Variable object at 0x7fca0bfda950>)
22: (<tf.Tensor 'tower_1/gradients/tower_1/conv2/BiasAdd_grad/tuple/control_dependency_1:0' shape=(16,) dtype=float32>, <tensorflow.python.ops.variables.Variable object at 0x7fca0bf91bd0>)
23: (<tf.Tensor 'tower_1/gradients/tower_1/conv3/Conv2D_grad/tuple/control_dependency_1:0' shape=(45, 4, 16, 32) dtype=float32>, <tensorflow.python.ops.variables.Variable object at 0x7fca0bfcb590>)
24: (<tf.Tensor 'tower_1/gradients/tower_1/conv3/BiasAdd_grad/tuple/control_dependency_1:0' shape=(32,) dtype=float32>, <tensorflow.python.ops.variables.Variable object at 0x7fca0bf39e90>)
25: (<tf.Tensor 'tower_1/gradients/tower_1/conv4/Conv2D_grad/tuple/control_dependency_1:0' shape=(45, 4, 32, 64) dtype=float32>, <tensorflow.python.ops.variables.Variable object at 0x7fca0bf499d0>)
26: (<tf.Tensor 'tower_1/gradients/tower_1/conv4/BiasAdd_grad/tuple/control_dependency_1:0' shape=(64,) dtype=float32>, <tensorflow.python.ops.variables.Variable object at 0x7fca0bf14fd0>)
27: (<tf.Tensor 'tower_1/gradients/tower_1/conv5/Conv2D_grad/tuple/control_dependency_1:0' shape=(45, 4, 64, 128) dtype=float32>, <tensorflow.python.ops.variables.Variable object at 0x7fca0bf39150>)
28: (<tf.Tensor 'tower_1/gradients/tower_1/conv5/BiasAdd_grad/tuple/control_dependency_1:0' shape=(128,) dtype=float32>, <tensorflow.python.ops.variables.Variable object at 0x7fca0bebd8d0>)
29: (<tf.Tensor 'tower_1/gradients/tower_1/conv6/Conv2D_grad/tuple/control_dependency_1:0' shape=(45, 4, 128, 256) dtype=float32>, <tensorflow.python.ops.variables.Variable object at 0x7fca0bf23110>)
30: (<tf.Tensor 'tower_1/gradients/tower_1/conv6/BiasAdd_grad/tuple/control_dependency_1:0' shape=(256,) dtype=float32>, <tensorflow.python.ops.variables.Variable object at 0x7fca0bf04610>)
31: (<tf.Tensor 'tower_1/gradients/tower_1/fc1/MatMul_grad/tuple/control_dependency_1:0' shape=(18944, 4096) dtype=float32>, <tensorflow.python.ops.variables.Variable object at 0x7fca0bebdc50>)
32: (<tf.Tensor 'tower_1/gradients/tower_1/fc1/add_grad/tuple/control_dependency_1:0' shape=(4096,) dtype=float32>, <tensorflow.python.ops.variables.Variable object at 0x7fca0bebd310>)
33: (<tf.Tensor 'tower_1/gradients/tower_1/fc1_1/MatMul_grad/tuple/control_dependency_1:0' shape=(4096, 1024) dtype=float32>, <tensorflow.python.ops.variables.Variable object at 0x7fca0be96e10>)
34: (<tf.Tensor 'tower_1/gradients/tower_1/fc1_1/add_grad/tuple/control_dependency_1:0' shape=(1024,) dtype=float32>, <tensorflow.python.ops.variables.Variable object at 0x7fca0be96990>)
35: (<tf.Tensor 'tower_1/gradients/tower_1/softmax_linear/MatMul_grad/tuple/control_dependency_1:0' shape=(1024, 360) dtype=float32>, <tensorflow.python.ops.variables.Variable object at 0x7fca0be52c90>)
36: (<tf.Tensor 'tower_1/gradients/tower_1/softmax_linear/softmax_linear_grad/tuple/control_dependency_1:0' shape=(360,) dtype=float32>, <tensorflow.python.ops.variables.Variable object at 0x7fca0bf56f50>)

I am wondering how to remove the leading None gradients from the second tower's list without hard-coding indexes, so that the same code works for more towers.
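One way, as a minimal sketch rather than a fix for the root cause: filter each tower's list by whether the gradient is None, so no indexes are hard-coded and it scales to any number of towers (strip_none_grads is a hypothetical helper name):

    # Hypothetical helper: drop (gradient, variable) pairs whose gradient
    # is None, i.e. variables this tower's loss does not depend on.
    def strip_none_grads(grads):
        return [(g, v) for g, v in grads if g is not None]

    grads = strip_none_grads(opt.compute_gradients(loss))
    tower_grads.append(grads)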

I already found the error. I was using a trainable variable for the learning rate (I wanted to track the lr, but that does not look possible), and I now also pass the list of variables the Adam op should compute gradients for. I am not sure if this is the correct way, but it looks like it works.

with tf.Graph().as_default(), tf.device('/cpu:0'):
        devs = ['/job:prs/task:0/gpu:0', '/job:worker/task:0/gpu:0']
        global_step = tf.get_variable('global_step', [],
                                      initializer=tf.constant_initializer(0),
                                      trainable=False)
        num_batches_per_epoch = dt_fdr.FLS_PER_ANGLE / FLAGS.batch_size
        #lr = tf.Variable(tf.constant(FLAGS.learning_rate, dtype=tf.float32))
        opt = tf.train.AdamOptimizer(FLAGS.learning_rate)
        tower_grads = []
        for i in xrange(FLAGS.num_gpus):
            with tf.device(devs[i]):
                with tf.name_scope('%s_%d' % (tf_model.TOWER_NAME, i)) as scope:
                    loss = tower_loss(scope)
                    tf.get_variable_scope().reuse_variables()
                    summaries = tf.get_collection(tf.GraphKeys.SUMMARIES, scope)
                    #print('\n'.join('{}: {}'.format(*k) for k in enumerate(summaries)))
                    # Only compute gradients for the variables that belong
                    # to this tower's scope.
                    grads = opt.compute_gradients(
                        loss, tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope))
                    #print('\n'.join('{}: {}'.format(*k) for k in enumerate(grads)))
                    tower_grads.append(grads)
        # Synchronization point: average the per-tower gradients.
        grads = average_gradients(tower_grads)
        #summaries.append(tf.scalar_summary('learning_rate', lr))
        for grad, var in grads:
            if grad is not None:
                summaries.append(
                    tf.histogram_summary(var.op.name + '/gradients', grad))
        apply_gradient_op = opt.apply_gradients(grads, global_step=global_step)
        for var in tf.trainable_variables():
            summaries.append(tf.histogram_summary(var.op.name, var))

        train_op = apply_gradient_op

        saver = tf.train.Saver(tf.all_variables())

        summary_op = tf.merge_summary(summaries)

        init = tf.initialize_all_variables()

        sess = tf.Session("grpc://nelson-lab:2500", config=tf.ConfigProto(
            allow_soft_placement=True,
            log_device_placement=FLAGS.log_device_placement))
        sess.run(init)
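The average_gradients helper called above is not shown; a minimal sketch along the lines of the one in the CIFAR-10 multi-GPU example, using the old tf.concat(dim, values) argument order to match the rest of this code, would be:

    def average_gradients(tower_grads):
        """Average the gradient for each variable across all towers.

        tower_grads is a list over towers; each element is the list of
        (gradient, variable) pairs returned by compute_gradients.
        """
        average_grads = []
        for grad_and_vars in zip(*tower_grads):
            # Stack each tower's gradient for this variable and average.
            grads = [tf.expand_dims(g, 0) for g, _ in grad_and_vars]
            grad = tf.reduce_mean(tf.concat(0, grads), 0)
            # The variables are shared across towers, so any tower's
            # variable pointer will do.
            v = grad_and_vars[0][1]
            average_grads.append((grad, v))
        return average_grads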

I wonder if someone else has also tried doing dual-GPU training using Adam.

Regards

Additional

If you plan to update the learning rate during the training phase, declare it like below:

lr = tf.Variable(FLAGS.learning_rate, trainable=False)
opt = tf.train.AdamOptimizer(lr)

After that:

sess.run(tf.assign(lr, new_lr))
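Note that calling tf.assign inside the training loop adds a new op to the graph on each call; a cleaner pattern, sketched below with the hypothetical names new_lr_value and update_lr, builds the assign op once and feeds the new value:

    # Build the update op once, outside the training loop.
    new_lr_value = tf.placeholder(tf.float32, shape=[])
    update_lr = tf.assign(lr, new_lr_value)

    # Later, inside the loop, feed whatever schedule you like,
    # e.g. halving the initial rate:
    sess.run(update_lr, feed_dict={new_lr_value: 0.5 * FLAGS.learning_rate})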
