
Distributed training using Keras model and tf.Estimator

Following the example here, one can create a tf.Estimator from an existing Keras model. The page states at the beginning that doing so lets you take advantage of tf.Estimator features such as faster training through distributed training. Sadly, when I run the code, only one of the GPUs in my system is used for computation, so there is no speedup. How exactly can I use distributed training with an estimator built from a Keras model?
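For reference, the conversion itself looks roughly like this (a minimal sketch, not the exact code from the linked example; the ResNet50 model and compile settings are assumptions based on the ResNet-style layer names in the tracebacks below):

import tensorflow as tf

# Hypothetical model; any compiled Keras model works here.
model = tf.keras.applications.ResNet50(weights=None)
model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['accuracy'])

# Convert the compiled Keras model into a tf.estimator.Estimator.
estimator = tf.keras.estimator.model_to_estimator(keras_model=model)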

I stumbled upon this method:

distributed_model = tf.keras.utils.multi_gpu_model(model, gpus=2)

which sounds like it would take care of this problem. However, it is not working at the moment: it creates a graph that uses the get_slice(..) method defined in tensorflow/python/keras/_impl/keras/utils/training_utils.py, and this method fails with the following error message:

Traceback (most recent call last):
  File "hub.py", line 75, in <module>
    estimator = create_model_estimator()
  File "hub.py", line 67, in create_model_estimator
    estimator = tf.keras.estimator.model_to_estimator(keras_model=new_model, custom_objects={'tf': tf}, model_dir=model_dir, config=run_config)
  File "/root/anaconda3/lib/python3.6/site-packages/tensorflow/python/keras/_impl/keras/estimator.py", line 302, in model_to_estimator
    _save_first_checkpoint(keras_model, est, custom_objects, keras_weights)
  File "/root/anaconda3/lib/python3.6/site-packages/tensorflow/python/keras/_impl/keras/estimator.py", line 231, in _save_first_checkpoint
    custom_objects)
  File "/root/anaconda3/lib/python3.6/site-packages/tensorflow/python/keras/_impl/keras/estimator.py", line 109, in _clone_and_build_model
    model = models.clone_model(keras_model, input_tensors=input_tensors)
  File "/root/anaconda3/lib/python3.6/site-packages/tensorflow/python/keras/_impl/keras/models.py", line 1557, in clone_model
    return _clone_functional_model(model, input_tensors=input_tensors)
  File "/root/anaconda3/lib/python3.6/site-packages/tensorflow/python/keras/_impl/keras/models.py", line 1451, in _clone_functional_model
    output_tensors = topology._to_list(layer(computed_tensor, **kwargs))
  File "/root/anaconda3/lib/python3.6/site-packages/tensorflow/python/keras/_impl/keras/engine/topology.py", line 258, in __call__
    output = super(Layer, self).__call__(inputs, **kwargs)
  File "/root/anaconda3/lib/python3.6/site-packages/tensorflow/python/layers/base.py", line 696, in __call__
    outputs = self.call(inputs, *args, **kwargs)
  File "/root/anaconda3/lib/python3.6/site-packages/tensorflow/python/keras/_impl/keras/layers/core.py", line 630, in call
    return self.function(inputs, **arguments)
  File "/root/anaconda3/lib/python3.6/site-packages/tensorflow/python/keras/_impl/keras/utils/training_utils.py", line 156, in get_slice
    shape = array_ops.shape(data)
NameError: name 'array_ops' is not defined

So, what can I do to use both of my GPUs to train a model with a tf.Estimator object?

Edit: By switching to a different version/build of TensorFlow I was able to get rid of the previous error message, but now I get this one:

Traceback (most recent call last):
  File "/root/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1327, in _do_call
    return fn(*args)
  File "/root/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1312, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "/root/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1420, in _call_tf_sessionrun
    status, run_metadata)
  File "/root/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/errors_impl.py", line 516, in __exit__
    c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.FailedPreconditionError: Attempting to use uninitialized value res2a_branch2c/bias
         [[Node: res2a_branch2c/bias/_482 = _Send[T=DT_FLOAT, client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_1142_res2a_branch2c/bias", _device="/job:localhost/replica:0/task:0/device:GPU:0"](res2a_branch2c/bias)]]
         [[Node: bn4a_branch2a/beta/_219 = _Recv[_start_time=0, client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_878_bn4a_branch2a/beta", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]

Maybe this is connected to this issue?

You should set up a distributed run configuration.

You can refer to this demo of distributed training with the TensorFlow high-level API (Estimator):

https://github.com/colinwke/wide_deep_demo
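Separately from the multi-worker TF_CONFIG setup in that demo, for the single-machine two-GPU case in the question a distribution strategy can be passed through the RunConfig. A minimal sketch, assuming TF >= 1.8 (where tf.contrib.distribute.MirroredStrategy and the train_distribute option of RunConfig are available); model and model_dir are placeholders for your own compiled Keras model and checkpoint directory:

import tensorflow as tf

# MirroredStrategy replicates the model on each local GPU and aggregates
# gradients, so both GPUs are used without tf.keras.utils.multi_gpu_model.
strategy = tf.contrib.distribute.MirroredStrategy(num_gpus=2)
run_config = tf.estimator.RunConfig(train_distribute=strategy)

# Pass the original (non-replicated) Keras model; the replication is
# handled by the estimator, not inside the Keras graph.
estimator = tf.keras.estimator.model_to_estimator(
    keras_model=model,      # placeholder: your compiled Keras model
    model_dir=model_dir,    # placeholder: your checkpoint directory
    config=run_config,
)

A subsequent estimator.train(input_fn=...) call then runs the replicated training loop across both GPUs.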
