Following the example here, one can create a tf.Estimator from an existing Keras model. At the beginning, that page states that by doing so one gains the benefits of the tf.Estimator, such as increased training speed through distributed training. Sadly, when I run the code, only one of the GPUs in my system is used for computation, so there is no speed-up. How exactly can I use distributed training with an estimator built from a Keras model?
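For context, here is a minimal sketch of the conversion I am describing (the model choice and paths are placeholders, not my actual code; I picked a ResNet because the layer names in the traceback below, e.g. res2a_branch2c, suggest one):

```python
import tensorflow as tf

# Placeholder model standing in for the actual Keras model in question.
keras_model = tf.keras.applications.ResNet50(weights=None)
keras_model.compile(optimizer='adam', loss='categorical_crossentropy')

# Convert the compiled Keras model into a tf.Estimator, as in the linked example.
estimator = tf.keras.estimator.model_to_estimator(
    keras_model=keras_model,
    model_dir='/tmp/model_dir')  # placeholder checkpoint directory
```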
I stumbled upon this method:
distributed_model = tf.keras.utils.multi_gpu_model(model, gpus=2)
which sounds like it would take care of this problem. But it is not working at the moment: it creates a graph that uses the get_slice(..) method defined in tensorflow/python/keras/_impl/keras/utils/training_utils.py, and that method fails with the following error message:
Traceback (most recent call last):
  File "hub.py", line 75, in <module>
    estimator = create_model_estimator()
  File "hub.py", line 67, in create_model_estimator
    estimator = tf.keras.estimator.model_to_estimator(keras_model=new_model, custom_objects={'tf': tf}, model_dir=model_dir, config=run_config)
  File "/root/anaconda3/lib/python3.6/site-packages/tensorflow/python/keras/_impl/keras/estimator.py", line 302, in model_to_estimator
    _save_first_checkpoint(keras_model, est, custom_objects, keras_weights)
  File "/root/anaconda3/lib/python3.6/site-packages/tensorflow/python/keras/_impl/keras/estimator.py", line 231, in _save_first_checkpoint
    custom_objects)
  File "/root/anaconda3/lib/python3.6/site-packages/tensorflow/python/keras/_impl/keras/estimator.py", line 109, in _clone_and_build_model
    model = models.clone_model(keras_model, input_tensors=input_tensors)
  File "/root/anaconda3/lib/python3.6/site-packages/tensorflow/python/keras/_impl/keras/models.py", line 1557, in clone_model
    return _clone_functional_model(model, input_tensors=input_tensors)
  File "/root/anaconda3/lib/python3.6/site-packages/tensorflow/python/keras/_impl/keras/models.py", line 1451, in _clone_functional_model
    output_tensors = topology._to_list(layer(computed_tensor, **kwargs))
  File "/root/anaconda3/lib/python3.6/site-packages/tensorflow/python/keras/_impl/keras/engine/topology.py", line 258, in __call__
    output = super(Layer, self).__call__(inputs, **kwargs)
  File "/root/anaconda3/lib/python3.6/site-packages/tensorflow/python/layers/base.py", line 696, in __call__
    outputs = self.call(inputs, *args, **kwargs)
  File "/root/anaconda3/lib/python3.6/site-packages/tensorflow/python/keras/_impl/keras/layers/core.py", line 630, in call
    return self.function(inputs, **arguments)
  File "/root/anaconda3/lib/python3.6/site-packages/tensorflow/python/keras/_impl/keras/utils/training_utils.py", line 156, in get_slice
    shape = array_ops.shape(data)
NameError: name 'array_ops' is not defined
So, what can I do to use both of my GPUs to train a model with a tf.Estimator object?
Edit: By switching the version/build of tensorflow I was able to get rid of the previous error message, but now I get this one:
Traceback (most recent call last):
File "/root/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1327, in _do_call
return fn(*args)
File "/root/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1312, in _run_fn
options, feed_dict, fetch_list, target_list, run_metadata)
File "/root/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1420, in _call_tf_sessionrun
status, run_metadata)
File "/root/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/errors_impl.py", line 516, in __exit__
c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.FailedPreconditionError: Attempting to use uninitialized value res2a_branch2c/bias
[[Node: res2a_branch2c/bias/_482 = _Send[T=DT_FLOAT, client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_1142_res2a_branch2c/bias", _device="/job:localhost/replica:0/task:0/device:GPU:0"](res2a_branch2c/bias)]]
[[Node: bn4a_branch2a/beta/_219 = _Recv[_start_time=0, client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_878_bn4a_branch2a/beta", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]
Maybe this is connected to this issue?
You should set a distributed run config. You can refer to this demo of distributed training with TensorFlow's high-level Estimator API.
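A sketch of what that might look like, assuming a TF version of roughly 1.8 or later (where tf.estimator.RunConfig gained a train_distribute argument and MirroredStrategy lived under tf.contrib.distribute); keras_model and model_dir are placeholders for the objects from the question:

```python
import tensorflow as tf

# Assumption: TF >= 1.8. Replicate the model in-graph across the local GPUs.
strategy = tf.contrib.distribute.MirroredStrategy(num_gpus=2)

# Hand the strategy to the estimator through its run config; train_distribute
# is the RunConfig field that turns on distributed training.
run_config = tf.estimator.RunConfig(train_distribute=strategy)

estimator = tf.keras.estimator.model_to_estimator(
    keras_model=keras_model,  # the compiled Keras model from the question
    model_dir=model_dir,      # placeholder checkpoint directory
    config=run_config)
```

With this approach the plain single-GPU Keras model is passed in unchanged; there is no need to wrap it with multi_gpu_model, since the replication is handled by the strategy at the Estimator level.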