I'm trying to implement this distributed Keras Tuner example on Google Cloud Platform (GCP) ML Engine (aka AI Platform): https://github.com/keras-team/keras-tuner/blob/master/docs/templates/tutorials/distributed-tuning.md
Here is my ML Engine training input .yaml:
scaleTier: CUSTOM
masterType: standard
masterConfig:
  imageUri: tensorflow/tensorflow:2.1.0-gpu-py3
workerCount: 8
workerType: standard_gpu
workerConfig:
  imageUri: tensorflow/tensorflow:2.1.0-gpu-py3
At the top of the Python script, I add:
import json
import os

tf_config = json.loads(os.environ['TF_CONFIG'])
cluster = tf_config['cluster']
task = tf_config['task']

master_addr = cluster['master'][0].split(':')
os.environ['KERASTUNER_ORACLE_IP'] = master_addr[0]
os.environ['KERASTUNER_ORACLE_PORT'] = '8000'
if task['type'] == 'master':
    os.environ['KERASTUNER_TUNER_ID'] = 'chief'
else:
    os.environ['KERASTUNER_TUNER_ID'] = 'tuner{}'.format(task['index'])
Unfortunately, this does not work. The master returns the error:
server_chttp2.cc:40] {"created":"@1580940408.588629852","description":"No address added out of total 1 resolved","file":"src/core/ext/transport/chttp2/server/chttp2_server.cc","file_line":395,"referenced_errors":[{"created":"@1580940408.588623412","description":"Unable to configure socket","fd":22,"file":"src/core/lib/iomgr/tcp_server_utils_posix_common.cc","file_line":208,"referenced_errors":[{"created":"@1580940408.588609041","description":"Cannot assign requested address","errno":99,"file":"src/core/lib/iomgr/tcp_server_utils_posix_common.cc","file_line":181,"os_error":"Cannot assign requested address","syscall":"bind"}]}]}
Thus it appears that the master is not able to bind to the listening port.
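The errno 99 ("Cannot assign requested address") in that log is what the OS returns when a process tries to bind a socket to an IP that is not assigned to any local interface. A small sketch reproducing the symptom outside of gRPC (the 203.0.113.7 address is an illustrative TEST-NET address that no machine owns):

```python
import socket


def try_bind(addr):
    """Attempt to bind a TCP socket to addr; return the errno on failure, None on success."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        s.bind((addr, 0))  # port 0: let the OS pick a free port
        return None
    except OSError as ex:
        return ex.errno
    finally:
        s.close()


# 203.0.113.7 is a documentation-only address no local interface owns, so
# the bind fails with EADDRNOTAVAIL (errno 99 on Linux) -- the same
# "Cannot assign requested address" the gRPC server reports above.
print(try_bind('203.0.113.7'))
# The wildcard address 0.0.0.0 always binds, which hints at the workaround below.
print(try_bind('0.0.0.0'))
```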
So, I suppose the real question is: how do I bind to a listening port on GCP ML Engine? Is this allowed?
Any insight on how to run distributed Keras Tuner on GCP ML Engine is appreciated.
I had a similar problem to the OP's, at least judging by the error message. I am not certain of the root cause, but the workaround that works for me is to bind the chief to 0.0.0.0 (i.e. os.environ['KERASTUNER_ORACLE_IP'] = '0.0.0.0') while the workers still use the chief's IP that comes from TF_CONFIG.
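A minimal sketch of this workaround; the TF_CONFIG value here is a stand-in for illustration only (AI Platform sets the real one in the container environment):

```python
import json
import os
import socket

# Illustrative TF_CONFIG; on AI Platform this is already set for you.
os.environ.setdefault('TF_CONFIG', json.dumps({
    'cluster': {'chief': ['localhost:8000'], 'worker': ['localhost:8001']},
    'task': {'type': 'chief', 'index': 0},
}))

tf_config = json.loads(os.environ['TF_CONFIG'])
task = tf_config['task']
chief_host, chief_port = tf_config['cluster']['chief'][0].split(':')

os.environ['KERASTUNER_ORACLE_PORT'] = chief_port
if task['type'] == 'chief':
    # The chief hosts the oracle, so it binds to the wildcard address.
    os.environ['KERASTUNER_ORACLE_IP'] = '0.0.0.0'
    os.environ['KERASTUNER_TUNER_ID'] = 'chief'
else:
    # Workers dial in to the chief, so they need its resolved IP.
    os.environ['KERASTUNER_ORACLE_IP'] = socket.gethostbyname(chief_host)
    os.environ['KERASTUNER_TUNER_ID'] = 'tuner{}'.format(task['index'])
```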
KERASTUNER_ORACLE_IP expects an IP address, not a hostname.
Here is the function I used in my project; see https://github.com/vlasenkoalexey/gcp_runner/blob/master/entry_point.ipynb
import os
import json
import socket

def setup_keras_tuner_config():
    if 'TF_CONFIG' in os.environ:
        try:
            tf_config = json.loads(os.environ['TF_CONFIG'])
            cluster = tf_config['cluster']
            task = tf_config['task']
            chief_addr = cluster['chief'][0].split(':')
            chief_ip = socket.gethostbyname(chief_addr[0])
            chief_port = chief_addr[1]
            os.environ['KERASTUNER_ORACLE_IP'] = chief_ip
            os.environ['KERASTUNER_ORACLE_PORT'] = chief_port
            if task['type'] == 'chief':
                os.environ['KERASTUNER_TUNER_ID'] = 'chief'
            else:
                os.environ['KERASTUNER_TUNER_ID'] = 'tuner{}'.format(task['index'])
            print('set following environment arguments:')
            print('KERASTUNER_ORACLE_IP: %s' % os.environ['KERASTUNER_ORACLE_IP'])
            print('KERASTUNER_ORACLE_PORT: %s' % os.environ['KERASTUNER_ORACLE_PORT'])
            print('KERASTUNER_TUNER_ID: %s' % os.environ['KERASTUNER_TUNER_ID'])
        except Exception as ex:
            print('Error setting up keras tuner config: %s' % str(ex))
Also note that for TF 2.x, 'master' was replaced by 'chief' in TF_CONFIG. You can pass --use-chief-in-tf-config to have it updated. Confirmed that this works on Google AI Platform and on Kubernetes.
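If your job still produces a TF 1.x-style TF_CONFIG with a 'master' role, a small shim can rename it to 'chief' before calling the function above. This rename_master_to_chief helper is hypothetical (not part of any library), shown only to illustrate the rewrite:

```python
import json
import os

def rename_master_to_chief():
    """Rewrite TF_CONFIG so the 'master' role appears as 'chief' (hypothetical helper)."""
    tf_config = json.loads(os.environ.get('TF_CONFIG', '{}'))
    cluster = tf_config.get('cluster', {})
    if 'master' in cluster:
        cluster['chief'] = cluster.pop('master')
    task = tf_config.get('task', {})
    if task.get('type') == 'master':
        task['type'] = 'chief'
    os.environ['TF_CONFIG'] = json.dumps(tf_config)

# Illustration with a TF 1.x-style TF_CONFIG (addresses are made up):
os.environ['TF_CONFIG'] = json.dumps({
    'cluster': {'master': ['10.0.0.2:2222'], 'worker': ['10.0.0.3:2222']},
    'task': {'type': 'master', 'index': 0},
})
rename_master_to_chief()
print(os.environ['TF_CONFIG'])
```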