
Keras Tuner and TPU in Google Colab

I have a problem with Keras Tuner and TPU. When I run the code below, everything works well and network training is fast.

import os
import tensorflow as tf

vocab_size = 5000
embedding_dim = 64
max_length = 2000

def create_model():
  model = tf.keras.Sequential([
      tf.keras.layers.Embedding(vocab_size, embedding_dim),
      tf.keras.layers.LSTM(100, dropout=0.5, recurrent_dropout=0.5),
      tf.keras.layers.Dense(embedding_dim, activation='relu'),
      tf.keras.layers.Dense(4, activation='softmax')
  ])
  return model

resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu='grpc://' + os.environ['COLAB_TPU_ADDR'])
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)
strategy = tf.distribute.experimental.TPUStrategy(resolver)

with strategy.scope():
  model = create_model()
  model.compile(optimizer='adam',
                loss=tf.keras.losses.SparseCategoricalCrossentropy(),  # the model ends in softmax, so it outputs probabilities, not logits
                metrics=['sparse_categorical_accuracy'])

model.fit(train_padded, y_train,
          epochs=10,
          validation_split=0.15,
          verbose=1, batch_size=128)

When I use Keras Tuner, the neural network trains slowly. I believe the TPU is not being used.

from keras_tuner import Hyperband
from tensorflow.keras.callbacks import EarlyStopping

vocab_size = 5000
max_length = 2000
resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu='grpc://' + os.environ['COLAB_TPU_ADDR'])
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)
strategy = tf.distribute.experimental.TPUStrategy(resolver)

def build_model(hp):
    model = tf.keras.Sequential()
    activation_choice = hp.Choice('activation', values=['relu', 'sigmoid', 'tanh', 'elu', 'selu'])
    embedding_dim = hp.Int('units_hidden', min_value=24, max_value=128, step=8)
    model.add(tf.keras.layers.Embedding(vocab_size, embedding_dim))
    model.add(tf.keras.layers.LSTM(hp.Int('LSTM_Units', min_value=50, max_value=500, step=10), 
                                  dropout=hp.Float('dropout', 0, 0.5, step=0.1, default=0), 
                                  recurrent_dropout=hp.Float('recurrent_dropout', 0, 0.5, step=0.1, default=0)))
    model.add(tf.keras.layers.Dense(embedding_dim, activation=activation_choice))
    model.add(tf.keras.layers.Dense(4, activation='softmax'))
    model.compile(
        optimizer=hp.Choice('optimizer', values=['adam', 'rmsprop', 'SGD']),
        loss=tf.keras.losses.SparseCategoricalCrossentropy(),
        metrics=['sparse_categorical_accuracy'])
    return model


with strategy.scope():
    tuner = Hyperband(
        build_model,
        objective='val_sparse_categorical_accuracy',
        max_epochs=10,
        hyperband_iterations=2)
    tuner.search(train_padded, y_train,
                 batch_size=128,
                 epochs=10,
                 callbacks=[EarlyStopping(patience=1)],
                 validation_split=0.15,
                 verbose=1)

best_model = tuner.get_best_models(num_models=1)[0]
best_model.save('/content/drive/My Drive/best_model.h5')

Notebook link

How can I make Keras Tuner work with the TPU?

You need to pass the strategy to the tuner:

tuner = Hyperband(
      build_model,
      objective='val_sparse_categorical_accuracy',
      max_epochs=10,
      hyperband_iterations=2,
      distribution_strategy=strategy)

(and remove the strategy.scope() part)
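
Put together, the search step looks something like this (a minimal sketch reusing build_model, train_padded, and y_train from the question; note that the objective name must match the metric the model was compiled with, so 'val_sparse_categorical_accuracy' rather than 'val_accuracy'):

# The TPU resolver/strategy setup stays exactly as in the question.
# No strategy.scope() wrapper here: the tuner applies the strategy
# itself when building and training each trial model.
tuner = Hyperband(
    build_model,
    objective='val_sparse_categorical_accuracy',
    max_epochs=10,
    hyperband_iterations=2,
    distribution_strategy=strategy)

tuner.search(train_padded, y_train,
             batch_size=128,
             epochs=10,
             callbacks=[EarlyStopping(patience=1)],
             validation_split=0.15,
             verbose=1)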

To add to the answer above:

I don't use Google Colab but Kaggle. Using a TPU, I get the same error, "File system scheme '[local]' not implemented", when the tuner tries to write its checkpoints to Kaggle's working directory.

Since I don't have a gs:// location, I simply "modified" the function that Keras Tuner calls to save checkpoints so that it can write to a local directory, which is Kaggle's working directory. I used patch() to mock the function.

The first important thing is that Keras Tuner must be version 1.1.2 or above.
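
If you are not sure which version is installed, a quick check (the PyPI package name is keras-tuner):

import keras_tuner
print(keras_tuner.__version__)  # must be >= 1.1.2 for the patch below to match
# In a notebook, upgrade with: !pip install -U keras-tuner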

Example:

from unittest.mock import patch

<your code>

# The new function to "replace" the existing one
# (keras_tuner.engine.tuner_utils.SaveBestEpoch.on_epoch_end):

def new_on_epoch_end(self, epoch, logs=None):
    if not self.objective.has_value(logs):
        # Save on every epoch if the metric value is not in the logs. Either no
        # objective is specified, or the objective is computed and returned
        # after `fit()`.

        # ***** the following are the lines I added *****************************
        # Route the file I/O through the local host instead of the TPU workers,
        # so that writing to a local path works.
        save_locally = tf.saved_model.SaveOptions(experimental_io_device='/job:localhost')
        # I then added ', options=save_locally' to the line below.
        # ************************************************************************
        self.model.save_weights(self.filepath, options=save_locally)
        return
    current_value = self.objective.get_value(logs)
    if self.objective.better_than(current_value, self.best_value):
        self.best_value = current_value

        # ***** the following are the lines I added *****************************
        save_locally = tf.saved_model.SaveOptions(experimental_io_device='/job:localhost')
        # I then added ', options=save_locally' to the line below.
        # ************************************************************************
        self.model.save_weights(self.filepath, options=save_locally)

with patch('keras_tuner.engine.tuner_utils.SaveBestEpoch.on_epoch_end', new_on_epoch_end):
    # Perform the hypertuning. The parameters are exactly like those in the fit() method.
    tuner.search(
        X_train,
        y_train,
        epochs=num_of_epochs,
        validation_data=(X_valid, y_valid),
        callbacks=[early_stopping])

<more of your code>

Since I used 'with patch(...)', the original function is restored automatically once the block exits.

I hope this will be useful for those using Kaggle, or those who want to write to a local dir.
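
The same SaveOptions trick applies if you save the best model yourself after the search finishes, since that write can hit the same '[local]' error on a TPU. A minimal sketch (the output path is just an example):

best_model = tuner.get_best_models(num_models=1)[0]
save_locally = tf.saved_model.SaveOptions(experimental_io_device='/job:localhost')
best_model.save('best_model', options=save_locally)  # SavedModel directory in the working dir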
