Tensorflow Keras Estimator fails on a regression task while the underlying model works

Question

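I am using a convolutional neural network for a regression task (i.e. the final layer of the network has a single neuron with linear activation), and it works well (enough). When I try to use the exact same model wrapped with tf.keras.estimator.model_to_estimator, the estimator seems to fit, but the training loss stops decreasing very quickly. The final eval loss (after 4 epochs each) is about 0.4 (mean absolute error) for the bare Keras model and about 2.5 (mean absolute error) for the estimator.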

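To demonstrate the problem, I apply my model, both bare and wrapped as an estimator, to the MNIST dataset. (I know that MNIST is a classification task and that treating it as a regression task makes no real sense. The example should still illustrate my point.)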

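What I find very surprising is that when a classification neural network is wrapped into an estimator in the same way, the bare Keras model and its estimator-wrapped version perform equally well (the classification case is not included in the example code below). The discrepancy occurs only for regression tasks. I hope that either I am missing something very basic or that this behaviour is due to a bug in TensorFlow.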

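To keep the differences between the inputs to the two models as small as possible, I wrap MNIST as a tf.data.Dataset and return it from an input function, which is passed to the estimator. For the bare Keras model, I use the same input function to get the tf.data.Dataset and feed it directly to the fit function.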

# python 3.6. Tested with tensorflow-gpu-1.14 and tensorflow-cpu-2.0
import tensorflow as tf
import numpy as np


def get_model(IM_WIDTH=28, num_color_channels=1):
    """Create a very simple convolutional neural network using a tf.keras Functional Model."""
    input = tf.keras.Input(shape=(IM_WIDTH, IM_WIDTH, num_color_channels))
    x = tf.keras.layers.Conv2D(32, 3, activation='relu')(input)
    x = tf.keras.layers.MaxPooling2D(3)(x)
    x = tf.keras.layers.Conv2D(64, 3, activation='relu')(x)
    x = tf.keras.layers.MaxPooling2D(3)(x)
    x = tf.keras.layers.Flatten()(x)
    x = tf.keras.layers.Dense(64, activation='relu')(x)
    output = tf.keras.layers.Dense(1, activation='linear')(x)
    model = tf.keras.Model(inputs=[input], outputs=[output])
    model.compile(optimizer='adam', loss="mae",
                  metrics=['mae'])
    model.summary()
    return model


def input_fun(train=True):
    """Load MNIST and return the training or test set as a tf.data.Dataset; Valid input function for tf.estimator"""
    (train_images, train_labels), (eval_images, eval_labels) = tf.keras.datasets.mnist.load_data()
    train_images = train_images.reshape((60_000, 28, 28, 1)).astype(np.float32) / 255.
    eval_images = eval_images.reshape((10_000, 28, 28, 1)).astype(np.float32) / 255.
    # train_labels = train_labels.astype(np.float32)  # these two lines don't affect behaviour.
    # eval_labels = eval_labels.astype(np.float32)
    # For a neural network with one neuron in the final layer, it doesn't seem to matter if target data is float or int.

    if train:
        dataset = tf.data.Dataset.from_tensor_slices((train_images, train_labels))
        dataset = dataset.shuffle(buffer_size=100).repeat(None).batch(32).prefetch(1)
    else:
        dataset = tf.data.Dataset.from_tensor_slices((eval_images, eval_labels))
        dataset = dataset.batch(32).prefetch(1)  # note: prefetching does not affect behaviour

    return dataset


model = get_model()
train_input_fn = lambda: input_fun(train=True)
eval_input_fn = lambda: input_fun(train=False)

NUM_EPOCHS, STEPS_PER_EPOCH = 4, 1875  # 1875 = number_of_train_images (60,000) / batch_size (32)
USE_ESTIMATOR = False  # change this to compare model/estimator. Estimator performs much worse for no apparent reason
if USE_ESTIMATOR:
    estimator = tf.keras.estimator.model_to_estimator(
        keras_model=model, model_dir="model_directory",
        config=tf.estimator.RunConfig(save_checkpoints_steps=200, save_summary_steps=200))

    train_spec = tf.estimator.TrainSpec(input_fn=train_input_fn, max_steps=STEPS_PER_EPOCH * NUM_EPOCHS)
    eval_spec = tf.estimator.EvalSpec(input_fn=eval_input_fn, throttle_secs=0)

    tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)
    print("Training complete. Evaluating Estimator:")
    print(estimator.evaluate(eval_input_fn))
    # final train loss with estimator: ~2.5 (mean abs. error).
else:
    dataset = train_input_fn()
    model.fit(dataset, steps_per_epoch=STEPS_PER_EPOCH, epochs=NUM_EPOCHS)
    print("Training complete. Evaluating Keras model:")
    print(model.evaluate(eval_input_fn()))
    # final train loss with Keras model: ~0.4 (mean abs. error).

Answer 1

The answer I am giving here is the same one I gave on GitHub.

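I agree with you: with TF1.15 there is a significant difference between the results of the model and the estimator. I don't think the TF1.15 branch will receive many more updates. It will only be updated if there are security-related issues.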

I ran your code with tf-nightly. I did not see any significant difference between the outputs of the model and the estimator.

Here is the output of the model (USE_ESTIMATOR = False):

Training complete. Evaluating Keras model:
313/313 [==============================] - 2s 7ms/step - loss: 0.4018 - mae: 0.4021
[0.4018059968948364, 0.4020615816116333]

Here is the output of the estimator (USE_ESTIMATOR = True):

Training complete. Evaluating Estimator:
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Starting evaluation at 2020-03-18T23:15:15Z
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from model_directory/model.ckpt-7500
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Inference Time : 2.14818s
INFO:tensorflow:Finished evaluation at 2020-03-18-23:15:17
INFO:tensorflow:Saving dict for global step 7500: global_step = 7500, loss = 0.39566746, mae = 0.39566746
INFO:tensorflow:Saving 'checkpoint_path' summary for global step 7500: model_directory/model.ckpt-7500
{'loss': 0.39566746, 'mae': 0.39566746, 'global_step': 7500}
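
As a reference point for the figures in the question, here is a minimal sketch of what a constant prediction would score. It assumes the ten digit labels are equally frequent, which is only roughly true for MNIST, and the helper name constant_mae is made up for this illustration. Under that assumption, always predicting the median label gives a mean absolute error of 2.5, which is close to the ~2.5 reported for the estimator on TF 1.x. That resemblance hints at a network whose output had stopped depending on its input, but it is not a confirmed diagnosis.

```python
from statistics import median


def constant_mae(labels, guess):
    """Mean absolute error obtained by always predicting `guess` (hypothetical helper)."""
    labels = list(labels)
    return sum(abs(y - guess) for y in labels) / len(labels)


# Assumption: digits 0..9 occur equally often (MNIST is only roughly balanced).
digits = list(range(10))

# The median minimises the mean absolute error of a constant prediction.
best_guess = median(digits)
baseline = constant_mae(digits, best_guess)

print(best_guess, baseline)  # prints "4.5 2.5"
```

A regression model that scores well below this baseline, such as the ~0.4 shown in the logs above, is clearly using the image content.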

Answer 2

I filed a bug report at https://github.com/tensorflow/tensorflow/issues/35833#issue-549185982

To avoid splitting the discussion across sites, I am marking this thread as resolved.

Answer 1
1 2020-03-27 22:52:20

Answer 2
0 Accepted 2020-01-13 21:19:23
