
TensorFlow 2.x: Cannot load trained model in h5 format when using embedding columns (ValueError: Shapes (101, 15) and (57218, 15) are incompatible)

After a long back and forth, I managed to save my model (see my question TensorFlow 2.x: Cannot save trained model in h5 format (OSError: Unable to create link (name already exists))). But now I have problems loading the saved model. At first, loading the model gave me the following error:

ValueError: You are trying to load a weight file containing 1 layers into a model with 0 layers.

After changing from the Sequential to the functional API, I get the following error:

ValueError: Cannot assign to variable dense_features/NAME1W1_embedding/embedding_weights:0 due to variable shape (101, 15) and value shape (57218, 15) are incompatible

I tried different TensorFlow versions. I got the error described above with tf-nightly. With version 2.1 I got a very similar error:

ValueError: Shapes (101, 15) and (57218, 15) are incompatible.

In versions 2.2 and 2.3 I can't even save my model (as described in my previous question).

Here is the relevant code of the functional API:

def __loadModel(args):
    filepath = args.loadModel

    model = tf.keras.models.load_model(filepath)

    print("start preprocessing...")
    (_, _, test_ds) = preprocessing.getPreProcessedDatasets(args.data, args.batchSize)
    print("preprocessing completed")

    _, accuracy = model.evaluate(test_ds)
    print("Accuracy", accuracy)



def __trainModel(args):
    (train_ds, val_ds, test_ds) = preprocessing.getPreProcessedDatasets(args.data, args.batchSize)

    for bucketSizeGEO in args.bucketSizeGEO:
        print("start preprocessing...")
        feature_columns = preprocessing.getFutureColumns(args.data, args.zip, bucketSizeGEO, True)
        #Todo: compare trainable=False to trainable=True
        feature_layer = tf.keras.layers.DenseFeatures(feature_columns, trainable=False)
        print("preprocessing completed")


        feature_layer_inputs = preprocessing.getFeatureLayerInputs()
        feature_layer_outputs = feature_layer(feature_layer_inputs)
        output_layer = tf.keras.layers.Dense(1, activation=tf.nn.sigmoid)(feature_layer_outputs)

        model = tf.keras.Model(inputs=[v for v in feature_layer_inputs.values()], outputs=output_layer)

        model.compile(optimizer='sgd',
            loss='binary_crossentropy',
            metrics=['accuracy'])

        paramString = "Arg-e{}-b{}-z{}".format(args.epoch, args.batchSize, bucketSizeGEO)


        log_dir = "logs\\logR\\" + paramString + datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
        tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir=log_dir, histogram_freq=1)


        model.fit(train_ds,
                validation_data=val_ds,
                epochs=args.epoch,
                callbacks=[tensorboard_callback])


        model.summary()

        loss, accuracy = model.evaluate(test_ds)
        print("Accuracy", accuracy)

        paramString = paramString + "-a{:.4f}".format(accuracy)

        outputName = "logReg" + datetime.datetime.now().strftime("%Y%m%d-%H%M%S") + paramString

        

        if args.saveModel:
            for i, w in enumerate(model.weights): print(i, w.name)

            path = './saved_models/' + outputName + '.h5'
            model.save(path, save_format='h5')

For the relevant preprocessing part, see the question mentioned at the beginning of this post. for i, w in enumerate(model.weights): print(i, w.name) returns the following (a variant that also prints each weight's shape is sketched after the listing):

0 dense_features/NAME1W1_embedding/embedding_weights:0
1 dense_features/NAME1W2_embedding/embedding_weights:0
2 dense_features/STREETW_embedding/embedding_weights:0
3 dense_features/ZIP_embedding/embedding_weights:0
4 dense/kernel:0
5 dense/bias:0
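
To make the mismatch visible directly, the same loop can also print each weight's shape next to its name (a small sketch reusing the model from the training code above):

for i, w in enumerate(model.weights):
    # prints the shape next to each weight name, so a vocabulary-size mismatch
    # like (101, 15) vs. (57218, 15) can be spotted per embedding column
    print(i, w.name, w.shape)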

This problem is caused by an inconsistency between the dimensions of the embedding matrix in training and prediction.

Usually, before we use an embedding matrix, we build a dictionary; let's temporarily call this dictionary word_index. If the author of the code is not careful, this leads to two different word_index dictionaries in training and prediction (because the data used in training and prediction are different), and the dimension of the embedding matrix changes with it.

You can see from your error that the len(word_index) + 1 you get during training is 57218, while the len(word_index) + 1 obtained during prediction is 101.

If we want the code to run correctly, we must not regenerate a new word_index during prediction. The simplest solution is therefore to save the word_index obtained during training and load that same dictionary at prediction time, so that the weights learned during training can be loaded correctly.
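
A minimal sketch of this idea, assuming the vocabulary is a plain dict that can be serialized to JSON; the file name, helper names, and embedding dimension below are illustrative and not taken from the original code:

import json
import tensorflow as tf

VOCAB_PATH = 'word_index_STREETW.json'  # illustrative file name

def save_word_index(word_index, path=VOCAB_PATH):
    # training: build word_index once from the training data, then persist it
    with open(path, 'w') as f:
        json.dump(word_index, f)

def load_word_index(path=VOCAB_PATH):
    # prediction: load the saved word_index instead of rebuilding it from the
    # prediction data, so the embedding matrix keeps its training-time shape
    with open(path) as f:
        return json.load(f)

def build_street_embedding(word_index, embedding_dim=15):
    # the same num_buckets in training and prediction gives an embedding
    # matrix of shape (len(word_index) + 1, embedding_dim) in both runs
    street_voc = tf.feature_column.categorical_column_with_identity(
        key='STREETW', num_buckets=len(word_index) + 1)
    return tf.feature_column.embedding_column(street_voc, dimension=embedding_dim)

Whatever code builds the feature columns at prediction time must see the same vocabulary size as it did during training; persisting word_index is the simplest way to guarantee that.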

I was able to solve my rather stupid mistake:

I was using the feature_column library to preprocess my data. Unfortunately, I specified a fixed value instead of the actual size of the vocabulary list for the num_buckets parameter of categorical_column_with_identity. Wrong version:

street_voc = tf.feature_column.categorical_column_with_identity(
        key='STREETW', num_buckets=100)

Correct version:

street_voc = tf.feature_column.categorical_column_with_identity(
        key='STREETW', num_buckets= __getNumberOfWords(data, 'STREETPRO') + 1)

The function __getNumberOfWords(data, 'STREETPRO') returns the number of distinct words in the column 'STREETPRO' of the pandas DataFrame.
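
The post doesn't show __getNumberOfWords itself; based on the description above, a minimal sketch could look like this (the nunique-based implementation is an assumption, not the author's actual code):

import pandas as pd

def __getNumberOfWords(data, column):
    # number of distinct values in the given DataFrame column, used as the
    # vocabulary size for the corresponding categorical feature column
    return data[column].nunique()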
