
Tensorflow + Keras training: InvalidArgumentError: Incompatible shapes: [7,128,2,2] vs [7,128,3,3]

I am implementing and training the Tiny-DSOD network with TensorFlow + Keras. As soon as the first epoch starts, training is terminated with the error: tensorflow.python.framework.errors_impl.InvalidArgumentError: Incompatible shapes: [7,128,2,2] vs. [7,128,3,3]

The batch size is 8, the image size is (300, 300), and the training dataset is PASCAL VOC 2007+2012. The error occurs between one of the outputs to the prediction layer (very similar to SSD) and the loss: [[{{node add_fpn_0_/add}}]] [[{{node loss/add_50}}]]

Currently the TensorFlow version is 1.13, Keras is 2.2.4 and Python is 3.6. I have checked the model itself (the shapes are as expected), the images generated for each batch (every image is as expected), and TensorBoard for any extra information (everything goes well until the point of termination), and I have tried changing the optimizer used to minimise the loss (currently Adam, but SGD gives exactly the same problem).

history = model.fit_generator(generator=train_generator,
                              steps_per_epoch=math.ceil(n_train_samples/batch_size),
                              epochs=epochs,
                              callbacks=[tf.keras.callbacks.ModelCheckpoint('tinydsod300_weights_epoch--{epoch:02d}_loss--{loss:.4f}_val_loss--{val_loss:.4f}.h5',
                                                                            monitor='val_loss',
                                                                            verbose=1,
                                                                            save_best_only=True,
                                                                            save_weights_only=True,
                                                                            mode='auto', period=1),
                                         tf.keras.callbacks.LearningRateScheduler(lr_schedule),
                                         tf.keras.callbacks.EarlyStopping(monitor='val_loss',
                                                                          min_delta=0.001,
                                                                          patience=2),
                                         tf.keras.callbacks.TerminateOnNaN(),
                                         tf.keras.callbacks.TensorBoard(log_dir='./logs'),
                                         tf.keras.callbacks.BaseLogger()],
                              validation_data=val_generator,
                              validation_steps=math.ceil(n_val_samples/batch_size))

Full error:

WARNING:tensorflow:From /home/alexandre.pires/.conda/envs/neural_network/lib/python3.6/site-packages/tensorflow/python/ops/math_grad.py:102: div (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Deprecated in favor of operator or tf.math.divide.
2019-06-04 15:45:59.614299: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 0
2019-06-04 15:45:59.614330: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-06-04 15:45:59.614337: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990]      0 
2019-06-04 15:45:59.614341: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 0:   N 
2019-06-04 15:45:59.614513: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 2998 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1050 Ti, pci bus id: 0000:01:00.0, compute capability: 6.1)
Epoch 1/10
2019-06-04 15:46:28.296307: W tensorflow/core/common_runtime/bfc_allocator.cc:211] Allocator (GPU_0_bfc) ran out of memory trying to allocate 1.77GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
Traceback (most recent call last):
  File "/home/alexandre.pires/PycharmProjects/neural_networks/tiny-dsod.py", line 830, in <module>
    validation_steps=math.ceil(n_val_samples/batch_size)
  File "/home/alexandre.pires/.conda/envs/neural_network/lib/python3.6/site-packages/tensorflow/python/keras/engine/training.py", line 1426, in fit_generator
    initial_epoch=initial_epoch)
  File "/home/alexandre.pires/.conda/envs/neural_network/lib/python3.6/site-packages/tensorflow/python/keras/engine/training_generator.py", line 191, in model_iteration
    batch_outs = batch_function(*batch_data)
  File "/home/alexandre.pires/.conda/envs/neural_network/lib/python3.6/site-packages/tensorflow/python/keras/engine/training.py", line 1191, in train_on_batch
    outputs = self._fit_function(ins)  # pylint: disable=not-callable
  File "/home/alexandre.pires/.conda/envs/neural_network/lib/python3.6/site-packages/tensorflow/python/keras/backend.py", line 3076, in __call__
    run_metadata=self.run_metadata)
  File "/home/alexandre.pires/.conda/envs/neural_network/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1439, in __call__
    run_metadata_ptr)
  File "/home/alexandre.pires/.conda/envs/neural_network/lib/python3.6/site-packages/tensorflow/python/framework/errors_impl.py", line 528, in __exit__
    c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.InvalidArgumentError: Incompatible shapes: [7,128,2,2] vs. [7,128,3,3]
     [[{{node add_fpn_0_/add}}]]
     [[{{node loss/add_50}}]]

One last thing to add: the previous output for the prediction layer does indeed have shape [7,128,2,2], but this never caused any error before. Any tips on where I should debug next? Or where exactly is this error coming from?
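For reference, a minimal probe to inspect the two tensors feeding the failing add on a real batch (a sketch: the layer name is taken from the error message, and I am assuming the add has exactly two inputs):

    import tensorflow as tf

    # Sketch: fetch the input tensors of the Add layer named in the error
    # message and print their concrete shapes for a single batch.
    add_layer = model.get_layer('add_fpn_0_')
    probe = tf.keras.backend.function([model.input], list(add_layer.input))
    batch_x, batch_y = next(train_generator)
    upsampled, lateral = probe([batch_x])
    print(upsampled.shape, lateral.shape)  # one branch yields 2x2, the other 3x3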

EDIT1 - CORRECTION

Some corrections were made to the model and a new error appeared, still with the same incompatible shapes:

tensorflow.python.framework.errors_impl.InvalidArgumentError: Incompatible shapes: [8,128,2,2] vs. [8,128,3,3]
     [[{{node add_fpn_0_/add}}]]
     [[{{node loss/predictions_loss/broadcast_weights/assert_broadcastable/is_valid_shape/has_valid_nonscalar_shape/has_invalid_dims/concat}}]]

The depthwise convolution was corrected to behave as intended in the original model (written in Caffe).

Convolution

        layer_name = "conv_" + name
        output = tf.keras.layers.Conv2D(filters=filter, kernel_size=kernel, padding=pad,
                                        strides=stride, kernel_initializer=self.kernel_initializer,
                                        kernel_regularizer=self.regularize, name=layer_name)(input)
        output = tf.keras.layers.BatchNormalization(name=layer_name + "batch_")(output)
        output = tf.keras.layers.Activation('relu', name=layer_name + "relu_")(output)

        return output

DepthWise

        if stride == 2:
            # Stride-2 path: zero-pad explicitly (as the Caffe model does),
            # then convolve with 'VALID' padding so the spatial arithmetic
            # matches the original network.
            output = tf.keras.layers.ZeroPadding2D(padding=self.correct_pad(input, kernel[0]),
                                                   name='zeropad_' + layer_name)(input)
            output = tf.keras.layers.DepthwiseConv2D(kernel_size=kernel, padding='SAME' if stride == 1 else 'VALID',
                                                     strides=stride, kernel_initializer=self.kernel_initializer,
                                                     kernel_regularizer=self.regularize, name=layer_name)(output)
        else:
            output = tf.keras.layers.DepthwiseConv2D(kernel_size=kernel, padding='SAME' if stride == 1 else 'VALID',
                                                     strides=stride, kernel_initializer=self.kernel_initializer,
                                                     kernel_regularizer=self.regularize, name=layer_name)(input)
        if use_batch_norm:
            output = tf.keras.layers.BatchNormalization(center=True, scale=True, trainable=True,
                                                        name=layer_name + "batch_")(output)
            output = tf.keras.layers.Activation('relu', name=layer_name + "relu_")(output)

        return output

Upsample (simple bilinear)

        layer_name = "upsample_" + name
        output = tf.keras.layers.UpSampling2D(size=(input_shape[0], input_shape[1]), interpolation='bilinear',
                                               name=layer_name)(input)
        output = self._depthwise_conv_2d(output, filter=128, kernel=(3, 3), pad='SAME', stride=1, name=layer_name)

        return output

I think the problem is the image dimensions inside the network.

Try changing this part:

 output = self._depthwise_conv_2d(output, filter=128, kernel=(3, 3), pad='SAME', stride=1, name=layer_name)

to this:

 output = self._depthwise_conv_2d(output, filter=128, kernel=(2, 2), pad='SAME', stride=1, name=layer_name)

If you look at the error message, it is telling you that one output has 7 elements of 128 filters with dimension 2 x 2, while your network produces an output of 7 elements of 128 filters with dimension 3 x 3.
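A two-line illustration of why the add itself fails (a sketch, not from the original code): with fully static shapes, TensorFlow rejects the operation at graph-construction time because the shapes cannot be broadcast together; in the model some dimensions are presumably only known at run time, hence the runtime InvalidArgumentError.

    import tensorflow as tf

    # Sketch: [7,128,2,2] and [7,128,3,3] cannot be broadcast, so the
    # elementwise add is rejected immediately.
    a = tf.zeros([7, 128, 2, 2])
    b = tf.zeros([7, 128, 3, 3])
    c = a + b  # ValueError: Dimensions must be equal, but are 2 and 3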

Let me know if this helps.

I managed to solve the problem. It was located in the upsampling layer. The model I based mine on uses x2 bilinear upsampling in Caffe, and the Caffe implementation is different from the one in TensorFlow/Keras: the layer below computes the output size as stride * (n - 1) + 1, so a 2x2 map is upsampled to 3x3 (matching the lateral branch), whereas Keras's UpSampling2D(size=2) would produce 4x4, which is exactly the 2x2-vs-3x3 mismatch in the error. I made a custom test layer to check this hypothesis and managed to fix the problem. The upsampling layer I use is now this:

    def UpSampling2DBilinear(self, stride, **kwargs):
        def layer(x):
            input_shape = tf.keras.backend.int_shape(x)
            # Caffe-style bilinear upsample: n -> stride * (n - 1) + 1
            output_shape = (stride * (input_shape[1] - 1) + 1, stride * (input_shape[2] - 1) + 1)
            # Hard-coded fixes so the two affected feature maps of a
            # (300,300) input match the lateral branch exactly.
            if output_shape[0] == 9:
                output_shape = (10, 10)
            if output_shape[0] == 37:
                output_shape = (38, 38)

            return tf.image.resize_bilinear(x, output_shape, align_corners=True)

        return tf.keras.layers.Lambda(layer, **kwargs)

Obviously this is not the final custom-layer solution, but for now it works for an input image size of (300,300).
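As a quick sanity check of the new layer, a minimal sketch (assuming channels-last tensors and that the surrounding class is instantiated as `net`):

    import tensorflow as tf

    # Sketch: a 2x2 feature map upsampled with stride 2 now yields 3x3
    # (matching the lateral FPN branch) instead of UpSampling2D's 4x4.
    x = tf.keras.layers.Input(shape=(2, 2, 128))
    y = net.UpSampling2DBilinear(stride=2, name='upsample_test')(x)
    print(tf.keras.backend.int_shape(y))  # (None, 3, 3, 128)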

So, for anybody who runs into a similar problem in the future, here is a checklist of steps that can be very helpful for debugging:

  • An "incompatible shapes" error on the predictions is most of the time linked to your model: somewhere, some step is doing something wrong. Double/triple/quadruple check the output of every layer of the model (Keras's model.summary() helps here; see the sketch after this list).

  • If the model you are implementing is based on Caffe (or any framework other than the one you are using), check the implementation details of each layer. In my case, I had to change the depthwise convolutions, max pooling and upsampling to match the intended behaviour.

  • Make sure the loss function, batch generators, etc. are also entirely correct, to avoid further problems.
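For the first point, a minimal shape audit looks like this (a sketch, assuming `model` is the built tf.keras model):

    # Print every layer's static output shape to spot the first mismatch.
    model.summary()
    for layer in model.layers:
        print(layer.name, layer.output_shape)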

Hopefully this will be helpful to people fighting this type of error in the future. Thanks to everybody who tried to help me with this!
