
Tensorflow + Keras training: InvalidArgumentError: Incompatible shapes: [7,128,2,2] vs. [7,128,3,3]

I am implementing and training the Tiny-DSOD network on tensorflow + keras. When the 1st epoch starts, training terminates with the error: tensorflow.python.framework.errors_impl.InvalidArgumentError: Incompatible shapes: [7,128,2,2] vs. [7,128,3,3]

The batch size is 8, the image size is (300,300), and the dataset used for training is PASCAL VOC 2007+2012. The error occurs between one of the outputs to the prediction layer (very similar to SSD) and the loss: [[{{node add_fpn_0_/add}}]] [[{{node loss/add_50}}]]

Currently, the tensorflow version is 1.13, keras is 2.2.4 and Python is 3.6. I have checked everything: the model itself (the shapes are as expected), the images generated for the batches (each image is as expected), the loss computation (currently using Adam, but I tried SGD as well and it is exactly the same problem), and TensorBoard in case it could provide any information (everything goes well until the point of termination).

history = model.fit_generator(generator=train_generator,
                              steps_per_epoch=math.ceil(n_train_samples/batch_size),
                              epochs=epochs,
                              callbacks=[tf.keras.callbacks.ModelCheckpoint('tinydsod300_weights_epoch--{epoch:02d}_loss--{loss:.4f}_val_loss--{val_loss:.4f}.h5',
                                                                            monitor='val_loss',
                                                                            verbose=1,
                                                                            save_best_only=True,
                                                                            save_weights_only=True,
                                                                            mode='auto', period=1),
                                         tf.keras.callbacks.LearningRateScheduler(lr_schedule),
                                         tf.keras.callbacks.EarlyStopping(monitor='val_loss',
                                                                          min_delta=0.001,
                                                                          patience=2),
                                         tf.keras.callbacks.TerminateOnNaN(),
                                         tf.keras.callbacks.TensorBoard(log_dir='./logs'),
                                         tf.keras.callbacks.BaseLogger()],
                              validation_data=val_generator,
                              validation_steps=math.ceil(n_val_samples/batch_size))
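
(The lr_schedule passed to LearningRateScheduler is not shown here; a minimal step-decay placeholder, purely illustrative and not the schedule actually used, would be something like:)

    def lr_schedule(epoch):
        # Hypothetical step decay, only to make the snippet above self-contained;
        # the real schedule is not shown in this question.
        return 1e-3 if epoch < 5 else 1e-4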

Full error:

WARNING:tensorflow:From /home/alexandre.pires/.conda/envs/neural_network/lib/python3.6/site-packages/tensorflow/python/ops/math_grad.py:102: div (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Deprecated in favor of operator or tf.math.divide.
2019-06-04 15:45:59.614299: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 0
2019-06-04 15:45:59.614330: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-06-04 15:45:59.614337: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990]      0 
2019-06-04 15:45:59.614341: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 0:   N 
2019-06-04 15:45:59.614513: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 2998 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1050 Ti, pci bus id: 0000:01:00.0, compute capability: 6.1)
Epoch 1/10
2019-06-04 15:46:28.296307: W tensorflow/core/common_runtime/bfc_allocator.cc:211] Allocator (GPU_0_bfc) ran out of memory trying to allocate 1.77GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
Traceback (most recent call last):
  File "/home/alexandre.pires/PycharmProjects/neural_networks/tiny-dsod.py", line 830, in <module>
    validation_steps=math.ceil(n_val_samples/batch_size)
  File "/home/alexandre.pires/.conda/envs/neural_network/lib/python3.6/site-packages/tensorflow/python/keras/engine/training.py", line 1426, in fit_generator
    initial_epoch=initial_epoch)
  File "/home/alexandre.pires/.conda/envs/neural_network/lib/python3.6/site-packages/tensorflow/python/keras/engine/training_generator.py", line 191, in model_iteration
    batch_outs = batch_function(*batch_data)
  File "/home/alexandre.pires/.conda/envs/neural_network/lib/python3.6/site-packages/tensorflow/python/keras/engine/training.py", line 1191, in train_on_batch
    outputs = self._fit_function(ins)  # pylint: disable=not-callable
  File "/home/alexandre.pires/.conda/envs/neural_network/lib/python3.6/site-packages/tensorflow/python/keras/backend.py", line 3076, in __call__
    run_metadata=self.run_metadata)
  File "/home/alexandre.pires/.conda/envs/neural_network/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1439, in __call__
    run_metadata_ptr)
  File "/home/alexandre.pires/.conda/envs/neural_network/lib/python3.6/site-packages/tensorflow/python/framework/errors_impl.py", line 528, in __exit__
    c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.InvalidArgumentError: Incompatible shapes: [7,128,2,2] vs. [7,128,3,3]
     [[{{node add_fpn_0_/add}}]]
     [[{{node loss/add_50}}]]

One last thing to add: the previous output to the prediction layer does indeed have shape [7,128,2,2], but this never caused any error before. Any tips on where I should debug next? Or where exactly is this error coming from?
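
For reference, this is the kind of shape check I run on the model (a minimal sketch; `model` here is just the built tf.keras model):

    # Print every layer's name and static output shape to spot where
    # the 2x2 and 3x3 feature maps diverge.
    for layer in model.layers:
        print(layer.name, layer.output_shape)

    # model.summary() gives the same information as a table.
    model.summary()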

EDIT 1 - CORRECTION

Some corrections were made to the model and a new error appeared, but still with the same incompatible shapes:

tensorflow.python.framework.errors_impl.InvalidArgumentError: Incompatible shapes: [8,128,2,2] vs. [8,128,3,3]
     [[{{node add_fpn_0_/add}}]]
     [[{{node loss/predictions_loss/broadcast_weights/assert_broadcastable/is_valid_shape/has_valid_nonscalar_shape/has_invalid_dims/concat}}]]

The depthwise convolution was corrected to behave as intended in the original model (implemented in Caffe).

Convolution

        layer_name = "conv_" + name
        output = tf.keras.layers.Conv2D(filters=filter, kernel_size=kernel, padding=pad,
                                        strides=stride, kernel_initializer=self.kernel_initializer,
                                        kernel_regularizer=self.regularize, name=layer_name)(input)
        output = tf.keras.layers.BatchNormalization(name=layer_name + "batch_")(output)
        output = tf.keras.layers.Activation('relu', name=layer_name + "relu_")(output)

        return output

Depthwise

        if stride == 2:
            # Explicit zero-padding before a VALID depthwise conv when downsampling
            output = tf.keras.layers.ZeroPadding2D(padding=self.correct_pad(input, kernel[0]),
                                                   name='zeropad_' + layer_name)(input)
            output = tf.keras.layers.DepthwiseConv2D(kernel_size=kernel, padding='SAME' if stride == 1 else 'VALID',
                                                     strides=stride, kernel_initializer=self.kernel_initializer,
                                                     kernel_regularizer=self.regularize, name=layer_name)(output)
        else:
            output = tf.keras.layers.DepthwiseConv2D(kernel_size=kernel, padding='SAME' if stride == 1 else 'VALID',
                                                     strides=stride, kernel_initializer=self.kernel_initializer,
                                                     kernel_regularizer=self.regularize, name=layer_name)(input)
        if use_batch_norm:
            output = tf.keras.layers.BatchNormalization(center=True, scale=True, trainable=True,
                                                        name=layer_name + "batch_")(output)
            output = tf.keras.layers.Activation('relu', name=layer_name + "relu_")(output)

        return output
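
The self.correct_pad helper referenced above is not shown; assuming it mirrors the _correct_pad utility used by the Keras MobileNetV2 implementation, it would look roughly like this:

    def correct_pad(self, inputs, kernel_size):
        # Asymmetric zero-padding for a stride-2 conv, as in keras_applications.mobilenet_v2.
        img_dim = 2 if tf.keras.backend.image_data_format() == 'channels_first' else 1
        input_size = tf.keras.backend.int_shape(inputs)[img_dim:(img_dim + 2)]
        if isinstance(kernel_size, int):
            kernel_size = (kernel_size, kernel_size)
        if input_size[0] is None:
            adjust = (1, 1)
        else:
            adjust = (1 - input_size[0] % 2, 1 - input_size[1] % 2)
        correct = (kernel_size[0] // 2, kernel_size[1] // 2)
        return ((correct[0] - adjust[0], correct[0]),
                (correct[1] - adjust[1], correct[1]))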

Upsample (simple bilinear)

        layer_name = "upsample_" + name
        output = tf.keras.layers.UpSampling2D(size=(input_shape[0], input_shape[1]), interpolation='bilinear',
                                               name=layer_name)(input)
        output = self._depthwise_conv_2d(output, filter=128, kernel=(3, 3), pad='SAME', stride=1, name=layer_name)

        return output
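
For reference, UpSampling2D's size argument is a per-axis upsampling factor, not a target output size; a standalone check (not part of the model code) shows that with channels_last a 2x2 map can only grow to even sizes such as 4x4:

    import tensorflow as tf

    x = tf.keras.layers.Input(shape=(2, 2, 128))
    y = tf.keras.layers.UpSampling2D(size=(2, 2), interpolation='bilinear')(x)
    print(tf.keras.backend.int_shape(y))  # (None, 4, 4, 128) -- never 3x3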

I think the problem is the image dimensions inside the network.

Try changing this part:

 output = self._depthwise_conv_2d(output, filter=128, kernel=(3, 3), pad='SAME', stride=1, name=layer_name)

to this:

 output = self._depthwise_conv_2d(output, filter=128, kernel=(2, 2), pad='SAME', stride=1, name=layer_name)

If you look at the error, it is telling you that one tensor has 7 elements of 128 filters with dimension 2 x 2, while your network produces an output of 7 elements with 128 filters of dimension 3 x 3.
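
To see this concretely, here is a standalone reproduction of the same runtime error (TF 1.x graph mode; the add simply stands in for your add_fpn_0_/add node):

    import numpy as np
    import tensorflow as tf

    a = tf.placeholder(tf.float32, (None, 128, None, None))
    b = tf.placeholder(tf.float32, (None, 128, None, None))
    added = a + b  # plays the role of the FPN merge

    with tf.Session() as sess:
        sess.run(added, feed_dict={a: np.zeros((7, 128, 2, 2)),
                                   b: np.zeros((7, 128, 3, 3))})
    # tensorflow.python.framework.errors_impl.InvalidArgumentError:
    #   Incompatible shapes: [7,128,2,2] vs. [7,128,3,3]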

Let me know if this helps.

I managed to solve the problem. It was located in the upsampling layer. The model I based mine on uses bilinear upsampling x2 in Caffe, and the Caffe implementation is different from the one in tensorflow/keras. I made a custom test layer to check this hypothesis, and it fixed the problem. The upsampling layer I use is now this:

    def UpSampling2DBilinear(self, stride, **kwargs):
        def layer(x):
            input_shape = tf.keras.backend.int_shape(x)
            # Target size per spatial axis: stride * (n - 1) + 1, to mimic the
            # original Caffe model's x2 upsampling
            output_shape = (stride * (input_shape[1] - 1) + 1, stride * (input_shape[2] - 1) + 1)
            # Hard-coded adjustments so the upsampled maps line up with the
            # corresponding feature maps for a (300,300) input
            if output_shape[0] == 9:
                output_shape = (10, 10)
            if output_shape[0] == 37:
                output_shape = (38, 38)

            return tf.image.resize_bilinear(x, output_shape, align_corners=True)

        return tf.keras.layers.Lambda(layer, **kwargs)
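
For context, this is roughly how the layer plugs into the FPN merge that was failing (the variable names here are just illustrative; only the add_fpn_0_ name comes from the error trace):

    # Upsample the coarser pyramid level and add it to the lateral feature map.
    up = self.UpSampling2DBilinear(stride=2, name='upsample_fpn_0_')(coarse_feature_map)
    merged = tf.keras.layers.Add(name='add_fpn_0_')([up, lateral_feature_map])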

Obviously, this is not the final custom layer solution, but for now it works for an input image size of (300,300).

So, for anybody who runs into a similar problem in the future, here is a checklist of steps that can be very helpful for debugging:

  • An incompatible shapes error on the predictions is, most of the time, linked to your model. It means that in some step you are doing something wrong. Double/triple/quadruple check every output of the model at every layer (keras has a model.summary() function to help with this).

  • If the model you are implementing is based on Caffe (or any other framework different from the one you are using), check the implementation details of each layer. In my case, I had to change the depthwise convolutions, max pooling and upsampling to match the desired behaviour.

  • Make sure that the loss functions, batch generators, etc. are also entirely correct, to avoid further problems.

Hopefully this will be helpful to many people fighting this type of error in the future. Thanks to everybody who tried to help me with this!

