
How does tensorflow connect the dimensions of linked convolutional layers?

This is a very basic tensorflow question, but I haven't yet seen a clear explanation in the docs. Following the examples on the tensorflow site, we basically have these two layers connected:

conv1 = tf.layers.conv2d(
    inputs=input_layer,
    filters=32,
    kernel_size=[5, 5],
    padding="same",
    activation=tf.nn.relu)

The shape at this point will be (28, 28, 32).

conv2 = tf.layers.conv2d(
    inputs=conv1,
    filters=64,
    kernel_size=[5, 5],
    padding="same",
    activation=tf.nn.relu)

The shape at this point will be (28, 28, 64). How does tensorflow take the (28, 28, 32) and turn it into (28, 28, 64) using a 2d kernel? Could you please explain or point me to the documentation? And what about when the output dimension of the second layer is smaller, say

conv2 = tf.layers.conv2d(
    inputs=conv1,
    filters=8,
    kernel_size=[5, 5],
    padding="same",
    activation=tf.nn.relu)

How would tensorflow combine the 32 dimensions into 8?

When you have 1 filter (= channel) in both the input and the output, you get a single 5x5 convolution kernel. The output for pixel (x, y) is calculated by taking the Hadamard (element-wise) product of the kernel and input[x - 2 : x + 3, y - 2 : y + 3], summing the resulting 5x5 matrix, and finally applying the activation function (tf.nn.relu() in your case). Since some of these coordinates point outside the input, padding="same" comes into play: zero is used as a virtual element for positions outside the image for the sake of this calculation. Your network learns the weights in the kernel.
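As a rough sketch of that single-channel case in NumPy (the function name conv2d_single and the loop-based implementation are mine for illustration, not how TensorFlow actually computes it internally):

```python
import numpy as np

def conv2d_single(inp, kernel):
    """Single-channel 'same'-padded convolution: for each pixel, take the
    element-wise (Hadamard) product of the kernel with the surrounding
    neighbourhood (zero-padded at the borders), sum it, and apply ReLU."""
    k = kernel.shape[0]          # kernel size, e.g. 5
    pad = k // 2                 # 2 for a 5x5 kernel
    padded = np.pad(inp, pad)    # zeros for positions outside the image
    h, w = inp.shape
    out = np.empty((h, w))
    for x in range(h):
        for y in range(w):
            # padded[x : x + k, ...] is inp[x - pad : x + pad + 1, ...]
            patch = padded[x:x + k, y:y + k]
            out[x, y] = np.sum(patch * kernel)   # Hadamard product, then sum
    return np.maximum(out, 0)    # ReLU activation
```

With a kernel that is 1 at the centre and 0 elsewhere, this reproduces the (non-negative) input unchanged, which is a quick sanity check on the indexing and padding.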

Now suppose you have 2 filters in the input and 1 in the output. Then you have two kernels, and the output is the sum of the two separate operations described above, one applied to each input channel.
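Sketching that two-input-channel case (again an illustrative NumPy version of my own, omitting the activation for brevity):

```python
import numpy as np

def conv2d_two_in(inp, kernels):
    """Two input channels -> one output channel.
    inp: (H, W, 2); kernels: (5, 5, 2), i.e. one 5x5 kernel per input
    channel. Each output pixel is the sum of the two per-channel
    convolution results."""
    k = kernels.shape[0]
    pad = k // 2
    h, w, _ = inp.shape
    padded = np.pad(inp, ((pad, pad), (pad, pad), (0, 0)))
    out = np.empty((h, w))
    for x in range(h):
        for y in range(w):
            # the element-wise product runs over both channels at once,
            # so the sum already adds the two per-channel convolutions
            out[x, y] = np.sum(padded[x:x + k, y:y + k, :] * kernels)
    return out
```

If both kernels are centre-1 identity kernels, the output is simply the sum of the two input channels, matching the "sum over input channels" description above.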

If you have 1 filter in the input and 2 in the output, then again you have two kernels, and the two output channels for each pixel are generated separately, each with its corresponding kernel.

Now the big jump: if you have k filters in the input and n filters in the output, then you have k * n different kernels to learn, and each of the n output channels for each pixel is calculated as the sum of k separate convolutions, one on each input channel. All k * n kernels are learned by the network.
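Putting the general case together (an illustrative sketch, not TensorFlow's actual implementation): the k * n kernels are stored as one weight tensor of shape (kernel_h, kernel_w, k, n), which for your second layer (32 -> 64 channels, 5x5 kernels) is (5, 5, 32, 64), i.e. 5 * 5 * 32 * 64 = 51200 weights plus 64 biases. The 32 -> 8 case works exactly the same way, just with a (5, 5, 32, 8) weight tensor.

```python
import numpy as np

def conv2d_general(inp, kernels):
    """General case: inp (H, W, k), kernels (kh, kw, k, n) -> (H, W, n).
    Output channel j is the sum of k per-input-channel convolutions
    using the kernel slice kernels[:, :, :, j]."""
    kh = kernels.shape[0]
    pad = kh // 2
    h, w, _ = inp.shape
    n = kernels.shape[3]
    padded = np.pad(inp, ((pad, pad), (pad, pad), (0, 0)))
    out = np.zeros((h, w, n))
    for j in range(n):                   # each output channel separately
        for x in range(h):
            for y in range(w):
                patch = padded[x:x + kh, y:y + kh, :]
                # sum over the kernel window AND all k input channels
                out[x, y, j] = np.sum(patch * kernels[:, :, :, j])
    return out
```

So whether n is larger (32 -> 64) or smaller (32 -> 8) than k, the mechanism is identical: every output channel mixes all input channels through its own stack of k kernels.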


 