
tensorflow - understanding tensor shapes for convolution

Currently trying to work my way through the Tensorflow MNIST tutorial for convolutional networks and I could use some help with understanding the dimensions of the darn tensors.

So we have images of 28x28 pixels in size.

The convolution will compute 32 features for each 5x5 patch.

Let's just accept this, for now, and ask ourselves later why 32 features and why 5x5 patches.

Its weight tensor will have a shape of [5, 5, 1, 32] . The first two dimensions are the patch size, the next is the number of input channels, and the last is the number of output channels.

W_conv1 = weight_variable([5, 5, 1, 32])

b_conv1 = bias_variable([32])

If you say so ...

To apply the layer, we first reshape x to a 4d tensor, with the second and third dimensions corresponding to image width and height, and the final dimension corresponding to the number of color channels.

x_image = tf.reshape(x, [-1,28,28,1])

Alright, now I'm getting lost.

Judging by this last reshape, we have "howevermany" 28x28x1 "blocks" of pixels that are our images.

I guess this makes sense because the images are in greyscale

However, if that is the ordering, then our weight tensor is essentially a collection of five 5x1x32 "blocks" of values.

The x32 makes sense, I guess, if we want to infer 32 features per patch

The rest, though, I'm not terribly convinced by.

Why does the weight tensor look the way it apparently does?

(For completeness: we use them

h_conv1 = tf.nn.relu(conv2d(x_image, W_conv1) + b_conv1)
h_pool1 = max_pool_2x2(h_conv1)

where

def conv2d(x, W):
    '''
    2D convolution; expects a 4D input x and a 4D filter tensor W
    '''
    return tf.nn.conv2d(x, W, strides=[1, 1, 1, 1], padding='SAME')

def max_pool_2x2(x):
    '''
    max-pooling over 2x2 patches
    '''
    return tf.nn.max_pool(x, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1], padding='SAME')

)

Your input tensor has the shape [-1, 28, 28, 1]. As you mention, the last dimension is 1 because the images are in greyscale. The first index is the batch size. The convolution processes every image in the batch independently, so the batch size has no influence on the convolution-weight-tensor dimensions, nor, in fact, on any weight-tensor dimensions in the network. That is why the batch size can be arbitrary (the -1 tells tf.reshape to infer that dimension, so the same graph works for any batch size).
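To see that the -1 really just leaves the batch dimension open, here is a minimal sketch in the TF 1.x style the tutorial uses (the placeholder x holding flattened 784-pixel images is an assumption to make the snippet self-contained):

import tensorflow as tf

x = tf.placeholder(tf.float32, shape=[None, 784])  # any batch size, flattened 28*28 images
x_image = tf.reshape(x, [-1, 28, 28, 1])            # -1: infer the batch dimension

print(x_image.shape)  # (?, 28, 28, 1) -- the batch size stays arbitrary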

Now to the weight tensor: you don't have five 5x1x32 blocks; rather, you have 32 blocks of size 5x5x1. Each one represents one feature. The 1 is the depth of the patch, and it is 1 because the images are greyscale (it would be [5, 5, 3, 32] for colour images). The 5x5 is the size of the patch.

The ordering of dimensions in the data tensors is different from the ordering of dimensions in the convolution weight tensors.
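For instance, a quick way to convince yourself of that ordering is to slice the weight variable; this is just a sketch, and the truncated-normal initialisation is an assumption to make it runnable:

import tensorflow as tf

# weight tensor: [filter_height, filter_width, in_channels, out_channels]
W_conv1 = tf.Variable(tf.truncated_normal([5, 5, 1, 32], stddev=0.1))

one_feature = W_conv1[:, :, :, 0]   # the first of the 32 features: a 5x5x1 block
print(one_feature.shape)            # (5, 5, 1)

# data tensors, by contrast, are ordered [batch, height, width, channels]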

Besides the other answer, I would like to add a few more points.

Let's just accept this, for now, and ask ourselves later why 32 features and why 5x5 patches.

There is no specific reason why we choose 5x5 patches or 32 features; these parameters are mostly chosen from experience (except in some cases). You may just as well use 3x3 patches or a larger number of features.

I said 'except in some cases' because we may, for example, use 3x3 patches to capture finer-grained information from the images, or a larger number of features to learn each image in more detail ('larger' and 'more detail' are relative terms here).

However, if that is the ordering, then our weight tensor is essentially a collection of five 5x1x32 "blocks" of values.

Not exactly: the weight tensor is not a collection of five blocks; it is a single filter with patch size 5x5, 1 input channel, and 32 output features (channels).

Why does the weight tensor look the way it apparently does?

The weight tensor weight_variable([5, 5, 1, 32]) says: I have a 5x5 patch size to apply to an image, 1 input feature (since the images are in grayscale), and 32 output features (channels).

More Details:

So the line tf.nn.conv2d(x, W, strides=[1,1,1,1], padding='SAME') takes an input x of shape [-1, 28, 28, 1]. The -1 means you can put any size you want in this dimension (the batch size); 28, 28 is the input size, and it must be exactly 28x28; and the last 1 is the number of input channels, which is 1 because the MNIST images are grayscale. In more detail, it says each input image is a 28x28 2D matrix, and each cell of the matrix holds a value indicating the grayscale intensity. If the input images were RGB, we would have 3 channels instead of 1, and each input image would be a 28x28x3 3D array: the first plane along the channel dimension holds the intensity of the red color, the second the green, and the third the blue.
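As a small illustration of that channel dimension (the batch size of 64 and the zero-filled NumPy arrays are just assumptions for the example; only the shapes matter):

import numpy as np

# Grayscale MNIST batch: each image is a 28x28 matrix of intensities.
gray_batch = np.zeros([64, 28, 28, 1], dtype=np.float32)   # matching filters: [5, 5, 1, 32]

# A hypothetical RGB batch: each image is a 28x28x3 volume (R, G, B planes).
rgb_batch = np.zeros([64, 28, 28, 3], dtype=np.float32)    # matching filters: [5, 5, 3, 32]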

Now tf.nn.conv2d(x, W, strides=[1,1,1,1], padding='SAME') takes x and applies W (which here is a 5x5 patch), sliding this patch over the 28x28 image with step size 1 (since the stride is 1), and gives a result that is again 28x28 in size because we use padding='SAME'.
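To check that the result really stays 28x28, here is a minimal sketch (the dummy zero image and the random filter initialisation are assumptions; the point is only the output shapes):

import tensorflow as tf

x_image = tf.zeros([1, 28, 28, 1])                               # one dummy grayscale image
W = tf.Variable(tf.truncated_normal([5, 5, 1, 32], stddev=0.1))  # 5x5 patch, 1 in, 32 out

same = tf.nn.conv2d(x_image, W, strides=[1, 1, 1, 1], padding='SAME')
valid = tf.nn.conv2d(x_image, W, strides=[1, 1, 1, 1], padding='VALID')

print(same.shape)   # (1, 28, 28, 32) -- 'SAME' pads, so the spatial size is preserved
print(valid.shape)  # (1, 24, 24, 32) -- 'VALID' shrinks it by patch_size - 1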
