Size of Input and ConvNet

In the CS231n course on Convolutional Neural Networks, the ConvNet notes say:

  • INPUT [32x32x3] will hold the raw pixel values of the image, in this case an image of width 32, height 32, and with three color channels R,G,B.

  • CONV layer will compute the output of neurons that are connected to local regions in the input, each computing a dot product between their weights and a small region they are connected to in the input volume. This may result in volume such as [32x32x12] if we decided to use 12 filters.

From the notes, I understand that the INPUT holds images of size 32 (width) x 32 (height) x 3 (depth). But the output of the CONV layer is [32x32x12] if we use 12 filters. Where did the 3, the depth of the image, go?

Please help me out here, thank you in advance.

It gets "distributed" to each feature map (result after convolution with filter).

Before thinking about 12 filters, think about just one. You convolve the input with a filter of shape [filter_width x filter_height x input_channel_number]. Because the filter has the same number of channels as the input, this amounts to running input_channel_number independent 2D convolutions, one per input channel, and then summing the results. The output is a single 2D feature map.
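To make the "one 2D convolution per channel, then sum" step concrete, here is a minimal pure-Python sketch. The function names and the toy 4x4 input are made up for illustration, and it uses no padding, so the output shrinks to 2x2 (the 32x32 output in the notes would need padding). Like most frameworks, it does not flip the kernel, so strictly speaking it is cross-correlation.

```python
def conv2d_single_channel(image, kernel):
    """Valid (no padding) 2D cross-correlation of one channel with one kernel slice."""
    h, w = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    return [
        [
            sum(image[i + a][j + b] * kernel[a][b]
                for a in range(kh) for b in range(kw))
            for j in range(w - kw + 1)
        ]
        for i in range(h - kh + 1)
    ]


def conv_one_filter(volume, filt):
    """Apply one multi-channel filter to a multi-channel input.

    volume: list of C channels, each H x W.
    filt:   list of C kernel slices, each kh x kw.
    Returns a single 2D feature map: the per-channel results summed elementwise.
    """
    per_channel = [conv2d_single_channel(ch, k) for ch, k in zip(volume, filt)]
    out_h, out_w = len(per_channel[0]), len(per_channel[0][0])
    return [
        [sum(m[i][j] for m in per_channel) for j in range(out_w)]
        for i in range(out_h)
    ]


# Toy 3-channel 4x4 input (think R, G, B): channel c is filled with the value c + 1.
volume = [[[c + 1] * 4 for _ in range(4)] for c in range(3)]
# One 3x3x3 filter of all ones.
filt = [[[1] * 3 for _ in range(3)] for _ in range(3)]

feature_map = conv_one_filter(volume, filt)
print(len(feature_map), len(feature_map[0]))  # 2 2
print(feature_map[0][0])                      # 9*1 + 9*2 + 9*3 = 54
```

Note that the result has no channel axis left: the depth of 3 was summed away inside the filter response.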

Now repeat this 12 times, with 12 different filters, to get 12 feature maps, and stack them to get your [32 x 32 x 12] feature volume. That is why the layer's weights form a 4D tensor of shape [filter_width x filter_height x input_channel_number x output_channel_number]; in your case that would be something like [3x3x3x12]. (The ordering of the dimensions varies between frameworks, but the operation is the same.)
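As a rough check on the shapes, the sketch below computes the output volume and the weight shape for a conv layer using the usual (W - F + 2P) / S + 1 rule. The helper name is hypothetical, and the 3x3 filter with padding 1 and stride 1 is an assumption on my part: the quoted notes only give the [32x32x12] result, not the hyperparameters that produce it.

```python
def conv_layer_shapes(in_w, in_h, in_depth, num_filters, filter_size, stride=1, pad=0):
    """Return (output_shape, weight_shape) for a conv layer.

    Spatial size follows (W - F + 2P) / S + 1. The output depth equals the
    number of filters; the input depth only appears in the weight shape.
    """
    span_w = in_w - filter_size + 2 * pad
    span_h = in_h - filter_size + 2 * pad
    if span_w < 0 or span_h < 0:
        raise ValueError("filter is larger than the padded input")
    if span_w % stride or span_h % stride:
        raise ValueError("filter does not tile the input evenly with this stride")
    out_w = span_w // stride + 1
    out_h = span_h // stride + 1
    # Ordering [width, height, in_channels, out_channels]; frameworks differ here.
    weight_shape = (filter_size, filter_size, in_depth, num_filters)
    return (out_w, out_h, num_filters), weight_shape


# Assumed hyperparameters: 12 filters of size 3x3, stride 1, zero padding 1.
out_shape, weight_shape = conv_layer_shapes(32, 32, 3, num_filters=12, filter_size=3, pad=1)
print(out_shape)     # (32, 32, 12)
print(weight_shape)  # (3, 3, 3, 12)
```

So the 3 has not disappeared: it lives in the third dimension of the weights, and the 12 in the output comes purely from the number of filters.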

So, this is fun. I read the document again and found the answer, which was only a scroll down away. Before, I thought a filter was purely 2D, for example 32 x 32 with no depth. The truth is:

A typical filter on a first layer of a ConvNet might have size 5x5x3 (ie 5 pixels width and height, and 3 because images have depth 3, the color channels).

During the forward pass, we slide (more precisely, convolve) each filter across the width and height of the input volume and compute dot products between the entries of the filter and the input at any position.
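For what it's worth, here is how I would sketch a single one of those dot products for a 5x5x3 filter on a 32x32x3 input. The random data and the name response_at are just for illustration, and I am assuming stride 1 and no padding.

```python
import random

random.seed(0)
DEPTH, SIZE, F = 3, 32, 5

# Input volume and filter stored channel-first: [channel][row][col].
image = [[[random.random() for _ in range(SIZE)] for _ in range(SIZE)]
         for _ in range(DEPTH)]
filt = [[[random.random() for _ in range(F)] for _ in range(F)]
        for _ in range(DEPTH)]


def response_at(image, filt, row, col):
    """Dot product of the filter with the patch whose top-left corner is (row, col).

    The sum runs over every channel as well as every spatial offset, so a
    single output value mixes all three color channels.
    """
    return sum(
        image[c][row + a][col + b] * filt[c][a][b]
        for c in range(len(filt))
        for a in range(len(filt[0]))
        for b in range(len(filt[0][0]))
    )


terms = DEPTH * F * F            # multiplications per output value
positions = (SIZE - F + 1) ** 2  # valid positions with stride 1, no padding
value = response_at(image, filt, 0, 0)

print(terms)      # 75
print(positions)  # 784
```

Each of the 784 positions yields one number, which together form the 2D activation map for that filter.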
