Understanding weights from a convolutional layer

I'm trying to do semantic segmentation of magnetic resonance images, which are single-channel images.

To get the encoder of a U-Net network, I use this function:

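# Assumes tf.keras; with standalone Keras, import from keras.layers instead.
from tensorflow.keras.layers import Input, Conv2D, MaxPooling2D

def get_encoder_unet(img_shape, k_init='glorot_uniform', bias_init='zeros'):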

    inp = Input(shape=img_shape)
    conv1 = Conv2D(64, (5, 5), activation='relu', padding='same', data_format="channels_last", kernel_initializer=k_init, bias_initializer=bias_init, name='conv1_1')(inp)
    conv1 = Conv2D(64, (5, 5), activation='relu', padding='same', data_format="channels_last", kernel_initializer=k_init, bias_initializer=bias_init, name='conv1_2')(conv1)
    pool1 = MaxPooling2D(pool_size=(2, 2), data_format="channels_last", name='pool1')(conv1)
    conv2 = Conv2D(96, (3, 3), activation='relu', padding='same', data_format="channels_last", kernel_initializer=k_init, bias_initializer=bias_init, name='conv2_1')(pool1)
    conv2 = Conv2D(96, (3, 3), activation='relu', padding='same', data_format="channels_last", kernel_initializer=k_init, bias_initializer=bias_init, name='conv2_2')(conv2)
    pool2 = MaxPooling2D(pool_size=(2, 2), data_format="channels_last", name='pool2')(conv2)

    conv3 = Conv2D(128, (3, 3), activation='relu', padding='same', data_format="channels_last", kernel_initializer=k_init, bias_initializer=bias_init, name='conv3_1')(pool2)
    conv3 = Conv2D(128, (3, 3), activation='relu', padding='same', data_format="channels_last", kernel_initializer=k_init, bias_initializer=bias_init, name='conv3_2')(conv3)
    pool3 = MaxPooling2D(pool_size=(2, 2), data_format="channels_last", name='pool3')(conv3)

    conv4 = Conv2D(256, (3, 3), activation='relu', padding='same', data_format="channels_last", kernel_initializer=k_init, bias_initializer=bias_init, name='conv4_1')(pool3)
    conv4 = Conv2D(256, (4, 4), activation='relu', padding='same', data_format="channels_last", kernel_initializer=k_init, bias_initializer=bias_init, name='conv4_2')(conv4)
    pool4 = MaxPooling2D(pool_size=(2, 2), data_format="channels_last", name='pool4')(conv4)

    conv5 = Conv2D(512, (3, 3), activation='relu', padding='same', data_format="channels_last", kernel_initializer=k_init, bias_initializer=bias_init, name='conv5_1')(pool4)
    conv5 = Conv2D(512, (3, 3), activation='relu', padding='same', data_format="channels_last", kernel_initializer=k_init, bias_initializer=bias_init, name='conv5_2')(conv5)

    return conv5, conv4, conv3, conv2, conv1, inp

And its summary is:

Model: "encoder"
Layer (type)                 Output Shape              Param #   
input_1 (InputLayer)         [(None, 200, 200, 1)]     0         
conv1_1 (Conv2D)             (None, 200, 200, 64)      1664      
conv1_2 (Conv2D)             (None, 200, 200, 64)      102464    
pool1 (MaxPooling2D)         (None, 100, 100, 64)      0         
conv2_1 (Conv2D)             (None, 100, 100, 96)      55392     
conv2_2 (Conv2D)             (None, 100, 100, 96)      83040     
pool2 (MaxPooling2D)         (None, 50, 50, 96)        0         
conv3_1 (Conv2D)             (None, 50, 50, 128)       110720    
conv3_2 (Conv2D)             (None, 50, 50, 128)       147584    
pool3 (MaxPooling2D)         (None, 25, 25, 128)       0         
conv4_1 (Conv2D)             (None, 25, 25, 256)       295168    
conv4_2 (Conv2D)             (None, 25, 25, 256)       1048832   
pool4 (MaxPooling2D)         (None, 12, 12, 256)       0         
conv5_1 (Conv2D)             (None, 12, 12, 512)       1180160   
conv5_2 (Conv2D)             (None, 12, 12, 512)       2359808   
Total params: 5,384,832
Trainable params: 5,384,832
Non-trainable params: 0

I'm trying to understand how neural networks work, and I have this code to show the shapes of the last layer's weights and biases.

layer_dict = {layer.name: layer for layer in model.layers}

layer_name = model.layers[-1].name
#layer_name = 'conv5_2'

filter_index = 0 # Which filter in this block would you like to visualise?

# Grab the filters and biases for that layer
filters, biases = layer_dict[layer_name].get_weights()

print("\tType: ", type(filters))
print("\tShape: ", filters.shape)
print("\tType: ", type(biases))
print("\tShape: ", biases.shape)

With this output:

    Type:  <class 'numpy.ndarray'>
    Shape:  (3, 3, 512, 512)
    Type:  <class 'numpy.ndarray'>
    Shape:  (512,)

I'm trying to understand what the filters' shape (3, 3, 512, 512) means. I think the last 512 is the number of filters in this layer, but what does (3, 3, 512) mean? My images have one channel (img_shape is (200, 200, 1)), so I don't understand the 3, 3 in the filters' shape.

I think the last 512 is the number of filters in this layer, but what does (3, 3, 512) mean?

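It is the size of a single filter: each filter is itself 3D. The input to conv5_2 is a tensor of shape [batch, height', width', channels], and channels is 512 here because the previous layer, conv5_1, has 512 filters. The 3, 3 is the kernel size you passed to Conv2D and has nothing to do with the number of image channels. Your single input channel only shows up in the first layer, conv1_1, whose kernel has shape (5, 5, 1, 64).
For every 3x3 region of the conv5_2 input, a filter is applied and produces one output value (see animation). The 3x3 weights are different for every input channel (512 in your case), and the per-channel results are summed together with the bias (see this illustration for 1 channel). Finally, you want to perform this convolution once per filter, so you need 512 filters of size 3x3x512. That gives the shape (3, 3, 512, 512) and 3*3*512*512 + 512 = 2,359,808 parameters, which matches the summary.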
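To make the "one filter gives one value" idea concrete, here is a minimal NumPy sketch. It does not use your trained model: filters, biases and patch are random or zero stand-ins that I made up with the same shapes as in your output, and it only computes a single spatial position before the ReLU.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for the arrays returned by get_weights() on conv5_2.
# Axis order: (kernel_h, kernel_w, in_channels, out_channels)
filters = rng.standard_normal((3, 3, 512, 512))
biases = np.zeros(512)

# One 3x3 window of the layer's input, covering all 512 input channels.
patch = rng.standard_normal((3, 3, 512))

# A single filter has shape (3, 3, 512). It is multiplied elementwise
# with the patch and everything is summed into one number.
filter_index = 0
single_filter = filters[..., filter_index]
single_value = np.sum(patch * single_filter) + biases[filter_index]

# Doing the same for all 512 filters at once gives the 512 output
# channels at that spatial position.
all_values = np.tensordot(patch, filters, axes=3) + biases

print(single_filter.shape)  # (3, 3, 512)
print(all_values.shape)     # (512,)
```

So filters[..., k] is the k-th filter, and the last axis of size 512 is what becomes the 512 output channels of conv5_2.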
A good article for a deeper dive into the intuition behind CNN architectures, and Conv2D in particular (see part 2).
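The same rule explains every row of the summary in the question. The sketch below is plain Python with no Keras involved; the two helper functions are hypothetical names of my own, and the layer list is copied by hand from get_encoder_unet, assuming the (200, 200, 1) input.

```python
def conv2d_kernel_shape(kernel_size, in_channels, filters):
    """Kernel shape as Keras stores it: (kernel_h, kernel_w, in_channels, filters)."""
    kernel_h, kernel_w = kernel_size
    return (kernel_h, kernel_w, in_channels, filters)


def conv2d_param_count(kernel_size, in_channels, filters):
    """Number of weights in the kernel plus one bias per filter."""
    kernel_h, kernel_w, c_in, c_out = conv2d_kernel_shape(kernel_size, in_channels, filters)
    return kernel_h * kernel_w * c_in * c_out + c_out


# (name, kernel_size, in_channels, filters); in_channels is the number of
# filters of the previous conv layer (pooling does not change it).
layers = [
    ("conv1_1", (5, 5), 1, 64),
    ("conv1_2", (5, 5), 64, 64),
    ("conv2_1", (3, 3), 64, 96),
    ("conv2_2", (3, 3), 96, 96),
    ("conv3_1", (3, 3), 96, 128),
    ("conv3_2", (3, 3), 128, 128),
    ("conv4_1", (3, 3), 128, 256),
    ("conv4_2", (4, 4), 256, 256),
    ("conv5_1", (3, 3), 256, 512),
    ("conv5_2", (3, 3), 512, 512),
]

total = 0
for name, kernel_size, c_in, c_out in layers:
    shape = conv2d_kernel_shape(kernel_size, c_in, c_out)
    params = conv2d_param_count(kernel_size, c_in, c_out)
    total += params
    print(name, shape, params)

print("Total params:", total)  # Total params: 5384832
```

Only conv1_1 has an in_channels of 1, which is where your one-channel image appears; every later layer sees as many channels as the layer before it has filters.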

