
Kernel Size for 3D Convolution

Original post: 2022-01-24 22:41:56. Tags: deep-learning, neural-network, pytorch, conv-neural-network

In PyTorch and TensorFlow, the kernel size of a 3D convolution is specified by depth, height and width. For example, for CT/MRI image data with 300 slices, the input tensor can have shape (1,1,300,128,128), corresponding to (N,C,D,H,W). The kernel size can then be (3,3,3) for depth, height and width. During the 3D convolution, the kernel slides along all 3 directions.

However, I get confused when the data changes from CT/MRI to a colour video. Suppose the video has 300 frames; the input tensor is then (1,3,300,128,128), because RGB images have 3 channels. I know that for a single RGB image, the kernel can be 3x3x3 over channels, height and width. But for a video, both PyTorch and TensorFlow still seem to use depth, height and width to set the kernel size. My question is: if we still use a kernel of (3,3,3), is there an implicit fourth dimension for the colour channels?

1 Answer

Yes.

Actually, the convolution performed in a CNN is one dimension higher than its name suggests. The kernel always spans the entire channel dimension, though, so there is no sliding along that dimension. For example, a 2D convolution layer with kernel size set to 5x5, applied to a 3-channel input, is actually using a kernel of shape 3x5x5 (assuming channels-first notation). Each output channel is the result of convolving the input with a different 3x5x5 kernel, so there is one such 3x5x5 kernel per output channel.
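To check the 2D case yourself, you can inspect the layer's weight tensor. This is a minimal sketch in PyTorch; the 8 output channels and the 32x32 input are arbitrary values picked for illustration, not something from the question.

```python
import torch
import torch.nn as nn

# Illustrative layer: 3 input channels (e.g. RGB), 8 output channels
# (an arbitrary choice), kernel size 5x5.
conv2d = nn.Conv2d(in_channels=3, out_channels=8, kernel_size=5)

# The weight layout is (out_channels, in_channels, kH, kW):
# eight separate 3x5x5 kernels, one per output channel.
print(conv2d.weight.shape)  # torch.Size([8, 3, 5, 5])

x = torch.randn(1, 3, 32, 32)  # (N, C, H, W), arbitrary spatial size
y = conv2d(x)

# The kernel slides over H and W only; the input channel axis is
# summed out, so it does not appear as a sliding dimension in the output.
print(y.shape)  # torch.Size([1, 8, 28, 28])
```

Even though `kernel_size` only mentions two numbers, the stored weight has the input-channel axis built in.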

The same holds for videos. A 3D convolution layer is actually performing a 4D convolution in the same way. So for an input of shape 1x3x300x128x128 with kernel size set to 3x3x3, the layer is actually performing 4D convolutions with kernels of shape 3x3x3x3 (channels x depth x height x width), one per output channel.
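The video case can be checked the same way. The sketch below assumes PyTorch with default stride and no padding. The 8 output channels are an arbitrary choice, and the clip is shrunk to 16 frames of 32x32 only to keep the forward pass cheap.

```python
import torch
import torch.nn as nn

# Illustrative layer: RGB video input (3 channels), 8 output channels
# (an arbitrary choice), kernel_size=(3, 3, 3) for depth, height, width.
conv3d = nn.Conv3d(in_channels=3, out_channels=8, kernel_size=(3, 3, 3))

# The weight layout is (out_channels, in_channels, kD, kH, kW):
# each output channel owns one 3x3x3x3 kernel that covers all 3 colour channels.
print(conv3d.weight.shape)  # torch.Size([8, 3, 3, 3, 3])

# A small stand-in clip, (N, C, D, H, W); the question's 300x128x128
# video would work the same way but needs far more memory.
x = torch.randn(1, 3, 16, 32, 32)
y = conv3d(x)

# Sliding happens along D, H and W only; each shrinks by kernel_size - 1.
print(y.shape)  # torch.Size([1, 8, 14, 30, 30])
```

With the same settings, the 1x3x300x128x128 input from the question would produce an output of shape 1x8x298x126x126.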