
Tensorflow nn.conv3d() and max_pool3d

Original post: 2016-06-24 03:42:42. Tags: multidimensional-array, tensorflow

Recently Tensorflow added support for 3d convolution. I'm attempting to train some video stuff.

I have a few questions:

My inputs are 16-frame, 3-channel-per-frame .npy files, so their shape is (128, 171, 48).

1) The docs for tf.nn.max_pool3d() state that the input should have shape [batch, depth, rows, cols, channels]. Is my channels dimension still 3, even though my .npy images are 48 channels deep, so to speak?

2) The next question dovetails from the last one: is my depth 48 or 16?

3) (since I'm here) The batch dimension is the same with 3d arrays, correct? The images are just like any other image, processed one at a time.

Just to be clear: in my case, for a batch size of one image, with the image dims above, my dimensions are:

[1(batch),16(depth), 171(rows), 128(cols), 3(channels)]

EDIT: I've confused raw input size with pooling and kernel sizes here. Perhaps some general guidance on this 3D stuff would be helpful. I'm basically stuck on the dimensions for both convolution and pooling, as is clear in the original question.

1 Answer

To answer your question, the dimensions should be (as you stated) [batch_size, depth, H, W, 3], where depth is the number of time frames you have.

For instance, a 5 s video at 20 frames/s will have depth=100.
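As a rough sketch of how the (128, 171, 48) arrays from the question could be turned into that 5D layout with NumPy: the code below assumes that axis 0 is the width, axis 1 is the height, and the 48 values on the last axis are 16 frames of 3 channels each with the frame index varying slowest. Those are assumptions about how the files were written, not something the question confirms; if the channels are interleaved differently, the reshape has to change accordingly.

```python
import numpy as np

# Stand-in for np.load(...) of one clip: shape (128, 171, 48) as in the question.
# ASSUMPTION: axis 0 = cols (width), axis 1 = rows (height), and the last axis
# packs 16 frames x 3 channels, frame index varying slowest.
clip = np.arange(128 * 171 * 48, dtype=np.float32).reshape(128, 171, 48)

frames, channels = 16, 3
video = clip.reshape(128, 171, frames, channels)  # (cols, rows, depth, channels)
video = video.transpose(2, 1, 0, 3)               # (depth, rows, cols, channels)
batch = video[np.newaxis, ...]                    # add the batch dimension

print(batch.shape)  # (1, 16, 171, 128, 3)
```

With this layout, depth is 16 and channels is 3; the 48 never appears as a dimension of its own.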

My best advice would be to first read the slides from CS231n about deep learning for videos (if you can watch the lecture video, it's even better).

Basically, a 3D convolution is the same as a 2D convolution but with one more dimension. Let's do a recap:

1D convolution (ex: text):

the input is of shape [batch_size, 10, in_channels]
the kernel is of shape [3, in_channels, out_channels]
ex: for text, this is a sentence of length 10, with word embeddings of dimension in_channels
the kernel, of size 3, slides over the sentence (dim 10)
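To make the 1D shapes concrete, here is a minimal NumPy sketch of a stride-1 convolution without padding. The function name conv1d_valid and the sizes in_channels=4 and out_channels=5 are made up for illustration; this is not TensorFlow's implementation, only the same shape bookkeeping.

```python
import numpy as np

def conv1d_valid(x, kernel):
    """Naive 1D convolution (cross-correlation), stride 1, no padding.

    x:      [batch_size, length, in_channels]
    kernel: [width, in_channels, out_channels]
    returns [batch_size, length - width + 1, out_channels]
    """
    width = kernel.shape[0]
    out_len = x.shape[1] - width + 1
    out = np.zeros((x.shape[0], out_len, kernel.shape[2]), dtype=x.dtype)
    for i in range(out_len):
        window = x[:, i:i + width, :]  # [batch, width, in_channels]
        # Sum over the window positions and the input channels.
        out[:, i, :] = np.einsum("bwc,wco->bo", window, kernel)
    return out

rng = np.random.default_rng(0)
sentence = rng.standard_normal((2, 10, 4))  # 2 sentences, length 10, in_channels=4
kernel = rng.standard_normal((3, 4, 5))     # size 3, in_channels=4, out_channels=5
result = conv1d_valid(sentence, kernel)
print(result.shape)  # (2, 8, 5)
```

A kernel of size 3 fits at 10 - 3 + 1 = 8 positions along the sentence, which is where the 8 comes from.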

2D convolution (ex: image):

the input is of shape [batch_size, 10, 10, in_channels]
the kernel is of shape [3, 3, in_channels, out_channels]
ex: RGB image of size 10x10, with in_channels=3
the kernel, of size 3x3, slides over the image (dim 10x10)
the kernel is a square sliding over the image

3D convolution (ex: video):

the input is of shape [batch_size, T, 10, 10, in_channels]
the kernel is of shape [T_kernel, 3, 3, in_channels, out_channels]
ex: video with T=100 frames, and images of size 10x10, with in_channels=3
the kernel, of size T_kernel in time (ex: T_kernel=10), slides over the video (dim 100x10x10)
the kernel is like a cube, sliding over the "cube" of the video (time * W * H)
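The 3D case can be sketched the same way. The snippet below is a naive NumPy version (stride 1, no padding) that uses the input and kernel layouts listed above; conv3d_valid is a hypothetical helper name and out_channels=4 is an arbitrary choice. It needs NumPy 1.20 or later for sliding_window_view.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def conv3d_valid(video, kernel):
    """Naive 3D convolution (cross-correlation), stride 1, no padding.

    video:  [batch_size, T, H, W, in_channels]
    kernel: [T_kernel, k_h, k_w, in_channels, out_channels]
    """
    kt, kh, kw = kernel.shape[:3]
    # windows: [batch, T', H', W', in_channels, kt, kh, kw]
    windows = sliding_window_view(video, (kt, kh, kw), axis=(1, 2, 3))
    # Sum over the kernel cube (i, j, k) and the input channels (c).
    return np.einsum("bthwcijk,ijkco->bthwo", windows, kernel)

rng = np.random.default_rng(0)
video = rng.standard_normal((1, 100, 10, 10, 3))  # T=100 frames of 10x10 RGB
kernel = rng.standard_normal((10, 3, 3, 3, 4))    # T_kernel=10, 3x3, out_channels=4
features = conv3d_valid(video, kernel)
print(features.shape)  # (1, 91, 8, 8, 4)
```

The time axis shrinks from 100 to 100 - 10 + 1 = 91 and each spatial axis from 10 to 8, because the kernel cube only fits at that many positions without padding.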

The goal of a convolution is to reduce the number of parameters by exploiting redundancies in the data. For images, you can extract the same basic features in the top-left 3x3 box and the bottom-right 3x3 box.
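A quick back-of-the-envelope count shows how large the saving is. The comparison below is only an illustration: the choice of 16 output features is arbitrary, biases are ignored, and the fully connected layer is a hypothetical alternative that produces an output of the same size.

```python
# Weights needed to map a 10x10 RGB image to an 8x8 map with 16 features.
in_h, in_w, in_channels = 10, 10, 3
out_h, out_w, out_channels = 8, 8, 16  # out_channels=16 is an arbitrary choice

# Convolution: one 3x3 kernel shared by every position in the image.
conv_weights = 3 * 3 * in_channels * out_channels

# Fully connected: every output value has its own weight for every input value.
dense_weights = (in_h * in_w * in_channels) * (out_h * out_w * out_channels)

print(conv_weights)   # 432
print(dense_weights)  # 307200
```

The convolution gets away with about 700 times fewer weights because the same kernel is reused at every location.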

For videos, this is the same. You can extract information from a 3x3 box of the image, but over a window of time (ex: 10 frames). The result will have a receptive field of 3x3 in the image dimensions and 10 frames in the time dimension.