TensorFlow nn.conv3d() and max_pool3d
TensorFlow recently added support for 3D convolution. I'm attempting to train a model on video data.
I have a few questions:
My inputs are 16-frame, 3-channel-per-frame .npy files, so their shape is (128, 171, 48).
1) The docs for tf.nn.max_pool3d() state that the input should have shape [batch, depth, rows, cols, channels]. Is my channels dimension still 3 even though my .npy images are 48 channels deep, so to speak?
2) The next question follows from the last one: is my depth 48 or 16?
3) (Since I'm here) The batch dimension works the same way with 3D arrays, correct? The images are just like any other image, processed one at a time.
Just to be clear: in my case, with a batch size of one and the image dimensions above, my dimensions are:
[1(batch),16(depth), 171(rows), 128(cols), 3(channels)]
EDIT: I've confused raw input size with pooling and kernel sizes here. Some general guidance on 3D convolution and pooling would be helpful. I'm basically stuck on the dimensions for both convolution and pooling, as is clear in the original question.
To answer your question, the dimensions should be (as you stated) [batch_size, depth, H, W, 3], where depth is the number of time frames you have.
For instance, a 5s video at 20 frames/s will have depth=100.
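As a rough sketch, the array from the question could be rearranged into that layout as shown below. It rests on assumptions the question does not state: that the first two axes of the (128, 171, 48) array are spatial, and that the 48 stacked channels are stored frame by frame (16 groups of 3 color channels). If the file interleaves them differently, the reshape would have to change. The helper name to_video_batch and the zero-filled stand-in array are made up for illustration.

```python
import numpy as np

def to_video_batch(clip, frames=16, channels=3):
    """Rearrange an (A, B, frames*channels) array into [1, frames, A, B, channels].

    ASSUMPTION: the last axis is frame-major, i.e.
    stacked_index = frame * channels + channel.
    """
    a, b, stacked = clip.shape
    if stacked != frames * channels:
        raise ValueError(f"expected {frames * channels} stacked channels, got {stacked}")
    # Split the stacked axis into separate (frames, channels) axes.
    clip = clip.reshape(a, b, frames, channels)
    # Move the frame axis to the front: (frames, A, B, channels).
    clip = np.transpose(clip, (2, 0, 1, 3))
    # Add the leading batch dimension.
    return clip[np.newaxis, ...]

# Stand-in for an array loaded from one of the .npy files.
clip = np.zeros((128, 171, 48), dtype=np.float32)
batch = to_video_batch(clip)
print(batch.shape)  # (1, 16, 128, 171, 3)
```

Whether 128 is the row or the column count depends on how the files were written; if it is the column count, the two spatial axes would also need to be swapped to match [batch, depth, rows, cols, channels].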
My best advice would be to first read the CS231n slides on deep learning for videos (if you can watch the lecture video, even better).
Basically, a 3D convolution is the same as a 2D convolution but with one more dimension. Let's do a recap:
1D convolution
    input: [batch_size, 10, in_channels]
    kernel: [3, in_channels, out_channels]
    in_channels is the number of features at each of the 10 positions
2D convolution
    input: [batch_size, 10, 10, in_channels]
    kernel: [3, 3, in_channels, out_channels]
    for instance, a 10x10 image with in_channels=3
3D convolution
    input: [batch_size, T, 10, 10, in_channels]
    kernel: [T_kernel, 3, 3, in_channels, out_channels]
    for instance, a video with T=100 frames, and images of size 10x10, with in_channels=3
    the kernel spans T_kernel frames in the time dimension (ex: T_kernel=10)
The goal of a convolution is to reduce the number of parameters by exploiting redundancies in the data. For images, you can extract the same basic features in the top left 3x3 box and the bottom right 3x3 box.
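The shapes in the recap, and the parameter savings, can be checked with a small pure-Python sketch. It assumes stride 1 and no padding (what TensorFlow calls 'VALID'), a batch size of 1, and an arbitrary out_channels=8; none of these values come from the answer. The helper names are hypothetical.

```python
import math

def valid_output_shape(input_shape, kernel_shape):
    """Output shape of a stride-1 convolution without padding.

    input_shape:  [batch, *spatial, in_channels]
    kernel_shape: [*kernel_spatial, in_channels, out_channels]
    """
    batch, *spatial, in_ch = input_shape
    *k_spatial, k_in, out_ch = kernel_shape
    if in_ch != k_in or len(spatial) != len(k_spatial):
        raise ValueError("input and kernel shapes do not match")
    # Each spatial (or temporal) axis shrinks by kernel_size - 1.
    return [batch] + [s - k + 1 for s, k in zip(spatial, k_spatial)] + [out_ch]

out_channels = 8  # arbitrary choice for illustration
cases = {
    "1D": ([1, 10, 3], [3, 3, out_channels]),
    "2D": ([1, 10, 10, 3], [3, 3, 3, out_channels]),
    "3D": ([1, 100, 10, 10, 3], [10, 3, 3, 3, out_channels]),
}

for name, (inp, kern) in cases.items():
    out = valid_output_shape(inp, kern)
    # The kernel's weight count does not depend on the input size.
    print(name, "output:", out, "kernel weights:", math.prod(kern))

# For comparison: a dense layer mapping the whole 3D input to the same
# output size would need one weight per (input value, output value) pair.
inp3d, kern3d = cases["3D"]
dense_weights = math.prod(inp3d) * math.prod(valid_output_shape(inp3d, kern3d))
print("dense weights for the 3D case:", dense_weights)
```

The 3D kernel here holds 2,160 weights regardless of how long the video is, while the dense equivalent would need more than a billion.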
For videos, it is the same. You can extract information from a 3x3 box of the image, but across a window of frames (ex: 10 frames). The result will have a receptive field of 3x3 in the image dimensions and 10 frames in the time dimension.
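To make the receptive field concrete, here is a naive NumPy sketch of a 3D convolution with stride 1 and no padding. It is not how TensorFlow implements tf.nn.conv3d; it only mirrors the [batch, T, H, W, in_channels] and [T_kernel, 3, 3, in_channels, out_channels] layouts used above. The kernel is not flipped (cross-correlation, which is what deep learning libraries usually call convolution). The function name conv3d_valid is made up, and sliding_window_view requires NumPy 1.20 or newer.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def conv3d_valid(x, kernel):
    """Naive stride-1, no-padding 3D convolution.

    x:      [batch, T, H, W, in_channels]
    kernel: [kT, kH, kW, in_channels, out_channels]
    """
    kt, kh, kw, _, _ = kernel.shape
    # windows: [batch, T-kT+1, H-kH+1, W-kW+1, in_channels, kT, kH, kW]
    windows = sliding_window_view(x, (kt, kh, kw), axis=(1, 2, 3))
    # Each output value sums over one kT x kH x kW x in_channels block.
    return np.einsum("bthwcijk,ijkco->bthwo", windows, kernel)

# A short 12-frame, 5x5, 3-channel clip and a kernel covering 10 frames.
x = np.ones((1, 12, 5, 5, 3))
kernel = np.ones((10, 3, 3, 3, 2))
y = conv3d_valid(x, kernel)
print(y.shape)         # (1, 3, 3, 3, 2)
print(y[0, 0, 0, 0])   # [270. 270.]
```

With all-ones inputs, every output value equals 10 * 3 * 3 * 3 = 270, the number of input values inside one receptive field: 10 frames of a 3x3 patch across 3 channels.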