
Feed the output of a CNN into an LSTM

This is the first time I am working with LSTM networks. I have a video with a frame rate of 30 fps. I have a CNN network (AlexNet based) and I want to feed the last layer of my CNN into the recurrent network (I am using TensorFlow). Supposing that my batch_size=30, so equal to the fps, and that I want a timestep of 1 second (so, every 30 frames): the output of the last layer of my network will be [batch_size, 1000], so in my case [30, 1000]. Do I now have to reshape my output to [batch_size, time_steps, features] (in my case: [30, 30, 1000])? Is that correct, or am I wrong?

Consider building your CNN model with Conv2D and MaxPool2D layers until you reach your Flatten layer, because the vectorized output of the Flatten layer will be the input data to the LSTM part of your structure.

So, build your CNN model like this:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten

model_cnn = Sequential()
model_cnn.add(Conv2D(32, (3, 3), activation='relu', input_shape=(30, 30, 1)))  # example values; use your own filters/input shape
model_cnn.add(MaxPooling2D((2, 2)))
# ... add more Conv2D/MaxPooling2D blocks as needed ...
model_cnn.add(Flatten())

Now, this is an interesting point: the current version of Keras has some incompatibilities with certain TensorFlow structures that will not let you stack all of your layers in a single Sequential object.

So it's time to use the Keras Model object to complete your neural network with a trick:

from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, TimeDistributed, Lambda, LSTM, Dense

input_lay = Input(shape=(None, ?, ?, ?)) # (time_steps, height, width, channels) of your data
time_distribute = TimeDistributed(Lambda(lambda x: model_cnn(x)))(input_lay) # keras.layers.Lambda is essential to make our trick work :)
lstm_lay = LSTM(?)(time_distribute)
output_lay = Dense(?, activation='?')(lstm_lay)

And finally, now it's time to put our two separate models together:

model = Model(inputs=[input_lay], outputs=[output_lay])
model.compile(...)
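
For concreteness, here is a minimal sketch of the same assembly with the ? placeholders filled in by assumed example values (30x30 grayscale frames, a 64-unit LSTM, and a 2-class softmax head; none of these come from the original answer, so swap in your own dimensions):

input_lay = Input(shape=(None, 30, 30, 1))  # variable-length clips of 30x30 grayscale frames
time_distribute = TimeDistributed(Lambda(lambda x: model_cnn(x)))(input_lay)
lstm_lay = LSTM(64)(time_distribute)  # 64 units is an arbitrary example choice
output_lay = Dense(2, activation='softmax')(lstm_lay)  # assuming a 2-class problem

model = Model(inputs=[input_lay], outputs=[output_lay])
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])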

Now, for the OpenCV part, use an algorithm like the one shown below to preprocess your videos directly, in order to build a big tensor of frames to feed into your network:

import os
import cv2
import numpy as np

video_folder = '/path.../'
X_data = []
y_data = []
list_of_videos = os.listdir(video_folder)

for i in list_of_videos:
    #Path to each video in the folder
    vid = str(video_folder + i)
    #Reading the Video
    cap = cv2.VideoCapture(vid)
    #fps = cap.get(cv2.CAP_PROP_FPS)
    #To Store Frames
    frames = []
    for j in range(40): #here we get 40 frames, for example
        ret, frame = cap.read()
        if ret == True:
            print('Class 1 - Success!')
            frame = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY) #converting to gray
            frame = cv2.resize(frame, (30, 30), interpolation=cv2.INTER_AREA)
            frames.append(frame)
        else:
            print('Error!')
    cap.release()
    X_data.append(frames) #appending each tensor of 40 frames resized to 30x30
    y_data.append(1) #appending a class label to the set of 40 frames
X_data = np.array(X_data)
y_data = np.array(y_data) #ready to split! :)
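
Note that X_data here has shape [num_videos, 40, 30, 30], while the Input layer above expects a trailing channel axis. A small follow-up sketch, assuming the grayscale pipeline above (the fit arguments are illustrative, not from the original answer):

X_data = np.expand_dims(X_data, axis=-1)  # -> [num_videos, 40, 30, 30, 1], matching Input(shape=(None, 30, 30, 1))
model.fit(X_data, y_data, batch_size=1, epochs=10)  # epochs value is an arbitrary example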

Just train it! :)

If you merge several small sequences from different videos to form a batch, the output of the last layer of your model (the RNN) should already be [batch_size, window_size, num_classes]. Basically, you want to wrap your CNN with reshape layers which will concatenate the frames from each batch (see the sketch after this list):

  • input -> [batch_size, window_size, nchannels, height, width]
  • reshape -> [batch_size * window_size, nchannels, height, width]
  • CNN -> [batch_size * window_size, feat_size]
  • reshape -> [batch_size, window_size, feat_size]
  • RNN -> [batch_size, window_size, num_outputs] (assuming frame-wise predictions)
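
A minimal sketch of that reshape-wrapping idea, written with TensorFlow's channels-last convention rather than the channels-first order listed above (the function name and the cnn/rnn callables are assumptions for illustration, not part of the original answer; the rnn is assumed to be built with return_sequences=True):

import tensorflow as tf

def wrap_cnn_with_reshapes(frames, cnn, rnn):
    # frames: [batch_size, window_size, height, width, nchannels]
    b = tf.shape(frames)[0]        # dynamic batch size
    w = tf.shape(frames)[1]        # dynamic window size
    h, wd, c = frames.shape[2:]    # static spatial/channel dims
    # reshape -> [batch_size * window_size, height, width, nchannels]
    flat = tf.reshape(frames, [-1, h, wd, c])
    # CNN -> [batch_size * window_size, feat_size]
    feats = cnn(flat)
    # reshape -> [batch_size, window_size, feat_size] (feat_size inferred with -1)
    feats = tf.reshape(feats, [b, w, -1])
    # RNN -> [batch_size, window_size, num_outputs]
    return rnn(feats)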

But this will take a lot of memory, so you can set the batch size to 1, which is what you seem to be doing if I understood correctly. In this case you can skip the first reshape.

I'm not sure about the order of the axes above, but the general logic remains the same.

As a side note: if you plan on using batch normalization at some point, you may want to raise the batch size, because consecutive frames from a single segment might not contain much variety by themselves. Also double-check the batch normalization axes, which should cover both the time and batch axes.
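
With Keras layers, that amounts to normalizing over the feature axis only, so statistics are pooled across both batch and time (a small sketch; feat_size=1000 is an assumed example value):

from tensorflow.keras.layers import BatchNormalization, Input

# On a [batch_size, window_size, feat_size] tensor, the default axis=-1
# normalizes each feature with statistics pooled over both the batch and
# time axes, matching the recommendation above.
seq_feats = Input(shape=(None, 1000))
seq_feats_bn = BatchNormalization(axis=-1)(seq_feats)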
