
Image sequence training with CNN and RNN

I'm taking my first steps in learning Deep Learning. I am trying to do Activity Recognition from image sequences (frames) of videos, and I am facing a problem with the training procedure.

First, I need to describe the structure of my image folders:

Making Food     -> p1 -> rgb_frame1.png, rgb_frame2.png, ..., rgb_frame200.png
Making Food     -> p2 -> rgb_frame1.png, rgb_frame2.png, ..., rgb_frame280.png
                  ...
Taking Medicine -> p1 -> rgb_frame1.png, rgb_frame2.png, ..., rgb_frame500.png
                  etc.
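
A minimal sketch of how this layout could be enumerated into (frame paths, label) pairs; DATA_ROOT is a hypothetical placeholder, and note that the frame numbers are not zero-padded, so the file names need a numeric sort:

  import os
  import re

  DATA_ROOT = 'dataset'  # hypothetical root containing the class folders above

  clips = []  # one (sorted frame paths, class label) entry per video folder
  for label in sorted(os.listdir(DATA_ROOT)):            # e.g. 'Making Food'
      class_dir = os.path.join(DATA_ROOT, label)
      for person in sorted(os.listdir(class_dir)):       # e.g. 'p1', 'p2', ...
          clip_dir = os.path.join(class_dir, person)
          # 'rgb_frame10.png' sorts before 'rgb_frame2.png' lexically,
          # so sort on the embedded frame number instead
          frames = sorted((f for f in os.listdir(clip_dir) if f.endswith('.png')),
                          key=lambda f: int(re.search(r'\d+', f).group()))
          clips.append(([os.path.join(clip_dir, f) for f in frames], label))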
      

So the problem is that each folder can have a different number of frames, and as a result I am confused about both the input shape of the model and the timesteps I should use.

I am creating a model (as you see below) with a time-distributed CNN (pre-trained VGG16) and an LSTM. It takes as input all the frames of all classes with the corresponding labels (in the above example, "Making Food" would be the corresponding label for every frame under p1, etc.). The final shape of x_train is (9000, 200, 200, 3), where 9000 corresponds to all frames from all classes, 200 is the height and width, and 3 is the number of image channels. I am reshaping this data to (9000, 1, 200, 200, 3) in order to use it as input to the model.

I am wondering, and worried, that I am not passing a proper timestep and am therefore training incorrectly: I get val_acc ~ 98%, but when testing on a different dataset the accuracy is much lower. Can you suggest another, more efficient way to do this?

  from tensorflow.keras.applications import VGG16
  from tensorflow.keras.layers import Dense, Flatten, LSTM, TimeDistributed
  from tensorflow.keras.models import Model, Sequential

  # Pre-trained VGG16 used as a frozen per-frame feature extractor
  base_model = VGG16(weights='imagenet', include_top=False,
                     input_shape=(200, 200, 3))

  x = base_model.output
  x = Flatten()(x)
  features = Dense(64, activation='relu')(x)
  conv_model = Model(inputs=base_model.input, outputs=features)
  for layer in base_model.layers:
      layer.trainable = False

  # Apply the CNN to every timestep (frame), then model the temporal structure
  model = Sequential()
  model.add(TimeDistributed(conv_model, input_shape=(None, 200, 200, 3)))
  model.add(LSTM(32, return_sequences=True))
  model.add(LSTM(16))
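
As posted, the model stops at LSTM(16), with no classification layer and no compile step. A hedged completion (the class count and optimizer are assumptions, not from the original post) might look like:

  NUM_CLASSES = 5  # assumption: the number of activity classes in the dataset

  model.add(Dense(NUM_CLASSES, activation='softmax'))
  model.compile(optimizer='adam',
                loss='categorical_crossentropy',
                metrics=['accuracy'])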

The structure of your model isn't obviously bad as far as I can see. As far as the issue of a different number of frames goes, the solution is to simply not do that: preprocess your data so that you take the same number of frames from each action.
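
As a minimal sketch of that preprocessing (TIMESTEPS and the load_frame helper are hypothetical, not part of the answer), you can uniformly sample a fixed number of frames from each clip, so that every training example is a sequence of the same length:

  import numpy as np

  TIMESTEPS = 30  # fixed sequence length; an assumption, tune for your data

  def sample_clip(frame_paths, load_frame, timesteps=TIMESTEPS):
      # Pick `timesteps` evenly spaced frames, however long the clip is
      idx = np.linspace(0, len(frame_paths) - 1, timesteps).astype(int)
      return np.stack([load_frame(frame_paths[i]) for i in idx])

x_train then has shape (num_clips, TIMESTEPS, 200, 200, 3): one row per video rather than one row per frame, so the LSTM sees a real temporal sequence instead of the degenerate length-1 "sequences" produced by reshaping to (9000, 1, 200, 200, 3).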

The deeper issue here is more likely just simple overfitting. You don't specify, but based on the fact that you are talking about hosting your training data on a single computer, I imagine that you don't have much training data, and your network is not learning the activities but rather just learning to recognize your training data. Consider that VGG16 had about 1.2 million distinct training examples and was trained for weeks on top-end GPUs, just to distinguish 1000 classes of static images. Arguably, learning temporal aspects and activities should require a similar amount of training data. Starting with VGG as a base and adding onto it was a good idea, since your network doesn't have to relearn static image recognition features, but the conceptual leap from static images to dynamic videos that your network needs to make is still a big one!
