
Image sequence training with CNN and RNN

I'm taking my first steps in learning Deep Learning. I am trying to do Activity Recognition from image sequences (frames) of videos, and I am running into a problem with the training procedure.

First, let me describe the structure of my image folders:

Making Food     -> p1 -> rgb_frame1.png, rgb_frame2.png, ..., rgb_frame200.png
Making Food     -> p2 -> rgb_frame1.png, rgb_frame2.png, ..., rgb_frame280.png
...
Taking Medicine -> p1 -> rgb_frame1.png, rgb_frame2.png, ..., rgb_frame500.png
etc.
      

So the problem is that each folder can have a different number of frames, so I get confused both about the input shape of the model and about the number of timesteps I should use.

I am creating a model (shown below) with a time-distributed CNN (pre-trained VGG16) followed by an LSTM. It takes as input all the frames of all classes together with the corresponding labels (in the example above, "Making Food" would be the label for p1_rgb_frame1, etc.). The final shape of x_train is (9000, 200, 200, 3), where 9000 corresponds to all frames from all classes, 200 is the height and width, and 3 is the number of channels. I reshape this data to (9000, 1, 200, 200, 3) in order to use it as input to the model (see the reshape sketch after the model code below).

I am worried that I am not passing a proper number of timesteps and am therefore training the model incorrectly: I get val_acc ~ 98%, but when testing on a different dataset the accuracy is much lower. Can you suggest a more efficient way to do this?

  from tensorflow.keras.applications import VGG16
  from tensorflow.keras.layers import Dense, Flatten, LSTM, TimeDistributed
  from tensorflow.keras.models import Model, Sequential

  # Pre-trained VGG16 used as a frozen per-frame feature extractor
  # (the base_model construction is not shown in the question; assumed here)
  base_model = VGG16(weights='imagenet', include_top=False, input_shape=(200, 200, 3))

  x = base_model.output
  x = Flatten()(x)
  features = Dense(64, activation='relu')(x)
  conv_model = Model(inputs=base_model.input, outputs=features)
  for layer in base_model.layers:
      layer.trainable = False

  # Apply the CNN to every timestep, then model the sequence with stacked LSTMs
  model = Sequential()
  model.add(TimeDistributed(conv_model, input_shape=(None, 200, 200, 3)))
  model.add(LSTM(32, return_sequences=True))
  model.add(LSTM(16))
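
For reference, the reshape I describe above looks roughly like this (a minimal sketch; x_train is the array of all 9000 frames):

  import numpy as np

  # x_train holds every frame from every class: shape (9000, 200, 200, 3).
  # Adding a timestep axis of length 1 gives (9000, 1, 200, 200, 3), i.e.
  # 9000 "sequences" that each contain a single frame, so the LSTM never
  # sees more than one frame at a time.
  x_train = x_train.reshape((-1, 1, 200, 200, 3))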

The structure of your model isn't obviously bad as far as I can see. As for the issue of each folder having a different number of frames, the solution is simply not to do that: preprocess your data so that you take the same number of frames from each action, for example along the lines of the sketch below.
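
Here is a minimal sketch of that kind of preprocessing, assuming the folder layout from your question (class folder -> sequence folder -> rgb_frame*.png); DATA_DIR, T and the 200x200 frame size are illustrative values you would adapt:

  import os
  import re
  import numpy as np
  from tensorflow.keras.preprocessing.image import load_img, img_to_array

  DATA_DIR = 'dataset'   # assumed root: dataset/<class>/<pX>/rgb_frameN.png
  T = 40                 # fixed number of frames sampled from every sequence

  def frame_number(path):
      # 'rgb_frame12.png' -> 12, so frames sort numerically, not lexicographically
      return int(re.search(r'(\d+)', os.path.basename(path)).group(1))

  clips, labels = [], []
  for label, class_name in enumerate(sorted(os.listdir(DATA_DIR))):
      class_dir = os.path.join(DATA_DIR, class_name)
      for seq_name in sorted(os.listdir(class_dir)):
          seq_dir = os.path.join(class_dir, seq_name)
          frames = sorted((os.path.join(seq_dir, f)
                           for f in os.listdir(seq_dir) if f.endswith('.png')),
                          key=frame_number)
          # Sample T frames spread evenly over the sequence, so every clip
          # has the same length no matter how long the original video was.
          idx = np.linspace(0, len(frames) - 1, T).astype(int)
          clip = [img_to_array(load_img(frames[i], target_size=(200, 200))) for i in idx]
          clips.append(np.stack(clip))
          labels.append(label)

  x_train = np.stack(clips)   # shape: (num_clips, T, 200, 200, 3)
  y_train = np.array(labels)  # one label per clip, not one per frame

With x_train shaped (num_clips, T, 200, 200, 3), the TimeDistributed/LSTM model above sees a real temporal dimension instead of sequences of length 1, and each clip carries a single activity label.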

The deeper issue here is more likely just simple overfitting. You don't specify, but based on the fact that you are talking about hosting your training data on a single computer, I imagine that you don't have much training data, and that your network is not learning the activities but rather just learning to recognize your training data. Consider that VGG16 had about 1.2 million distinct training examples and was trained for weeks on top-end GPUs, just to distinguish 1000 classes of static images. Arguably, learning temporal aspects and activities should require a similar amount of training data. You had a good idea in starting with VGG as a base and adding onto it so your network doesn't have to relearn static image recognition features, but the conceptual leap from static images to dynamic videos that your network needs to learn is still a big one!
