
Keras CNN: Add text as additional input besides image to CNN

I am trying to train a CNN for object classification, and I would like to feed in some text features in addition to the image.

I found an example of this being done here http://cbonnett.github.io/Insight.html

The author constructs two models, a CNN for the image recognition and a normal ANN for the text. Finally he merges them together and applies a softmax activation. As such, his pipeline looks as follows:

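# Legacy Keras 1.x API: the Merge layer was removed in later Keras releases.
from keras.models import Sequential
from keras.layers import Merge, Dense, Dropout

# cnn_model, text_model, do (dropout rate) and n_classes are defined elsewhere in the linked post
merged = Merge([cnn_model, text_model], mode='concat')

### final_model takes the combined models and adds a softmax classifier to it
final_model = Sequential()
final_model.add(merged)
final_model.add(Dropout(do))
final_model.add(Dense(n_classes, activation='softmax'))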

I wonder if this is the preferred method of combining image + text or if there are alternative ways of solving such a task using Keras? Stated differently, would it be possible (or even make sense) to include the text as an input directly to the CNN, such that the CNN takes care of both images and text?

You are on the right track. You can also use a CNN to process text, and it is often faster than an RNN. However, you cannot use the same CNN for both text and images: text is a 1D input and an image is a 2D input, and the two come from different source distributions. So you will still end up with two sub-models:

  1. Process the image using a CNN model.
  2. Process the text using another model (an RNN, an ANN, a CNN, or simply one-hot encoded words). By CNN I usually mean a 1D CNN that runs over the words in a sentence.
  3. Merge the two latent representations, which carry the information about the image and the text.
  4. Run a few final Dense layers for classification.
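
The four steps above can be sketched with the Keras functional API. This is only a minimal illustration, not the linked author's model: the input sizes, layer widths, and the names `build_two_branch_model`, `IMG_SHAPE`, `SEQ_LEN`, `VOCAB_SIZE` and `N_CLASSES` are assumptions made up for the example, and the text is assumed to arrive as integer word indices.

```python
from tensorflow import keras
from tensorflow.keras import layers

# Hypothetical sizes, chosen only for illustration.
IMG_SHAPE = (64, 64, 3)   # height, width, channels
SEQ_LEN = 20              # word indices per text sample
VOCAB_SIZE = 5000         # size of the word index
N_CLASSES = 10


def build_two_branch_model():
    # 1. Image branch: a small 2D CNN.
    image_in = keras.Input(shape=IMG_SHAPE, name="image")
    x = layers.Conv2D(16, 3, activation="relu")(image_in)
    x = layers.MaxPooling2D()(x)
    x = layers.Conv2D(32, 3, activation="relu")(x)
    x = layers.GlobalAveragePooling2D()(x)

    # 2. Text branch: word embedding followed by a 1D CNN over the sequence.
    text_in = keras.Input(shape=(SEQ_LEN,), dtype="int32", name="text")
    t = layers.Embedding(VOCAB_SIZE, 32)(text_in)
    t = layers.Conv1D(32, 3, activation="relu")(t)
    t = layers.GlobalMaxPooling1D()(t)

    # 3. Merge the two latent vectors along the feature axis.
    merged = layers.concatenate([x, t])

    # 4. Final Dense layers for classification.
    h = layers.Dense(64, activation="relu")(merged)
    h = layers.Dropout(0.5)(h)
    out = layers.Dense(N_CLASSES, activation="softmax")(h)
    return keras.Model(inputs=[image_in, text_in], outputs=out)


model = build_two_branch_model()
```

Training then takes both inputs together, for example `model.fit([images, token_ids], labels, ...)` after compiling with a categorical cross-entropy loss. Swapping the text branch for an RNN or a plain Dense network only changes step 2; the merge and the classifier stay the same.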

Let me explain it this way. In a standard CNN, you first apply the convolutions, and the aggregated features then go to a fully connected network. Here, instead of one convolutional branch you have two, one for the text and one for the image. The only additional step is to concatenate the two pieces of information after flattening each convolution result. I suggest you look at my code at the link below, which uses a CNN on both a title and a description and concatenates them. It is similar to your case: treat your text data as my 'description' and your image data as my 'title'.

https://www.kaggle.com/jingqliu/fasttext-conv2d-with-tf-on-title-and-description

It is written in TensorFlow, but I believe you will get the idea!
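
To make the flatten-then-concatenate step concrete, here is a small shape-only illustration using NumPy arrays in place of real convolution outputs. The array sizes are invented for the example and are not taken from the linked kernel.

```python
import numpy as np

batch = 4

# Hypothetical feature maps, standing in for the outputs of the last conv layers.
image_features = np.random.rand(batch, 8, 8, 32)  # 2D conv output: (batch, height, width, channels)
text_features = np.random.rand(batch, 18, 16)     # 1D conv output: (batch, steps, channels)

# Flatten everything except the batch axis.
image_flat = image_features.reshape(batch, -1)    # shape (4, 2048)
text_flat = text_features.reshape(batch, -1)      # shape (4, 288)

# Concatenate along the feature axis to get one vector per sample.
combined = np.concatenate([image_flat, text_flat], axis=1)
print(combined.shape)  # (4, 2336)
```

The combined vector is what the final fully connected layers see, so the two branches can have entirely different shapes up to this point.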
