
Validation accuracy is much less than Training accuracy

I am using the MOSI dataset for multimodal sentiment analysis, where for now I am training the model on the text modality only. For text, I am using 300-dimensional GloVe embeddings. My total vocab size is 2173 and my padded sequence length is 30. My target array looks like [0,0,0,0,0,0,1], where the leftmost position is highly negative and the rightmost is highly positive.

I am splitting the dataset like this

X_train, X_test, y_train, y_test = train_test_split(WDatasetX, y7, test_size=0.20, random_state=42)
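
Since a stratified split comes up later in this thread (and the question's update mentions stratifying), here is a minimal sketch of the same split with stratification; the stratify argument and the argmax conversion are additions for illustration, not part of the original code:

from sklearn.model_selection import train_test_split
import numpy as np

# Stratify on the class index so every class keeps roughly the same
# proportion in the train and test splits (y7 is the one-hot label array).
X_train, X_test, y_train, y_test = train_test_split(
    WDatasetX, y7,
    test_size=0.20,
    random_state=42,
    stratify=np.argmax(y7, axis=1))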

My tokenization process is

MAX_NB_WORDS = 3000
tokenizer = Tokenizer(num_words=MAX_NB_WORDS,oov_token = "OOV")
tokenizer.fit_on_texts(Text_X_Train)
tokenized_X_train = tokenizer.texts_to_sequences(Text_X_Train)
tokenized_X_test = tokenizer.texts_to_sequences(Text_X_Test)
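
The padding step is not shown, but given the padded sequence length of 30 mentioned above, it presumably looks something like the sketch below (the variable names and the padding/truncating side are assumptions):

from keras.preprocessing.sequence import pad_sequences

sequence_length = 30
X_train_pad = pad_sequences(tokenized_X_train, maxlen=sequence_length,
                            padding='post', truncating='post')
X_test_pad = pad_sequences(tokenized_X_test, maxlen=sequence_length,
                           padding='post', truncating='post')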

My embedding matrix:

vocab_size = len(tokenizer.word_index) + 1
embed_dim = 300  # GloVe embedding dimension

def embedding_matrix_filteration():
    all_embs = np.stack(list(embeddings_index.values()))
    print(all_embs.shape)
    emb_mean, emb_std = np.mean(all_embs), np.std(all_embs)
    print(emb_mean)
    # Matrix of shape (vocab_size, embed_dim) filled with values drawn from a
    # Gaussian distribution; rows with no GloVe vector keep this random init.
    embedding_matrix = np.random.normal(emb_mean, emb_std, (vocab_size, embed_dim))
    print(embedding_matrix.shape)
    print("length of word2id:", len(word2id))
    embeddedCount = 0
    for word, idx in tokenizer.word_index.items():
        embedding_vector = embeddings_index.get(word.lower())
        if word == ' ':
            embedding_vector = np.zeros(embed_dim)
        if embedding_vector is not None:
            embedding_matrix[idx] = embedding_vector
            embeddedCount += 1
        else:
            print(word)
            print("$$$")
    # Words common to the GloVe vocabulary and the dataset vocabulary
    print('total embedded:', embeddedCount, 'common words')
    print("length of word2id:", len(word2id))
    print(len(embedding_matrix))
    return embedding_matrix

emb = embedding_matrix_filteration()

Model Architecture:

Embedding Layer:

embedding_layer = Embedding(
    vocab_size,
    300,
    weights=[emb],
    trainable=False,
    input_length=sequence_length
)

My model:

from keras import regularizers, layers
from keras.models import Sequential
from keras.layers import Bidirectional, Dense, Dropout

model = Sequential()
model.add(embedding_layer)
model.add(Bidirectional(layers.LSTM(512,return_sequences=True)))
model.add(Bidirectional(layers.LSTM(512,return_sequences=True)))
model.add(Bidirectional(layers.LSTM(256,return_sequences=True)))
model.add(Bidirectional(layers.LSTM(256)))#kernel_regularizer=regularizers.l2(0.001)
model.add(Dense(128, activation='relu'))
# model.add(Dropout(0.2))
model.add(Dense(128, activation='relu'))
# model.add(Dropout(0.2))
model.add(Dense(7, activation='softmax'))

For some reason, even when my training accuracy reaches 80%, the validation accuracy still remains very low. I have tried different regularization techniques, optimizers, and loss functions, but the result is the same. I don't know why.


Please Help!!

Edit: The total no. of tokens is 2719 and the total no. of sentences (including the test and train datasets) is 2183.

Compile:

model.compile(optimizer='adam',
              loss='mean_squared_error',
              metrics=['accuracy'])

UPDATED STATS:

I have decreased the label size from 7 to 3, i.e. [0,1,0], where the positions correspond to +ve, neutral, -ve.

model = Sequential()
model.add(embedding_layer)
model.add(Bidirectional(layers.LSTM(16,activation='relu'))) 
model.add(Dropout(0.2))
model.add(Dense(3, activation='softmax'))

Compiled:

model.compile(optimizer=keras.optimizers.Adam(learning_rate=0.00005),
              loss='categorical_crossentropy',
              metrics=['accuracy'])

Graphs: [image]

Training: [image]

But the loss is still high. Also, I have stratified the dataset.

A couple of recommendations:

  1. Use categorical_crossentropy instead of mean_squared_error; it helps a lot when doing classification (the latter can also work, but the former does the job better).
  2. Are all your labels mutually exclusive? If so, use softmax + categorical_crossentropy; otherwise (e.g. if a label can look like [1,0,0,0,0,0,1]) use sigmoid + binary_crossentropy.
  3. Decrease the size of the model first, and only add Dropout() if the overfitting problem persists. Use only one layer of LSTM.
  4. Reduce the number of units (even with a single LSTM layer, 64/128 units would probably suffice).
  5. You can use a bidirectional LSTM (I would even opt for bidirectional GRUs, since they are simpler, to see how the performance behaves); see the sketch after this list.
  6. Ensure that you do a stratified split (that way, every class appears in both the training set and the validation set, in roughly the same proportion).
  7. Start with a small(er) learning rate (0.0001 / 0.00005).
  8. Establish an objective/correct baseline. If you have very little data, particularly since you take only the "text" modality of a multimodal dataset and have 7 different classes, it is likely you will not reach a very high accuracy.
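
Putting recommendations 1-5 and 7 together, a minimal sketch (the layer size, the GRU choice and the learning rate are illustrative assumptions, not tuned values):

from keras.models import Sequential
from keras import layers
from keras.optimizers import Adam

small_model = Sequential([
    embedding_layer,                        # the frozen GloVe embedding layer from the question
    layers.Bidirectional(layers.GRU(64)),   # one small recurrent layer
    layers.Dense(7, activation='softmax'),  # mutually exclusive classes
])
small_model.compile(optimizer=Adam(learning_rate=1e-4),
                    loss='categorical_crossentropy',
                    metrics=['accuracy'])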

Bear in mind that, in order to get a reasonable final result in your case, you need to employ a data-centric approach rather than a model-centric one. Regardless of the possible model improvements, if the data is scarce and not comprehensive, you will not be able to achieve great results.

A large difference between Train and Validation stats typically indicates overfitting of models to the Train data.

To minimize this I do a few things:

  1. Reduce the size of the model.
  2. Add a few Dropout or similar layers to the model. I have had good success using layers such as layers.LeakyReLU(alpha=0.8); see the sketch below.

See guidance here: https://www.tensorflow.org/tutorials/keras/overfit_and_underfit#strategies_to_prevent_overfitting
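
As an illustration of "smaller model + a few Dropout layers", a reduced architecture might look like the sketch below; the exact unit counts and dropout rates are guesses, not tuned values:

from keras.models import Sequential
from keras import layers

reduced_model = Sequential([
    embedding_layer,
    layers.Bidirectional(layers.LSTM(64)),
    layers.Dropout(0.3),              # randomly drop activations during training
    layers.Dense(32),
    layers.LeakyReLU(alpha=0.8),      # the LeakyReLU layer mentioned above
    layers.Dropout(0.3),
    layers.Dense(7, activation='softmax'),
])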

How long is your dataset (how many sentences)? 2179 tokens does not seem like much, and it seems to me like your model is way too big for the task. I wouldn't add 4 layers of LSTM; I would go with 1 or 2.

from keras import regularizers,layers

model = Sequential()
model.add(embedding_layer)
model.add(Bidirectional(layers.LSTM(64,return_sequences=True)))
model.add(Bidirectional(layers.LSTM(32)))
model.add(Dense(16, activation='relu'))
# model.add(Dropout(0.2))
model.add(Dense(7, activation='softmax'))

As for the training, 200 epochs seems long; if your model doesn't seem to converge after 20, I would reset and try with a simpler architecture.
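
One way to avoid a fixed 200-epoch run is an EarlyStopping callback that stops training once the validation loss stops improving. A minimal sketch, assuming X_train / X_test hold the padded sequences and the one-hot labels from the question (the patience and batch size are arbitrary choices):

from keras.callbacks import EarlyStopping

early_stop = EarlyStopping(monitor='val_loss', patience=5,
                           restore_best_weights=True)
history = model.fit(X_train, y_train,
                    validation_data=(X_test, y_test),
                    epochs=200, batch_size=32,
                    callbacks=[early_stop])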
