Predicting from a trained LSTM model

Question

I have trained an LSTM model on some data I collected. I want to categorise text as either canine or feline.

I am attempting to predict the class of a string of text like so:

json_file = open('model.json', 'r')
loaded_model_json = json_file.read()
json_file.close()
loaded_model = model_from_json(loaded_model_json)

# load weights into new model
loaded_model.load_weights("lstm.hd5")
print("Loaded model from disk")


text_to_predict = ['A 2‐year‐old male domestic shorthair cat was presented for a progressive history of abnormal posture, behavior, and mentation. Menace response was absent bilaterally, and generalized tremors were identified on neurological examination. A neuroanatomical diagnosis of diffuse brain dysfunction was made. A neurodegenerative disorder was suspected. Magnetic resonance imaging findings further supported the clinical suspicion. Whole‐genome sequencing of the affected cat with filtering of variants against a database of unaffected cats was performed. Candidate variants were confirmed by Sanger sequencing followed by genotyping of a control population. Two homozygous private (unique to individual or families and therefore absent from the breed‐matched controlled population) protein‐changing variants in the major facilitator superfamily domain 8 (MFSD8) gene, a known candidate gene for neuronal ceroid lipofuscinosis type 7 (CLN7), were identified. The affected cat was homozygous for the alternative allele at both variants. This is the first report of a pathogenic alteration of the MFSD8 gene in a cat strongly suspected to have CLN7.']




MAX_SEQUENCE_LENGTH = 352
MAX_NB_WORDS = 2000

tokenizer = Tokenizer(num_words=MAX_NB_WORDS, split=' ')
seq = tokenizer.texts_to_sequences(text_to_predict)
padded = pad_sequences(seq, maxlen=MAX_SEQUENCE_LENGTH)
pred = loaded_model.predict(padded)
labels = ['canine', 'feline']
print(pred, labels[np.argmax(pred)])

However, the predictions all come back the same, irrespective of which string I choose to classify.

[[0.5212073 0.47879276]] canine

I am also unsure why I have to set MAX_SEQUENCE_LENGTH to 352; it seems my model expects an array of that size. Setting it to any other value raises this error:

ValueError: Error when checking input: expected embedding_1_input to have shape (352,) but got array with shape (250,)

For reference, the model is trained with this code:

data = pd.read_csv('data.csv')
data['Text'] = data['Text'].apply((lambda x: re.sub('[^a-zA-z0-9\s]','',x)))

MAX_NB_WORDS = 2000
embed_dim = 128
lstm_out = 196

tokenizer = Tokenizer(num_words=MAX_NB_WORDS, split=' ')
tokenizer.fit_on_texts(data['Text'].values)
X = tokenizer.texts_to_sequences(data['Text'].values)
X = pad_sequences(X)


model = Sequential()
model.add(Embedding(MAX_NB_WORDS, embed_dim,input_length = X.shape[1]))  # was max_fatures, which is never defined
model.add(SpatialDropout1D(0.4))
model.add(LSTM(lstm_out, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(2,activation='softmax'))
model.compile(loss = 'categorical_crossentropy', optimizer='adam',metrics = ['accuracy'])
print(model.summary())

# serialize model to JSON
model_json = model.to_json()
with open("model.json", "w") as json_file:
    json_file.write(model_json)

print('model string has been saved')

Y =  data[['canine','feline']]
X_train, X_test, Y_train, Y_test = train_test_split(X,Y, test_size = 0.33, random_state = 42)
print(X_train.shape,Y_train.shape)
print(X_test.shape,Y_test.shape)

batch_size = 32
model.fit(X_train, Y_train, epochs = 30, batch_size=batch_size, verbose = 2)

#save model for future use.
model.save('lstm.hd5')

Any help would be greatly appreciated :D

Answer 1

From your question, I understand that the model predicts correctly right after training, but predicts the same class for every input after the saved model is loaded.

I recently faced the same issue. The solution is to save the tokenizer the model was trained with to a pickle file, and to load that file when you want to run predictions with the loaded model. A newly created Tokenizer that has not been fitted has an empty vocabulary, so it cannot map words to the indices the model learned.

Code for Saving the Tokenizer in a Pickle File:

import pickle

# saving
with open('tokenizer.pickle', 'wb') as handle:
    pickle.dump(tokenizer, handle, protocol=pickle.HIGHEST_PROTOCOL)

Code for Loading the Pickle File:

with open('tokenizer.pickle', 'rb') as handle:
    tokenizer2 = pickle.load(handle)
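
To see why a freshly constructed tokenizer leads to the same prediction for every input, it helps to look at what actually reaches the model. The sketch below uses a small pure-Python stand-in (`ToyTokenizer` and `pad` are hypothetical names, not the Keras classes) that mimics, as far as I understand it, the relevant behaviour: words missing from the fitted vocabulary are dropped, and sequences are left-padded with zeros. An unfitted tokenizer has an empty vocabulary, so every text turns into the same all-zero row.

```python
from collections import Counter


class ToyTokenizer:
    """Hypothetical stand-in for a word-index tokenizer (not the Keras class)."""

    def __init__(self):
        self.word_index = {}  # stays empty until fit_on_texts is called

    def fit_on_texts(self, texts):
        counts = Counter(w for t in texts for w in t.lower().split())
        # The most frequent word gets index 1; index 0 is reserved for padding.
        for i, (word, _) in enumerate(counts.most_common(), start=1):
            self.word_index[word] = i

    def texts_to_sequences(self, texts):
        # Words that are not in the vocabulary are silently dropped.
        return [[self.word_index[w] for w in t.lower().split() if w in self.word_index]
                for t in texts]


def pad(sequences, maxlen):
    """Left-pad with zeros and keep at most the last `maxlen` tokens (maxlen > 0)."""
    return [[0] * (maxlen - len(s[-maxlen:])) + s[-maxlen:] for s in sequences]


train_texts = ["the dog barked", "the cat purred"]
new_texts = ["the cat slept", "a dog barked loudly"]

fitted = ToyTokenizer()
fitted.fit_on_texts(train_texts)
unfitted = ToyTokenizer()  # what the question's prediction code effectively builds

fitted_rows = pad(fitted.texts_to_sequences(new_texts), maxlen=5)
unfitted_rows = pad(unfitted.texts_to_sequences(new_texts), maxlen=5)

print(fitted_rows)    # rows differ from text to text
print(unfitted_rows)  # every row is all zeros
```

Because the model only ever sees the padded rows, identical all-zero rows can only produce identical probabilities, whatever the original text was.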

In addition to the above, here are some other observations about your code:

Use the same padding when training the model and when running predictions with the loaded model.

So, you can change the code from

X = pad_sequences(X)

to

X = pad_sequences(X, maxlen=MAX_SEQUENCE_LENGTH)

The values of MAX_SEQUENCE_LENGTH and MAX_NB_WORDS should be the same at training time and at prediction time.

Apply the same data preprocessing steps at training time and at prediction time. For example, apply the function (lambda x: re.sub('[^a-zA-z0-9\s]','',x)) to the text you want to classify as well.
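
The fixed length of 352 most likely comes from the training script: as far as I know, `pad_sequences(X)` without `maxlen` pads every row to the longest sequence in the data, and `input_length = X.shape[1]` then bakes that length into the model. One way to keep these values in sync is to write them to a small file at training time and read them back before predicting. The sketch below does this with the standard `json` module; the file name `preprocessing.json` and the helper names are assumptions made for illustration only.

```python
import json
import os
import tempfile


def save_settings(path, max_sequence_length, max_nb_words):
    """Write the preprocessing settings used for training to a JSON file."""
    settings = {"max_sequence_length": max_sequence_length,
                "max_nb_words": max_nb_words}
    with open(path, "w") as handle:
        json.dump(settings, handle)


def load_settings(path):
    """Read the settings back before running predictions."""
    with open(path) as handle:
        return json.load(handle)


# Training side: in the real script the length would be X.shape[1].
settings_path = os.path.join(tempfile.gettempdir(), "preprocessing.json")
save_settings(settings_path, max_sequence_length=352, max_nb_words=2000)

# Prediction side: reuse the stored values instead of retyping them.
settings = load_settings(settings_path)
MAX_SEQUENCE_LENGTH = settings["max_sequence_length"]
MAX_NB_WORDS = settings["max_nb_words"]
print(MAX_SEQUENCE_LENGTH, MAX_NB_WORDS)
```

Storing the values next to the pickled tokenizer means the prediction script cannot drift away from the training script when one of them is edited later.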

The training code with these changes applied is shown below:

data = pd.read_csv('data.csv')
data['Text'] = data['Text'].apply((lambda x: re.sub('[^a-zA-z0-9\s]','',x)))

MAX_NB_WORDS = 2000
MAX_SEQUENCE_LENGTH = 352  # must match the value used at prediction time
embed_dim = 128
lstm_out = 196

tokenizer = Tokenizer(num_words=MAX_NB_WORDS, split=' ')
tokenizer.fit_on_texts(data['Text'].values)

import pickle  # IMPORTANT STEP

# saving
with open('tokenizer.pickle', 'wb') as handle:
    pickle.dump(tokenizer, handle, protocol=pickle.HIGHEST_PROTOCOL)

X = tokenizer.texts_to_sequences(data['Text'].values)
X = pad_sequences(X, maxlen = MAX_SEQUENCE_LENGTH) # Change Number 2

model = Sequential()
model.add(Embedding(MAX_NB_WORDS, embed_dim,input_length = X.shape[1]))  # was max_fatures, which is never defined
model.add(SpatialDropout1D(0.4))
model.add(LSTM(lstm_out, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(2,activation='softmax'))
model.compile(loss = 'categorical_crossentropy', optimizer='adam',metrics = ['accuracy'])
print(model.summary())

# serialize model to JSON
model_json = model.to_json()
with open("model.json", "w") as json_file:
    json_file.write(model_json)

print('model string has been saved')

Y =  data[['canine','feline']]
X_train, X_test, Y_train, Y_test = train_test_split(X,Y, test_size = 0.33, random_state = 42)
print(X_train.shape,Y_train.shape)
print(X_test.shape,Y_test.shape)

batch_size = 32
model.fit(X_train, Y_train, epochs = 30, batch_size=batch_size, verbose = 2)

#save model for future use.
model.save('lstm.hd5')

The modified prediction code for the loaded model is shown below:

with open('model.json', 'r') as json_file:
    loaded_model_json = json_file.read()
loaded_model = model_from_json(loaded_model_json)

# load weights into new model
loaded_model.load_weights("lstm.hd5")
print("Loaded model from disk")


text_to_predict = ['A 2‐year‐old male domestic shorthair cat was presented for a progressive history of abnormal posture, behavior, and mentation. Menace response was absent bilaterally, and generalized tremors were identified on neurological examination. A neuroanatomical diagnosis of diffuse brain dysfunction was made. A neurodegenerative disorder was suspected. Magnetic resonance imaging findings further supported the clinical suspicion. Whole‐genome sequencing of the affected cat with filtering of variants against a database of unaffected cats was performed. Candidate variants were confirmed by Sanger sequencing followed by genotyping of a control population. Two homozygous private (unique to individual or families and therefore absent from the breed‐matched controlled population) protein‐changing variants in the major facilitator superfamily domain 8 (MFSD8) gene, a known candidate gene for neuronal ceroid lipofuscinosis type 7 (CLN7), were identified. The affected cat was homozygous for the alternative allele at both variants. This is the first report of a pathogenic alteration of the MFSD8 gene in a cat strongly suspected to have CLN7.']

# CHANGE 3: text_to_predict is a plain list (no .apply), so clean each string with a list comprehension
text_to_predict = [re.sub(r'[^a-zA-z0-9\s]', '', x) for x in text_to_predict]

MAX_SEQUENCE_LENGTH = 352
MAX_NB_WORDS = 2000

# Loading the Pickle File ==> IMPORTANT STEP
with open('tokenizer.pickle', 'rb') as handle:
    tokenizer2 = pickle.load(handle)

# tokenizer = Tokenizer(num_words=MAX_NB_WORDS, split=' ') # THIS IS NOT REQUIRED
seq = tokenizer2.texts_to_sequences(text_to_predict)
padded = pad_sequences(seq, maxlen=MAX_SEQUENCE_LENGTH)
pred = loaded_model.predict(padded)
labels = ['canine', 'feline']
print(pred, labels[np.argmax(pred)])
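
The prediction steps above can be gathered into one function so that cleaning, tokenizing and padding always happen in the same order. This is only a sketch: `predict_label` is a hypothetical helper, and it takes the tokenizer, the model and the padding function as arguments (in the code above these would be `tokenizer2`, `loaded_model` and `pad_sequences`). The small stand-in classes exist only so that the snippet runs without Keras installed.

```python
import re


def predict_label(text, tokenizer, model, pad_fn, max_len, labels):
    """Clean one text like the training data, then tokenize, pad and predict."""
    cleaned = re.sub(r'[^a-zA-z0-9\s]', '', text)  # same pattern as in training
    seq = tokenizer.texts_to_sequences([cleaned])
    padded = pad_fn(seq, maxlen=max_len)
    probs = list(model.predict(padded)[0])
    best = max(range(len(probs)), key=probs.__getitem__)  # index of the largest probability
    return labels[best], probs


# Stand-ins so the sketch runs on its own; replace them with the real objects.
class FakeTokenizer:
    def texts_to_sequences(self, texts):
        vocab = {"dog": 1, "cat": 2}
        return [[vocab[w] for w in t.lower().split() if w in vocab] for t in texts]


class FakeModel:
    def predict(self, rows):
        # Pretend that token 2 ("cat") means feline.
        return [[0.1, 0.9] if 2 in row else [0.8, 0.2] for row in rows]


def fake_pad(sequences, maxlen):
    # Left-pad with zeros, keeping at most the last `maxlen` tokens.
    return [[0] * (maxlen - len(s)) + s[-maxlen:] for s in sequences]


label, probs = predict_label("A 2-year-old cat!", FakeTokenizer(), FakeModel(),
                             fake_pad, max_len=4, labels=['canine', 'feline'])
print(label, probs)
```

With the real objects, the call would look like `predict_label(text, tokenizer2, loaded_model, pad_sequences, MAX_SEQUENCE_LENGTH, labels)`.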

Please reach out if these changes don't give you the desired output, and I will be happy to help.

Hope this helps. Happy Learning!

solution1
0 2020-05-29 12:11:30
