
Test data giving prediction error in Keras in the model with Embedding layer

I have trained a Bi-LSTM model to do NER on a set of sentences. For this I took the distinct words present, built a mapping between each word and a number, and then created the Bi-LSTM model using those numbers. I then created and pickled that model object.
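Roughly, the mapping looks like this (a simplified sketch with illustrative names, not my exact code):

# sentences: list of tokenized training sentences (illustrative)
words = list(set(w for s in sentences for w in s))
word2idx = {w: i for i, w in enumerate(words)}      # word -> integer id
X = [[word2idx[w] for w in s] for s in sentences]   # sentences as integer sequences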

Now I get a set of new sentences containing certain words that the trained model has not seen, so these words do not have a numeric value yet. Thus, when I test the new sentences on my existing model, it gives an error: it is not able to find the words or features because numeric values for them do not exist.
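For example, with the mapping sketched above (new_sentences stands for the tokenized new sentences), the lookup simply fails:

# any word absent from the training vocabulary has no entry in word2idx,
# so this raises a KeyError for the unseen word
X_new = [[word2idx[w] for w in s] for s in new_sentences]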

To circumvent this error, I gave a new integer value to every new word that I see.

However, when I load the model and test it, it gives the following error:

InvalidArgumentError: indices[0,24] = 5444 is not in [0, 5442)   [[Node: embedding_14_16/Gather = Gather[Tindices=DT_INT32, Tparams=DT_FLOAT, validate_indices=true,
_device="/job:localhost/replica:0/task:0/device:CPU:0"](embedding_14_16/embeddings/read, embedding_14_16/Cast)]]

The training data contains 5445 words including the padding word, so the valid index range should be [0, 5444].

5444 is the index value I have given to the padding in the test sentences. It is not clear to me why it assumes the index values to range over [0, 5442).
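From what I can tell, the valid range is determined by the input_dim of the Embedding layer: an Embedding(input_dim=N, ...) only accepts indices in [0, N). A tiny standalone example (not my actual model, just to illustrate the constraint):

import numpy as np
from keras.models import Sequential
from keras.layers import Embedding

toy = Sequential([Embedding(input_dim=5442, output_dim=50, input_length=3)])
# index 5444 is outside [0, 5442), so this raises the same InvalidArgumentError
toy.predict(np.array([[0, 1, 5444]]))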

I have used the base code available at the following link: https://www.kaggle.com/gagandeep16/ner-using-bidirectional-lstm

The code:

from keras.models import Model
from keras.layers import Input, Embedding, Dropout, Bidirectional, LSTM, TimeDistributed, Dense

input = Input(shape=(max_len,))
model = Embedding(input_dim=n_words, output_dim=50,
                  input_length=max_len)(input)

model = Dropout(0.1)(model)
model = Bidirectional(LSTM(units=100, return_sequences=True, recurrent_dropout=0.1))(model)
out = TimeDistributed(Dense(n_tags, activation="softmax"))(model)  # softmax output layer

model = Model(input, out)
model.compile(optimizer="rmsprop", loss="categorical_crossentropy", metrics=["accuracy"])

# number of epochs - also used for output file naming
epoch_num=20
domain="../data/Laptop_Prediction_Corrected"
output_file_name=domain+"_E"+str(epoch_num)+".xlsx"

model_name="../models/Laptop_Prediction_Corrected"
output_model_filename=model_name+"_E"+str(epoch_num)+".sav"


history = model.fit(X_tr, np.array(y_tr), batch_size=32, epochs=epoch_num, validation_split=0.1, verbose=1)

max_len is the (padded) number of words in a sentence and n_words is the vocabulary size. In the model the padding has been done using the following code, where n_words=5441:

X = pad_sequences(maxlen=max_len, sequences=X, padding="post", value=n_words)

The padding in the new dataset:

max_len = 50
# this is to pad sentences to the maximum length possible
#-> so all records of X will be of the same length

#X = pad_sequences(maxlen=max_len, sequences=X, padding="post", value=res_new_word2idx["pad_blank"])

#X = pad_sequences(maxlen=max_len, sequences=X, padding="post", value=5441)

I am not sure which of these padding values is correct.

However, the vocabulary only includes the words in the training data. So when I run:

p = loaded_model.predict(X)

How can I use predict for text sentences that contain words not present in the initial vocabulary?

You can use the Keras Tokenizer class and its methods to easily tokenize and preprocess the input data. Specify the vocabulary size when instantiating it, then call its fit_on_texts() method on the training data to construct a vocabulary based on the given texts. After that you can use its texts_to_sequences() method to convert each text string to a list of word indices. The good thing is that only the words in the vocabulary are considered and all the other words are ignored (alternatively, you can map those words to index 1 by passing an oov_token to the Tokenizer class):

from keras.preprocessing.text import Tokenizer

# set num_words to limit the vocabulary to the most frequent words
tok = Tokenizer(num_words=n_words)

# you can also pass an arbitrary token as `oov_token` argument 
# which will represent out-of-vocabulary words and its index would be 1
# tok = Tokenizer(num_words=n_words, oov_token='[unk]')

tok.fit_on_texts(X_train)

X_train = tok.texts_to_sequences(X_train)
X_test = tok.texts_to_sequences(X_test)  # use the same vocab to convert test data to sequences

You can optionally use the pad_sequences function to pad them with zeros, or truncate them, so that they all have the same length:

from keras.preprocessing.sequence import pad_sequences

X_train = pad_sequences(X_train, maxlen=max_len)
X_test = pad_sequences(X_test, maxlen=max_len)

Now the vocabulary size would be equal to n_words+1 if you have not used an OOV token, or n_words+2 if you have. Then you can pass the correct number to the Embedding layer as its input_dim argument (the first positional argument):

Embedding(correct_num_words, embd_size, ...)
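For example, putting the pieces together in a rough end-to-end sketch (here n_words, max_len, X_train and X_test stand for your own variables; the +2 assumes an OOV token is used, so index 0 is the padding index and index 1 the OOV index):

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.layers import Input, Embedding

tok = Tokenizer(num_words=n_words, oov_token='[unk]')
tok.fit_on_texts(X_train)                  # vocabulary is built on the training texts only

X_tr = pad_sequences(tok.texts_to_sequences(X_train), maxlen=max_len)
X_te = pad_sequences(tok.texts_to_sequences(X_test), maxlen=max_len)   # unseen words map to index 1

inp = Input(shape=(max_len,))
emb = Embedding(input_dim=n_words + 2, output_dim=50, input_length=max_len)(inp)
# ... build the rest of the Bi-LSTM model on top of `emb` as before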
