Test data gives a prediction error in Keras for a model with an Embedding layer

Question

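I trained a Bi-LSTM model to perform NER on a set of sentences. To do this, I took the distinct words that occur, built a mapping between words and integers, and used those integers to build the Bi-LSTM model. I then created and pickled the model object.

Now I have a new set of sentences containing some words that the trained model has never seen, so these words have no integer value yet. When I test them on my existing model, I get an error: the words or features cannot be found because no integer values exist for them.

To avoid this error, I assigned a new integer value to every new word I encountered.

However, when I load the model and test it, I get the following error: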

InvalidArgumentError: indices[0,24] = 5444 is not in [0, 5442)   [[Node: embedding_14_16/Gather = Gather[Tindices=DT_INT32, Tparams=DT_FLOAT, validate_indices=true,
_device="/job:localhost/replica:0/task:0/device:CPU:0"](embedding_14_16/embeddings/read, embedding_14_16/Cast)]]

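The training data contains 5445 words, including the padding word, so the indices span [0, 5444].

5444 is the index value I assigned to the padding in the test sentences. It is not clear to me why the model assumes the index values lie in [0, 5442).

I used the base code provided at the following link: https://www.kaggle.com/gagandeep16/ner-using-bidirectional-lstm

Code: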

input = Input(shape=(max_len,))
model = Embedding(input_dim=n_words, output_dim=50
                  , input_length=max_len)(input)

model = Dropout(0.1)(model)
model = Bidirectional(LSTM(units=100, return_sequences=True, recurrent_dropout=0.1))(model)
out = TimeDistributed(Dense(n_tags, activation="softmax"))(model)  # softmax output layer

model = Model(input, out)
model.compile(optimizer="rmsprop", loss="categorical_crossentropy", metrics=["accuracy"])

#number of  epochs - Also for output file naming
epoch_num=20
domain="../data/Laptop_Prediction_Corrected"
output_file_name=domain+"_E"+str(epoch_num)+".xlsx"

model_name="../models/Laptop_Prediction_Corrected"
output_model_filename=model_name+"_E"+str(epoch_num)+".sav"


history = model.fit(X_tr, np.array(y_tr), batch_size=32, epochs=epoch_num, validation_split=0.1, verbose=1)

max_len is the total number of words in a sentence, and n_words is the vocabulary size. In the model, padding is done with the following code, where n_words=5441:

X = pad_sequences(maxlen=max_len, sequences=X, padding="post", value=n_words)

Padding in the new dataset:

max_len = 50
# this is to pad sentences to the maximum length possible
#-> so all records of X will be of the same length

#X = pad_sequences(maxlen=max_len, sequences=X, padding="post", value=res_new_word2idx["pad_blank"])

#X = pad_sequences(maxlen=max_len, sequences=X, padding="post", value=5441)

I am not sure which of these padding values is correct.

However, the vocabulary only contains words from the training data. When I call:

p = loaded_model.predict(X)

how can I use predict on sentences that contain words that are not in the initial vocabulary?

Answer 1

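You can use the Keras Tokenizer class and its methods to easily tokenize and preprocess the input data. Specify the vocabulary size when instantiating it, then call its fit_on_texts() method on the training data to build a vocabulary from the given texts. After that, you can use its texts_to_sequences() method to convert each text string to a list of word indices. The good thing is that only words in the vocabulary are considered and all other words are ignored (alternatively, you can map those words to a single index by passing the oov_token argument to the Tokenizer class):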

from keras.preprocessing.text import Tokenizer

# set num_words to limit the vocabulary to the most frequent words
tok = Tokenizer(num_words=n_words)

# you can also pass an arbitrary token as `oov_token` argument 
# which will represent out-of-vocabulary words and its index would be 1
# tok = Tokenizer(num_words=n_words, oov_token='[unk]')

tok.fit_on_texts(X_train)

X_train = tok.texts_to_sequences(X_train)
X_test = tok.texts_to_sequences(X_test)  # use the same vocab to convert test data to sequences
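
To make the out-of-vocabulary handling concrete, here is a minimal plain-Python sketch of the same idea, independent of Keras. The vocabulary is built from the training sentences only, and any unseen word is mapped to one reserved index instead of receiving a fresh integer. The names build_vocab, encode, PAD_IDX and UNK_IDX are hypothetical and chosen for illustration; reserving 0 for padding and 1 for unknown words is an assumption, not something the Tokenizer requires.

```python
from collections import Counter

PAD_IDX = 0  # assumption: index 0 is reserved for padding
UNK_IDX = 1  # assumption: index 1 is reserved for out-of-vocabulary words


def build_vocab(sentences, max_words=None):
    """Map each training word to an integer, most frequent words first."""
    counts = Counter(word for sent in sentences for word in sent)
    most_common = counts.most_common(max_words)
    # Real words start at 2 because 0 and 1 are reserved above.
    return {word: i + 2 for i, (word, _) in enumerate(most_common)}


def encode(sentences, word2idx):
    """Convert tokenized sentences to index lists; unseen words become UNK_IDX."""
    return [[word2idx.get(word, UNK_IDX) for word in sent] for sent in sentences]


# Toy data standing in for the real training and test sentences.
train = [["the", "laptop", "is", "fast"], ["the", "screen", "is", "bright"]]
test = [["the", "keyboard", "is", "fast"]]

word2idx = build_vocab(train)
vocab_size = len(word2idx) + 2  # real words + padding + unknown
X_test = encode(test, word2idx)
```

With this scheme every index produced at test time stays below vocab_size, so a model trained with that size as its embedding input dimension never sees an index it has no row for.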

Optionally, you can use the pad_sequences function to pad them with zeros or truncate them so that they all have the same length:

from keras.preprocessing.sequence import pad_sequences

X_train = pad_sequences(X_train, maxlen=max_len)
X_test = pad_sequences(X_test, maxlen=max_len)
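
As a rough illustration of what padding and truncating do to a single sequence, here is a plain-Python sketch. It pads and truncates at the front with the value 0, which is my understanding of the pad_sequences defaults; the question's code pads at the end with padding="post" instead. The helper name pad_or_truncate is hypothetical. Whichever side you pad on, the pad value must be the same at training and prediction time and must be a valid embedding index.

```python
def pad_or_truncate(seq, max_len, pad_value=0):
    """Return a list of exactly max_len items.

    Longer sequences keep their last max_len items; shorter ones are
    left-padded with pad_value.
    """
    seq = list(seq)
    if len(seq) > max_len:
        seq = seq[len(seq) - max_len:]  # drop items from the front
    return [pad_value] * (max_len - len(seq)) + seq


short = pad_or_truncate([5, 6, 7], max_len=5)
long = pad_or_truncate([1, 2, 3, 4, 5, 6], max_len=4)
```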

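Now your vocabulary size is n_words+1 if you have not used an OOV token, or n_words+2 if you have. You can then pass the correct number to the Embedding layer as its input_dim argument (the first positional argument):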

Embedding(correct_num_words, embd_size, ...)
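
The error in the question, indices[0,24] = 5444 is not in [0, 5442), suggests that the embedding lookup accepts only indices in the half-open range [0, input_dim), and that the layer there was built with 5442 rows. A small check like the sketch below, run on the encoded and padded data before calling predict, can locate offending indices early. The function name find_out_of_range and the toy values are hypothetical.

```python
def find_out_of_range(sequences, input_dim):
    """Return (row, column, index) for every index outside [0, input_dim)."""
    return [
        (row, col, idx)
        for row, seq in enumerate(sequences)
        for col, idx in enumerate(seq)
        if not 0 <= idx < input_dim
    ]


# Toy example: an embedding with 5442 rows accepts indices 0..5441 only.
input_dim = 5442
X = [[3, 17, 5441], [8, 5444, 0]]
bad = find_out_of_range(X, input_dim)
```

An empty result means every index has a matching embedding row; any entry in the list points at a word or pad value that was assigned an index the trained model has no row for.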

Solution 1: 1, accepted 2018-11-12 20:39:39
