LSTM word prediction model predicts only the most frequent word, or which loss to use for imbalanced data

Question

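I decided to try building a word prediction model with a recurrent neural network. There are many examples online, including in online courses, and they make building such a model sound easy. Most of them use an LSTM. Also, most of them, if not all, use a very small dataset. I decided to try a larger one, the 20 newsgroups dataset (from sklearn.datasets import fetch_20newsgroups). I did some minimal preprocessing: removing punctuation, stop words, and numbers.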

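I am predicting a word from the history of the 10 preceding words. I only use posts with at least 11 words. For each post, I build the training set by sliding a window of size 11 along the post. At each position, the first 10 words are the predictors and the 11th word is the target. I put together a simple model: an Embedding layer, an LSTM layer, and an output Dense layer. Here is the code: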

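import numpy as np
from keras.preprocessing.text import Tokenizer
from keras.utils import to_categorical
from keras.layers import Input, Embedding, LSTM, Dense
from keras.models import Model
from keras.callbacks import EarlyStopping

def make_prediction_sequences(input_texts, max_nb_words, sequence_length = 10):
    # input_texts is a list of strings/texts

    # keep only the most frequent words, based on the word counts;
    # every other word is mapped to the 'UNK' token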
    tokenizer = Tokenizer(oov_token='UNK',num_words=max_nb_words)
    tokenizer.fit_on_texts(input_texts)
    sequences = tokenizer.texts_to_sequences(input_texts)

    prediction_sequences = []
    for sequence in sequences:
        if len(sequence) > sequence_length: # at least 1 for prediction
            for j in range(0,len(sequence) - sequence_length):
                prediction_sequences.append(sequence[j:sequence_length+j+1])

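    # word_index is the word -> id dictionary for the kept words, shifted by one because
    # the tokenizer is 1-indexed. Note that `sequences` above keeps the unshifted ids,
    # so a predicted class k corresponds to the word whose word_index value is k - 1.
    word_index = {e:i-1 for e,i in tokenizer.word_index.items()  if i <= max_nb_words}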


    return (np.array(prediction_sequences) , word_index)

def batch_sequence_data(prediction_sequences, batch_size, sequence_length, vocab_size):
    number_batches = int(len(prediction_sequences)/batch_size)
    while True:
        for i in range(number_batches):
            X = prediction_sequences[i*batch_size:(i+1)*batch_size, 0:sequence_length]
            Y = to_categorical(prediction_sequences[i*batch_size:(i+1)*batch_size, sequence_length], num_classes=vocab_size)
            yield np.array(X),Y

VOCAB_SIZE = 15000
SEQUENCE_LENGTH = 10
BATCH_SIZE = 128
prediction_sequences, word_index = make_prediction_sequences(data, VOCAB_SIZE, sequence_length=SEQUENCE_LENGTH)

## define the model
EMBEDDING_DIM = 64
rnn_size = 32

sequence_input = Input(shape=(SEQUENCE_LENGTH,), dtype='int32', name='rnn_input')
embedding_layer = Embedding(len(word_index), EMBEDDING_DIM, input_length=SEQUENCE_LENGTH)
embedded_sequences = embedding_layer(sequence_input)
x = LSTM(rnn_size, use_bias=True)(embedded_sequences)
preds = Dense(VOCAB_SIZE, activation='softmax')(x)
model = Model(sequence_input, preds)
model.compile(loss='categorical_crossentropy',optimizer='adam',metrics=['categorical_accuracy'])

#train the model
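# whole number of batches per epoch; dividing by SEQUENCE_LENGTH as well
# means each epoch covers only about a tenth of the windows
steps_per_epoch = len(prediction_sequences) // (BATCH_SIZE * SEQUENCE_LENGTH)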
earlystop = EarlyStopping(patience=3, restore_best_weights=True,monitor='loss')
history = model.fit_generator(batch_sequence_data(prediction_sequences, BATCH_SIZE, SEQUENCE_LENGTH, VOCAB_SIZE), 
                    steps_per_epoch = steps_per_epoch, epochs=30, callbacks=[earlystop])

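Training reaches an accuracy of ~0.1. When I apply the model to predict the next word for 10-word pieces taken from the training data, the output is overwhelmingly the most frequent word, "one".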

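I tried a more complex model with 2 LSTM layers and 2 Dense layers. I tried using pre-trained word embeddings from a gensim word2vec model. The accuracy is always ~0.1, and in most cases the prediction is "one".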
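For reference, a minimal sketch of how pre-trained vectors are typically copied into an embedding matrix. The function name `build_embedding_matrix`, the toy vectors, and the row ids are all hypothetical; `vectors` stands in for any mapping that supports `word in vectors` and `vectors[word]`, and this is not necessarily how the embeddings were wired in here.

```python
import numpy as np

def build_embedding_matrix(word_to_row, vectors, dim, seed=0):
    """Return a matrix whose row r holds the vector of the word mapped to r.

    word_to_row: dict word -> integer row id (hypothetical name)
    vectors: mapping supporting `word in vectors` and `vectors[word]`
    """
    rng = np.random.default_rng(seed)
    num_rows = max(word_to_row.values()) + 1
    # Words missing from the pre-trained vectors keep a small random vector.
    matrix = rng.normal(scale=0.1, size=(num_rows, dim)).astype("float32")
    for word, row in word_to_row.items():
        if word in vectors:
            matrix[row] = vectors[word]
    return matrix

# Toy stand-in for pre-trained vectors (assumption, for illustration only).
toy_vectors = {"fight": np.ones(4), "week": np.full(4, 2.0)}
toy_rows = {"UNK": 1, "fight": 2, "week": 3}
embedding_matrix = build_embedding_matrix(toy_rows, toy_vectors, dim=4)
print(embedding_matrix.shape)  # (4, 4)
```

The row ids have to match the token ids the model actually receives, otherwise each word is paired with another word's vector.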

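When I think about it, this kind of makes sense. Predicting the most frequent class for imbalanced data gives high accuracy "for free". It is clearly a local minimum, but one that is hard to escape. The thing is, the algorithm does not optimize accuracy; it minimizes the loss, which is categorical_crossentropy, and that should work fine for imbalanced data. But maybe that is not always the case: is there a different loss that handles imbalanced data better?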
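One way to check whether an accuracy of ~0.1 is just the majority-class baseline is to compute that baseline from the target words directly. The sketch below uses made-up targets; `majority_baseline` is a hypothetical helper, and the cross-entropy it reports is that of a model that ignores its input and always outputs the empirical word frequencies.

```python
from collections import Counter
import math

def majority_baseline(targets):
    """Return (most frequent target, its accuracy, unigram cross-entropy in nats)."""
    counts = Counter(targets)
    total = len(targets)
    top_word, top_count = counts.most_common(1)[0]
    accuracy = top_count / total
    # Loss of always predicting the empirical frequency of each word.
    unigram_ce = -sum(c / total * math.log(c / total) for c in counts.values())
    return top_word, accuracy, unigram_ce

# Toy targets (assumption): "one" makes up half of them.
toy_targets = ["one"] * 10 + ["people"] * 5 + ["attorney"] * 5
word, acc, ce = majority_baseline(toy_targets)
print(word, acc)  # one 0.5
```

If the trained model's accuracy and loss sit close to these two numbers, it has learned little beyond the word frequencies.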

Answer 1

After looking around, I found a research paper introducing focal loss and, conveniently, a GitHub implementation of it in Keras.
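To illustrate the idea, here is a NumPy sketch of the usual multi-class focal loss, -(1 - p_t)^gamma * log(p_t), where p_t is the probability assigned to the true class. It is not the GitHub implementation mentioned above; the function name is hypothetical and the optional per-class alpha weighting is left out.

```python
import numpy as np

def categorical_focal_loss(y_true, y_pred, gamma=2.0, eps=1e-7):
    """Mean focal loss for one-hot y_true and softmax outputs y_pred, shape (batch, classes)."""
    y_pred = np.clip(y_pred, eps, 1.0 - eps)
    # Probability the model assigns to the true class of each example.
    p_t = np.sum(y_true * y_pred, axis=-1)
    # (1 - p_t)^gamma shrinks the loss of examples that are already well classified.
    return float(np.mean(-((1.0 - p_t) ** gamma) * np.log(p_t)))

y_true = np.array([[1, 0, 0], [0, 1, 0]], dtype=float)
y_pred = np.array([[0.9, 0.05, 0.05],   # easy example, true class gets 0.9
                   [0.6, 0.3, 0.1]])    # hard example, true class gets 0.3
ce = categorical_focal_loss(y_true, y_pred, gamma=0.0)  # gamma = 0 is plain cross-entropy
fl = categorical_focal_loss(y_true, y_pred, gamma=5.0)
```

With gamma = 0 this reduces to categorical cross-entropy. Larger gamma values push the gradient toward examples the model still gets wrong, which here tend to be the rarer words.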

Combining it with @meowongac's suggestion (I used the Google word2vec embeddings) results in much better sampling of less frequent words.

I also separately tried class_weight:

model.fit_generator(batch_sequence_data(prediction_sequences, 
                    BATCH_SIZE, SEQUENCE_LENGTH, VOCAB_SIZE), 
                    steps_per_epoch = steps_per_epoch, epochs=30, callbacks=[earlystop],
                    class_weight = class_weight)

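I set it to be inversely proportional to word frequency. Again, combined with the Google word embeddings, it worked even better, in the sense of producing less frequent words.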
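The construction of class_weight is not shown above, so the following is only one plausible sketch. `inverse_frequency_class_weight` is a hypothetical helper, and the scaling (total / (number of seen classes * count)) is an assumption chosen so that a perfectly balanced set would get weight 1.0 everywhere.

```python
from collections import Counter

def inverse_frequency_class_weight(target_ids, vocab_size):
    """Map each class id to a weight inversely proportional to its target count."""
    counts = Counter(int(t) for t in target_ids)
    total = sum(counts.values())
    n_seen = len(counts)
    class_weight = {c: total / (n_seen * n) for c, n in counts.items()}
    # Classes that never occur as a target get a neutral weight.
    for c in range(vocab_size):
        class_weight.setdefault(c, 1.0)
    return class_weight

# Toy target ids (assumption): class 1 is frequent, class 3 is rare.
toy_targets = [1] * 6 + [2] * 3 + [3] * 1
class_weight = inverse_frequency_class_weight(toy_targets, vocab_size=5)
```

With the code from the question, the targets would be the last column of the windows, prediction_sequences[:, SEQUENCE_LENGTH].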

For example, for the 10-word sequence:

['two', 'three', 'marines', 'sort', 'charges', 'pending', 'another', 'fight', 'week', 'interesting']

the focal loss approach with gamma = 5 predicted the next word people, and the class_weight approach predicted attorney.

Solution 1
0 2019-07-25 18:29:24
