
Keras [Text multi-classification] - Good accuracy in training and test but bad in prediction

I have been facing several problems when trying to predict topics based on news articles. The news articles have been cleaned (no punctuation, numbers, ...). There are 6 possible classes and I have a dataset of 13000 news articles per class (a uniform distribution over the dataset).

Pre-processing:


import re
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

stop_words = set(stopwords.words('english'))

for index, row in data.iterrows():
    # Keep letters and spaces only, collapse whitespace, and lowercase
    txt_clean = ' '.join(re.sub("([^a-zA-Z ])", " ", data.loc[index, 'txt_clean']).split()).lower()

    word_tokens = word_tokenize(txt_clean)

    # Drop English stop words and rebuild the text
    filtered_sentence = [w for w in word_tokens if w not in stop_words]
    data.loc[index, 'txt_clean'] = ' '.join(filtered_sentence)

I implemented an RNN using LSTM as follows:

from keras.models import Sequential
from keras.layers import Embedding, SpatialDropout1D, LSTM, Dense

model = Sequential()
model.add(Embedding(50000, 100, input_length=500))  # 50000-word vocabulary, 100-dim embeddings, sequences of 500
model.add(SpatialDropout1D(0.2))
model.add(LSTM(150, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(6, activation='softmax'))  # one output per class
model.summary()

model.compile(loss='categorical_crossentropy', optimizer='rmsprop', metrics=['accuracy'])

history = model.fit(X_train, Y_train, epochs=epochs, batch_size=batch_size, validation_split=0.1)

accr = model.evaluate(X_test, Y_test)
print('Test set\n  Loss: {:0.3f}\n  Accuracy: {:0.3f}'.format(accr[0], accr[1]))

Prediction:

import numpy as np
from keras.models import load_model
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

model = load_model('model.h5')
data = data.sample(n=15000)

model.compile(optimizer='rmsprop', loss='categorical_crossentropy', metrics=['accuracy'])
tokenizer = Tokenizer(num_words=50000)
# Fitted on the prediction data sample, not on the same texts as in training
tokenizer.fit_on_texts(data['txt_clean'].values)

CATEGORIES = ['A', 'B', 'C', 'D', 'E', 'F']
for index, row in data.iterrows():
    seq = tokenizer.texts_to_sequences([data.loc[index, 'txt_clean']])
    padded = pad_sequences(seq, maxlen=500)

    pred = model.predict(padded)
    pred = pred[0]
    print(pred, pred[np.argmax(pred)])

For example, after 10 epochs with a batch_size of 500:

  • Training acc: 0.831
  • Training loss: 0.513
  • Test acc: 0.714
  • Test loss: 0.907

I also tried reducing the batch_size to 64:

  • Training acc: 0.859
  • Training loss: 0.415
  • Test acc: 0.771
  • Test loss: 0.679

The results with a batch size of 64 seem better to me, BUT when I predict news articles (one by one) I get an accuracy of 15.97%. This prediction accuracy is much lower than the training and test accuracy.

What could be the problem?

Thanks!

This is a classical problem in ML and DL. There can be a couple of reasons for it:

  1. Overfitting; try making the model deeper or adding some normalization.
  2. The train and test datasets are dissimilar.
  3. Use the same preprocessing steps as used during training.
  4. Class imbalance, i.e. the training dataset contains more data for a particular class that the test dataset lacks.
  5. Try changing the model architecture to a Bidirectional LSTM or a GRU (see the sketch after this list).
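
As a rough sketch of point 5, the model from the question could be made bidirectional by wrapping the existing LSTM layer in Bidirectional; all the sizes (50000, 100, 500, 150, 6) are carried over from the question's code and are assumptions, not tuned values:

from keras.models import Sequential
from keras.layers import Embedding, SpatialDropout1D, LSTM, Dense, Bidirectional

# Same architecture as in the question, with the LSTM wrapped in Bidirectional
model = Sequential()
model.add(Embedding(50000, 100, input_length=500))
model.add(SpatialDropout1D(0.2))
model.add(Bidirectional(LSTM(150, dropout=0.2, recurrent_dropout=0.2)))
model.add(Dense(6, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='rmsprop', metrics=['accuracy'])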

Please try pickling your tokenizer with pickle or joblib, so that the Keras tokenizer is saved and the same one is used for both training and prediction.

Here is sample code to save the Keras tokenizer:

import pickle

# saving
with open('tokenizer.pickle', 'wb') as handle:
    pickle.dump(tokenizer, handle, protocol=pickle.HIGHEST_PROTOCOL)

# loading
with open('tokenizer.pickle', 'rb') as handle:
    tokenizer = pickle.load(handle)
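
At prediction time, the loaded tokenizer replaces the one refitted in the question, so the word index matches the one seen during training. A minimal sketch, assuming article_text holds one cleaned article (an illustrative name) and maxlen=500 as in the question:

from keras.preprocessing.sequence import pad_sequences

# Reuse the tokenizer fitted during training instead of fitting a new one
seq = tokenizer.texts_to_sequences([article_text])  # article_text: a cleaned article (illustrative)
padded = pad_sequences(seq, maxlen=500)             # same maxlen as used for training
pred = model.predict(padded)[0]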
