
SVM and NN models overfitting on large data

I have trained an SVM model and an NN model using sklearn for a two-class problem. One class has 24,000 tweets and the other 32,000.

When I run validation, I get results like the following.

For the NN model:

text_clf = Pipeline([
    ('vect', CountVectorizer(stop_words='english')),
    ('tfidf', TfidfTransformer(use_idf=True)),
    ('clf', MLPClassifier(activation='relu', solver='adam', alpha=0.001,
                          hidden_layer_sizes=(5, 2), random_state=1)),
])

              precision    recall  f1-score   support

    disaster       1.00      1.00      1.00     12862
 nondisaster       1.00      1.00      1.00      9543

   micro avg       1.00      1.00      1.00     22405
   macro avg       1.00      1.00      1.00     22405
weighted avg       1.00      1.00      1.00     22405

For the SVM model:

text_clf = Pipeline([
    ('vect', CountVectorizer(stop_words='english')),
    ('tfidf', TfidfTransformer(use_idf=True)),
    ('clf', SGDClassifier(loss='hinge', penalty='l2', alpha=1e-3,
                          random_state=42, verbose=1)),
])
text_clf.fit(X_train, y_train)

              precision    recall  f1-score   support

    disaster       1.00      1.00      1.00      6360
 nondisaster       1.00      1.00      1.00      4842

   micro avg       1.00      1.00      1.00     11202
   macro avg       1.00      1.00      1.00     11202
weighted avg       1.00      1.00      1.00     11202

When I change the alpha value in the NN model from 0.001 to 0.00001 (a sweep over several alpha values is sketched after the report below):

              precision    recall  f1-score   support

    disaster       1.00      0.99      0.99     12739
 nondisaster       0.98      1.00      0.99      9666

   micro avg       0.99      0.99      0.99     22405
   macro avg       0.99      0.99      0.99     22405
weighted avg       0.99      0.99      0.99     22405
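
To choose alpha less by hand, several candidate values can be swept against a held-out split and compared on macro F1. This is a minimal sketch, assuming X_train, y_train, X_test, y_test are the same splits used above; the value grid is only illustrative.

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import f1_score

# Illustrative grid of regularization strengths.
for alpha in [1e-5, 1e-4, 1e-3, 1e-2]:
    clf = Pipeline([
        ('vect', CountVectorizer(stop_words='english')),
        ('tfidf', TfidfTransformer(use_idf=True)),
        ('clf', MLPClassifier(activation='relu', solver='adam', alpha=alpha,
                              hidden_layer_sizes=(5, 2), random_state=1)),
    ])
    clf.fit(X_train, y_train)
    # Macro F1 on the held-out split for each candidate alpha.
    print(alpha, f1_score(y_test, clf.predict(X_test), average='macro'))

Larger alpha means stronger L2 regularization in MLPClassifier, so the sweep shows directly how sensitive the near-perfect scores are to the regularization strength.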

When I test a few records, the predictions are always biased toward one class. For example, the SVM predicted every input as non-disaster, while the NN predicted everything as disaster.
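
One quick way to confirm this collapse is to compare the distribution of predicted labels against the true labels on the held-out set. A minimal sketch, assuming text_clf is one of the fitted pipelines above and X_test, y_test form the held-out split:

from collections import Counter

preds = text_clf.predict(X_test)
print(Counter(preds))   # predicted label counts
print(Counter(y_test))  # true label counts, for comparison

If the predicted counter contains only one key while the true counter contains both classes, the model has collapsed to a single class.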

Any ideas or suggestions on how I can fine-tune these models?

As far as I have seen, this happens when the dataset is biased. I believe in the concept of garbage in, garbage out.

It would be good for you to visualize your train and test data; I suspect it is biased.
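
For example, here is a quick check of the class balance in each split (a sketch, assuming y_train and y_test hold the labels):

import pandas as pd

# Per-class proportions in each split.
print(pd.Series(y_train).value_counts(normalize=True))
print(pd.Series(y_test).value_counts(normalize=True))

If the split itself is skewed, passing stratify=y to sklearn's train_test_split keeps the class ratios consistent across train and test.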

Having said that, assuming your use case is disaster prediction from tweets, it is understandable that if you take a random set of tweets, not even 1 in 1,000 will be about a disaster.

Hence, it would be wise to scope your query down to a refined topic and set of users so that you get a good enough dataset.

Thoughts?

Thanks, Arun
