I have trained an SVM and a neural network (NN) model using sklearn for two classes. One class has 24000 tweets and the other has 32000 tweets.
When I validate the models, I get the following results.
For the NN model:
text_clf = Pipeline([
    ('vect', CountVectorizer(stop_words='english')),
    ('tfidf', TfidfTransformer(use_idf=True)),
    ('clf', MLPClassifier(activation="relu", solver='adam', alpha=0.001,
                          hidden_layer_sizes=(5, 2), random_state=1)),
])
              precision    recall  f1-score   support
    disaster       1.00      1.00      1.00     12862
 nondisaster       1.00      1.00      1.00      9543
   micro avg       1.00      1.00      1.00     22405
   macro avg       1.00      1.00      1.00     22405
weighted avg       1.00      1.00      1.00     22405
For the SVM model (SGDClassifier with hinge loss):
text_clf = Pipeline([
    ('vect', CountVectorizer(stop_words='english')),
    ('tfidf', TfidfTransformer(use_idf=True)),
    ('clf', SGDClassifier(loss='hinge', penalty='l2', alpha=1e-3,
                          random_state=42, verbose=1)),
])
text_clf.fit(X_train, y_train)
              precision    recall  f1-score   support
    disaster       1.00      1.00      1.00      6360
 nondisaster       1.00      1.00      1.00      4842
   micro avg       1.00      1.00      1.00     11202
   macro avg       1.00      1.00      1.00     11202
weighted avg       1.00      1.00      1.00     11202
When I change the alpha value in the NN model from 0.001 to 0.00001:
              precision    recall  f1-score   support
    disaster       1.00      0.99      0.99     12739
 nondisaster       0.98      1.00      0.99      9666
   micro avg       0.99      0.99      0.99     22405
   macro avg       0.99      0.99      0.99     22405
weighted avg       0.99      0.99      0.99     22405
When I test a few records by hand, the prediction is always biased toward one class. For example, the SVM predicts every input as nondisaster, and the NN predicts every input as disaster.
Any ideas or suggestions on how I can fine-tune these models?
As far as I have seen, this happens when the dataset is biased. I believe in the principle of garbage in, garbage out.
It would be good to visualize your train and test data. I suspect you will find it is biased.
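A quick way to look at the data before plotting anything is to count the labels in each split. The snippet below is only a rough sketch using the standard library. The function names and the toy tweets are made up for illustration, and the duplicate check is an extra assumption on my part: if many test tweets also appear in the training set (retweets, for example), that could be one reason validation scores look near-perfect while hand-picked inputs go wrong.

```python
from collections import Counter


def class_distribution(labels):
    """Return {label: (count, fraction of total)} for an iterable of labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: (n, n / total) for label, n in counts.items()}


def _normalize(text):
    # Lowercase and collapse whitespace so trivially different copies match.
    return " ".join(text.lower().split())


def train_test_overlap(train_texts, test_texts):
    """Return the fraction of test texts that also occur in the training texts."""
    if not test_texts:
        return 0.0
    seen = {_normalize(t) for t in train_texts}
    duplicates = sum(1 for t in test_texts if _normalize(t) in seen)
    return duplicates / len(test_texts)


if __name__ == "__main__":
    # Hypothetical toy data; substitute your own X_train, y_train, X_test, y_test.
    X_train = ["Flood warning issued downtown", "Lovely weather today"]
    y_train = ["disaster", "nondisaster"]
    X_test = ["flood  warning issued DOWNTOWN", "Coffee with friends"]
    y_test = ["disaster", "nondisaster"]

    print("train:", class_distribution(y_train))
    print("test: ", class_distribution(y_test))
    print("test tweets also seen in train:", train_test_overlap(X_train, X_test))
```

If the label fractions differ a lot between the splits, or a large share of test tweets also occur in training, the validation report is probably not telling you much about new tweets.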
That said, assuming your use case is disaster prediction from tweets, this is understandable: in a random set of tweets, probably not even 1 in 1000 is going to be about a disaster.
Hence, it would be wise to narrow your query to a more specific topic and set of users so that you get a good enough dataset.
Thoughts?
Thanks Arun