
How many documents to train on for Naive Bayes?

I just created my own Naive Bayes model from scratch and trained it on 776 documents. I then tried classifying three test documents, and it classified all three of them wrong. For two of the three, the correct category even had the lowest probability of all the categories.

Should I increase the number of training documents? I don't think it's my code, because I checked the computation, but I don't know, maybe the compute_numerators function is wrong somehow? For the numerator I used logs because of the underflow problem, and summed the log probabilities of the terms together with the log of (number_of_documents_in_category / overall_number_of_documents): http://i.stack.imgur.com/GIwIp.png
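For reference, here is a minimal sketch of what I mean by that log-space score, written for a multinomial bag-of-words model with Laplace smoothing; the function and variable names here are illustrative placeholders, not my actual compute_numerators code:

```python
import math

def compute_log_score(doc_tokens, category, word_counts, category_doc_counts,
                      total_docs, vocab_size, alpha=1.0):
    """Log-space 'numerator' for one category: log prior + sum of log likelihoods.

    word_counts[category] maps each token to its frequency in that category's
    training documents; category_doc_counts[category] is the number of training
    documents in that category. (Placeholder structures, not my real code.)
    """
    # log P(category) = log(documents in category / total documents)
    log_score = math.log(category_doc_counts[category] / total_docs)

    total_tokens_in_category = sum(word_counts[category].values())
    for token in doc_tokens:
        count = word_counts[category].get(token, 0)
        # Laplace (add-alpha) smoothing so unseen tokens don't give log(0)
        log_score += math.log((count + alpha) /
                              (total_tokens_in_category + alpha * vocab_size))
    return log_score

# The predicted category is the one with the highest log score:
# predicted = max(categories, key=lambda c: compute_log_score(tokens, c, ...))
```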

I'm super confused and discouraged, since this took me so long and now I feel like it was for nothing because it didn't even classify ONE document correctly :(

@Bob Dillon, hi, thank you for your thorough reply. My biggest question is what you mean by separable. Do you mean whether there is a clear distinction between the documents of different classes? I don't really know how to answer that. The data was classified by humans, so the separation is possible, but maybe some categories are so close to others that the boundary gets blurred? Maybe the computer doesn't recognize a difference in the words used in one category versus another? I have to keep those categories; I cannot rearrange them, they must stay as they are. I'm not sure how to prototype in R; wouldn't I still need to take in the text data and run it? Wouldn't I still need to do the tokenization etc.? I'm going to look into information gain and SVM. I will probably post back. Thanks!

I just created my own Naive Bayes model from scratch and trained it on 776 documents

Naive Bayes, like its name says, is a naive algorithm. It performs poorly compared to modern methods like support vector machines or (deep) neural networks. Keep this in mind when using it: expect better results than tossing a coin would give you, but not by very much.

tried classifying the documents but it classified the documents wrong on all three of the test documents

Only three test documents? That is far too few, and tells you nothing. Whatever your total number of documents, you should use at least 20% of them for testing. Also consider using cross-validation.

Should I increase the number of training documents?

This will help, yes. A golden rule of thumb in machine learning is that more data will usually beat a better algorithm. Of course, we can't always get more data, or we can't afford the processing power to use more data, so better algorithms are important.

To be able to see an improvement though, you'll need to use more testing data as well.

In conclusion: test on more data. If you have 776 documents, use at least 100 for testing or do cross-validation. If you get above 50-60% accuracy, be happy; that's good enough for this amount of data and for Naive Bayes.
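As a rough sketch of what that check could look like, assuming Python and scikit-learn (your own from-scratch model could be evaluated the same way), here is 5-fold cross-validation of a multinomial Naive Bayes baseline, with the 20 newsgroups corpus standing in for your 776 documents:

```python
# 5-fold cross-validation of a bag-of-words Naive Bayes baseline (scikit-learn).
# The 20 newsgroups corpus is only a stand-in for your own documents/labels.
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

data = fetch_20newsgroups(subset="train",
                          categories=["sci.space", "rec.autos", "talk.politics.misc"])

model = make_pipeline(CountVectorizer(stop_words="english"), MultinomialNB())
scores = cross_val_score(model, data.data, data.target, cv=5)

print("Per-fold accuracy:", scores)
print("Mean accuracy:", scores.mean())
```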

You have a lot working against you.

  1. Weak dimensionality reduction - stop word filtering only
  2. Multi-class classification
  3. Weak classifier
  4. Little training data

You're showing us the code that you're using, but if the data is not separable, then no algorithm will sort it out. Are you sure that the data can be classified? If so, what performance do you expect?

You should try prototyping your system before jumping to implementation. Octave, R or MATLAB is a good place to start. Make sure your data is separable and the algorithm is effective on your data. Others have suggested using SVMs and neural nets rather than Naive Bayes classification; that's a good suggestion, although each takes a bit of tweaking to get the best performance. I've used the Google Prediction API as a first-order check of the performance I can expect from a system, and then replaced it with an SVM or another classifier to optimize performance and reduce cost/latency/etc. It's good to get a baseline as quickly and easily as possible before diving too deep.
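If you'd rather prototype in Python than Octave/R/MATLAB, a quick baseline comparison could look roughly like the sketch below; `documents` and `labels` are placeholders for your own corpus, and the particular vectorizer and classifiers are just one reasonable choice, not a prescription:

```python
# Compare a Naive Bayes baseline against a linear SVM on the same features
# before committing to a from-scratch implementation.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

documents = [...]  # placeholder: your raw document strings
labels = [...]     # placeholder: the human-assigned category of each document

for name, clf in [("Naive Bayes", MultinomialNB()), ("Linear SVM", LinearSVC())]:
    model = make_pipeline(TfidfVectorizer(stop_words="english"), clf)
    scores = cross_val_score(model, documents, labels, cv=5)
    print(f"{name}: mean accuracy {scores.mean():.2f}")
```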

If the data is separable, the more help you give the system the better it will perform. Feature/dimensionality reduction removes noise and helps the classifier perform well. There is statistical analysis you can do to reduce the feature set. I like Information Gain, but there are others.

I found this paper to be a good theoretical treatment of text classification, including feature reduction.

I've been successful using Information Gain for feature reduction and found this paper to be a very good practical guide.
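As one possible illustration of the idea (not taken from either paper), here is a rough sketch of information-gain-style term selection using scikit-learn's mutual information scorer; `documents` and `labels` are again placeholders for your own corpus:

```python
# Rank terms by estimated mutual information (information gain) with the class
# label and keep only the top-k terms as features.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, mutual_info_classif

documents = [...]  # placeholder: raw document strings
labels = [...]     # placeholder: category of each document

vectorizer = CountVectorizer(stop_words="english", binary=True)
X = vectorizer.fit_transform(documents)

# Keep the 1,000 most informative terms (adjust k to your vocabulary size).
selector = SelectKBest(mutual_info_classif, k=1000)
X_reduced = selector.fit_transform(X, labels)

kept_terms = vectorizer.get_feature_names_out()[selector.get_support()]
print(kept_terms[:20])
```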

As for the amount of training data, that is not so clear cut. More is typically better, but the quality of the data is very important too. If the data is not easily separable, or its underlying probability distribution is not similar to that of your test and real-world data, then performance will be poor even with more data. Put another way, the quantity of training data matters, but quality matters at least as much.

Good luck!
