
Document classification using LSA/SVD

I am trying to do document classification using Support Vector Machines (SVM). The documents I have are a collection of emails. I have around 3000 documents to train the SVM classifier and a test set of around 700 documents that I need to classify.

I initially used a binary DocumentTermMatrix as the input for SVM training and got around 81% classification accuracy on the test data. The DocumentTermMatrix was built after removing several stopwords.

Since I wanted to improve the accuracy of this model, I tried LSA/SVD-based dimensionality reduction and used the resulting reduced factors as input to the classification model (I tried 20, 50, 100, and 200 singular values from the original bag of ~3000 words). The classification performance worsened in each case. (Another reason for using LSA/SVD was to overcome memory issues with one of the response variables, which had 65 levels.)

Can someone provide some pointers on how to improve the performance of LSA/SVD classification? I realize this is a general question without any specific data or code, but I would appreciate some input from the experts on where to start debugging.

FYI, I am using R for the text preprocessing (packages: tm, snowball, lsa) and for building the classification models (package: kernelsvm).
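
Roughly, the pipeline looks like this (a simplified sketch: train_texts / train_labels are placeholders for my actual data, and the SVM call is shown with kernlab::ksvm rather than my exact setup):

    library(tm)
    library(lsa)
    library(kernlab)

    # train_texts: character vector of emails; train_labels: factor of classes
    corpus <- VCorpus(VectorSource(train_texts))
    corpus <- tm_map(corpus, content_transformer(tolower))
    corpus <- tm_map(corpus, removeWords, stopwords("english"))

    # Binary document-term matrix (each term scored 0/1 per document)
    dtm <- DocumentTermMatrix(corpus, control = list(weighting = weightBin))

    # LSA/SVD: lsa() expects a term-document matrix, hence the transpose
    lsa_space <- lsa(t(as.matrix(dtm)), dims = 100)   # also tried 20, 50, 200

    # Document coordinates in the reduced space (documents x dimensions)
    train_features <- as.matrix(lsa_space$dk) %*% diag(lsa_space$sk)

    # SVM on the reduced features
    model <- ksvm(train_features, train_labels)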

Thank you.

This might not be the best-tailored answer, but I hope these suggestions help.

Maybe you could use lemmatization instead of stemming to reduce unacceptable outcomes. A short and dense reference: http://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html

The goal of both stemming and lemmatization is to reduce inflectional forms and sometimes derivationally related forms of a word to a common base form.

However, the two words differ in their flavor. Stemming usually refers to a crude heuristic process that chops off the ends of words in the hope of achieving this goal correctly most of the time, and often includes the removal of derivational affixes. Lemmatization usually refers to doing things properly with the use of a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the lemma.

One instance:

go, goes, going -> Lemma: go, go, go || Stemming: go, goe, go
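
In R, a quick way to see this difference is SnowballC::wordStem for stemming versus the textstem package for lemmatization (a sketch; neither package is part of the original tool list):

    library(SnowballC)   # stemming
    library(textstem)    # lemmatization

    words <- c("go", "goes", "going")

    wordStem(words, language = "english")
    # roughly: "go"  "goe"  "go"

    lemmatize_words(words)
    # roughly: "go"  "go"  "go"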

Also use a predefined set of rules so that contracted short forms are expanded. For instance:

I'm -> I am
shouldn't -> should not
can't -> can not
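
A minimal way to apply such rules in R before building the term matrix is a small lookup of patterns fed to gsub (a sketch; expand_contractions is a hypothetical helper and the rule list is far from complete):

    # Hypothetical helper: expand a few common contractions before tokenization
    expand_contractions <- function(text) {
      rules <- c(
        "i'm"       = "i am",
        "shouldn't" = "should not",
        "can't"     = "can not",
        "won't"     = "will not",
        "n't"       = " not"     # generic fallback for remaining negations
      )
      for (pattern in names(rules)) {
        text <- gsub(pattern, rules[[pattern]], text, ignore.case = TRUE)
      }
      text
    }

    expand_contractions("I'm sure it can't be that hard")
    # "i am sure it can not be that hard"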

Also consider how to deal with parentheses inside a sentence, e.g.:

This is a dog (its name is doggy)

Text inside parentheses often refers to alias names of the entities mentioned. You can either remove it or do coreference analysis and treat it as a new sentence.
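
Stripping parenthesized text is a one-line regular expression in R (a sketch; the coreference alternative needs a dedicated NLP tool and is not shown):

    sentence <- "This is a dog (its name is doggy)"

    # Drop any "(...)" span together with the space before it
    cleaned <- gsub("\\s*\\([^)]*\\)", "", sentence)
    cleaned
    # "This is a dog"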

Try to use Local LSA, which can improve the classification process compared to Global LSA. In addition, LSA's power depends heavily on its parameters, so try to tweak the parameters (start with 1 dimension, then 2 or more) and compare the results to enhance performance.
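
With the lsa package, the main parameter is the number of retained dimensions; a simple way to compare settings is to loop over a few values and re-score the classifier (a sketch assuming a term-document matrix td, a label vector labels, and an evaluate() helper of your own):

    library(lsa)

    # td: term-document matrix (terms in rows, documents in columns)
    # evaluate(features, labels): your own cross-validation / accuracy function
    for (k in c(1, 2, 5, 10, 20, 50, 100, 200)) {
      space    <- lsa(td, dims = k)
      features <- as.matrix(space$dk) %*% diag(space$sk, nrow = k)
      acc      <- evaluate(features, labels)
      cat("dims =", k, " accuracy =", acc, "\n")
    }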

Here's some general advice - nothing specific to LSA, but it might help improve the results nonetheless.

  1. 'Binary DocumentTermMatrix' seems to imply your data is represented by binary values, i.e. 1 for a term that exists in a document and 0 for a term that does not; moving to another scoring scheme (e.g. tf-idf) might lead to better results (see the sketch after this list).

  2. LSA is a good dimensionality-reduction technique in some cases, but less so in others. So, depending on the exact nature of your data, it might be a good idea to consider additional methods, e.g. information gain.

  3. If the main incentive for reducing the dimensionality is the one parameter with 65 levels, maybe treating this parameter specifically, e.g. by some form of quantization, would lead to a better tradeoff.
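
For point 1, switching the tm weighting from binary to tf-idf is a one-line change (a sketch reusing a preprocessed tm corpus like the one described in the question):

    library(tm)

    # corpus: the preprocessed tm corpus
    dtm_bin   <- DocumentTermMatrix(corpus, control = list(weighting = weightBin))    # current setup
    dtm_tfidf <- DocumentTermMatrix(corpus, control = list(weighting = weightTfIdf))  # suggested alternative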
