
Do I need to standardize data when doing text classification in Scikit

I am developing a spam filter using Scikit. Here are the steps I follow:

Xdata = ["This is spam" , "This is Ham" , "This is spam again"]

  1. Matrix = CountVectorizer(Xdata). Matrix will contain the count of each word across all documents, so Matrix[i][j] gives the count of word j in document i.

  2. Matrix_idfX = TfidfVectorizer(Matrix). It will compute tf-idf scores and normalize them.

  3. Matrix_idfX_select = SelectKBest(Matrix_idfX, 500). It will reduce the matrix to the 500 best-scoring columns.

  4. MultinomialNB.fit(Matrix_idfX_select). (A runnable sketch of all four steps follows this list.)
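
For concreteness, here is a minimal sketch of the four steps with scikit-learn. The label list y and the use of chi2 as the SelectKBest scoring function are assumptions for illustration; the post does not specify either.

```python
# Minimal sketch of the four steps above (assumed labels y and chi2
# scoring for SelectKBest; neither is specified in the original post).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.naive_bayes import MultinomialNB

Xdata = ["This is spam", "This is Ham", "This is spam again"]
y = [1, 0, 1]  # assumption: 1 = spam, 0 = ham

# Steps 1 + 2: TfidfVectorizer counts the words and applies tf-idf
# weighting (with L2 normalization of each row) in a single pass.
vectorizer = TfidfVectorizer()
Matrix_idfX = vectorizer.fit_transform(Xdata)

# Step 3: keep the k best-scoring columns. 500 is the value from the
# post; k must not exceed the vocabulary size, hence the min().
k = min(500, Matrix_idfX.shape[1])
Matrix_idfX_select = SelectKBest(chi2, k=k).fit_transform(Matrix_idfX, y)

# Step 4: train the multinomial Naive Bayes classifier.
clf = MultinomialNB()
clf.fit(Matrix_idfX_select, y)
```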

Now my question: do I need to perform normalization or standardization at any of the above four steps? If yes, after which step and why?

Thanks

You may want to normalize words (stemming or lemmatization) before vectorization. See the related question for an example.
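
As one possible sketch of this, assuming NLTK's WordNetLemmatizer is available (the related question may use a different approach), a lemmatizing tokenizer can be plugged directly into the vectorizer:

```python
# Sketch: plug a lemmatizing tokenizer into TfidfVectorizer.
# Assumes NLTK is installed and its WordNet data is downloaded.
import re

from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer

# import nltk; nltk.download("wordnet")  # one-time setup if needed

lemmatizer = WordNetLemmatizer()

def lemma_tokenizer(doc):
    # Lowercase, split into word tokens of 2+ characters (mirroring
    # scikit-learn's default pattern), then lemmatize each token.
    return [lemmatizer.lemmatize(tok)
            for tok in re.findall(r"\b\w\w+\b", doc.lower())]

vectorizer = TfidfVectorizer(tokenizer=lemma_tokenizer)
X = vectorizer.fit_transform(["These cats were running", "The cat runs"])
```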

NB: you don't need steps 1 and 2 as separate steps, since "TfidfVectorizer combines all the options of CountVectorizer and TfidfTransformer in a single model" (scikit-learn docs).

Also note that "While the tf–idf normalization is often very useful, there might be cases where the binary occurrence markers might offer better features. This can be achieved by using the binary parameter of CountVectorizer. In particular, some estimators such as Bernoulli Naive Bayes explicitly model discrete boolean random variables. Also, very short texts are likely to have noisy tf–idf values while the binary occurrence info is more stable." (same docs)
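
A sketch of that binary-occurrence alternative, pairing CountVectorizer(binary=True) with BernoulliNB (the toy corpus and labels y are again assumptions):

```python
# Sketch: binary occurrence features instead of tf-idf, paired with
# Bernoulli Naive Bayes. Xdata/y are an assumed toy corpus and labels.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB

Xdata = ["This is spam", "This is Ham", "This is spam again"]
y = [1, 0, 1]

binary_vectorizer = CountVectorizer(binary=True)  # 1 if word present, else 0
X_binary = binary_vectorizer.fit_transform(Xdata)

clf = BernoulliNB()
clf.fit(X_binary, y)
```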
