
Do I need to standardize data when doing text classification in Scikit

I am developing a spam filter using Scikit. Here are the steps I follow:

Xdata = ["This is spam" , "This is Ham" , "This is spam again"]

  1. Matrix = CountVectorizer(Xdata). Matrix will contain the count of each word across all documents, so Matrix[i][j] gives the count of word j in document i.

  2. Matrix_idfX = TfidfVectorizer(Matrix). It will compute tf-idf scores and normalize them.

  3. Matrix_idfX_select = SelectKBest(Matrix_idfX, 500). It will reduce the matrix to the 500 best-scoring columns.

  4. MultinomialNB.fit(Matrix_idfX_select). (A runnable sketch of all four steps follows this list.)
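
For concreteness, here is a minimal sketch of the four steps with scikit-learn. The label list y and the use of chi2 as the SelectKBest scoring function are assumptions for illustration; the post does not specify either.

```python
# Minimal sketch of the four steps above (assumed labels y and chi2
# scoring for SelectKBest; neither is specified in the original post).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.naive_bayes import MultinomialNB

Xdata = ["This is spam", "This is Ham", "This is spam again"]
y = [1, 0, 1]  # assumption: 1 = spam, 0 = ham

# Steps 1 + 2: TfidfVectorizer counts the words and applies tf-idf
# weighting (with L2 normalization of each row) in a single pass.
vectorizer = TfidfVectorizer()
Matrix_idfX = vectorizer.fit_transform(Xdata)

# Step 3: keep the k best-scoring columns. 500 is the value from the
# post; k must not exceed the vocabulary size, hence the min().
k = min(500, Matrix_idfX.shape[1])
Matrix_idfX_select = SelectKBest(chi2, k=k).fit_transform(Matrix_idfX, y)

# Step 4: train the multinomial Naive Bayes classifier.
clf = MultinomialNB()
clf.fit(Matrix_idfX_select, y)
```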

Now my question: do I need to perform normalization or standardization at any of the above four steps? If yes, after which step and why?

Thanks

You may want to normalize words (stemming or lemmatization) before vectorization. See the related question for an example.
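
As one possible sketch of this, assuming NLTK's WordNetLemmatizer is available (the related question may use a different approach), a lemmatizing tokenizer can be plugged directly into the vectorizer:

```python
# Sketch: plug a lemmatizing tokenizer into TfidfVectorizer.
# Assumes NLTK is installed and its WordNet data is downloaded.
import re

from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer

# import nltk; nltk.download("wordnet")  # one-time setup if needed

lemmatizer = WordNetLemmatizer()

def lemma_tokenizer(doc):
    # Lowercase, split into word tokens of 2+ characters (mirroring
    # scikit-learn's default pattern), then lemmatize each token.
    return [lemmatizer.lemmatize(tok)
            for tok in re.findall(r"\b\w\w+\b", doc.lower())]

vectorizer = TfidfVectorizer(tokenizer=lemma_tokenizer)
X = vectorizer.fit_transform(["These cats were running", "The cat runs"])
```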

NB: you don't need steps 1 and 2 as separate steps, since "TfidfVectorizer combines all the options of CountVectorizer and TfidfTransformer in a single model" (scikit-learn docs).

Also note that "While the tf–idf normalization is often very useful, there might be cases where the binary occurrence markers might offer better features. This can be achieved by using the binary parameter of CountVectorizer. In particular, some estimators such as Bernoulli Naive Bayes explicitly model discrete boolean random variables. Also, very short texts are likely to have noisy tf–idf values while the binary occurrence info is more stable." (same docs)
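
A sketch of that binary-occurrence alternative, pairing CountVectorizer(binary=True) with BernoulliNB (the toy corpus and labels y are again assumptions):

```python
# Sketch: binary occurrence features instead of tf-idf, paired with
# Bernoulli Naive Bayes. Xdata/y are an assumed toy corpus and labels.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB

Xdata = ["This is spam", "This is Ham", "This is spam again"]
y = [1, 0, 1]

binary_vectorizer = CountVectorizer(binary=True)  # 1 if word present, else 0
X_binary = binary_vectorizer.fit_transform(Xdata)

clf = BernoulliNB()
clf.fit(X_binary, y)
```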
