
Text Preprocessing for classification - Machine Learning

What are the important steps to preprocess Twitter texts for binary classification? What I did is remove the hashtag symbol while keeping the word itself, and I also used regular expressions to remove special characters. These are the two functions I used.

import re

def removeusername(tweet):
    # Split on '@' and '_' so usernames like "@some_user" break into plain words.
    return " ".join(word.strip() for word in re.split('@|_', tweet))

def removingSpecialchar(text):
    # Remove @mentions, URLs, and any character that is not alphanumeric or whitespace.
    return ' '.join(re.sub(r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+://\S+)", " ", text).split())

What else should I do to preprocess the text data? I have also used the NLTK stopword corpus to remove all stop words from the tokenized words.
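The stop-word step could look like the sketch below. The tiny inline stop-word set here is just for illustration; in practice you would load `nltk.corpus.stopwords.words('english')` after running `nltk.download('stopwords')`.

```python
import re

# Illustrative stop-word set; normally loaded from nltk.corpus.stopwords.
STOP_WORDS = {"a", "an", "the", "is", "are", "to", "of", "and", "in", "it"}

def remove_stopwords(tweet):
    # Lowercase, tokenize on alphanumeric runs, then drop stop words.
    tokens = re.findall(r"[a-z0-9]+", tweet.lower())
    return [t for t in tokens if t not in STOP_WORDS]

print(remove_stopwords("The weather in London is great today"))
# -> ['weather', 'london', 'great', 'today']
```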

I used the Naive Bayes classifier in TextBlob to train the data, and I am getting 94% accuracy on training data and 82% on testing data. I want to know whether there is any other method to get better accuracy. By the way, I am new to machine learning, so I have only a limited idea of all this!

Well, then you can start by playing with the size of your vocabulary. You might exclude some of the words that are too frequent in your data (without being considered stop words), and also do the same with words that appear in only one tweet (misspelled words, for example). Sklearn's CountVectorizer allows you to do this in an easy way; have a look at the min_df and max_df parameters.

Since you are working with tweets, you can also think about URL strings. Try to obtain some valuable information from links; there are lots of different options, from simple approaches based on regular expressions that retrieve the domain name of the page to more complex NLP-based methods that study the link content. Once more, it's up to you!
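The simple regex-based option might look like this sketch, which keeps only the domain of each link (the URL pattern is a deliberately rough example, not a full URL grammar):

```python
import re
from urllib.parse import urlparse

URL_RE = re.compile(r"https?://\S+")

def extract_domains(tweet):
    # Replace each URL with its domain so at least some signal survives.
    return [urlparse(u).netloc for u in URL_RE.findall(tweet)]

print(extract_domains("check this https://www.bbc.com/news/article and http://bit.ly/x1"))
# -> ['www.bbc.com', 'bit.ly']
```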

I would also have a look at how your lemmatizer handles pronouns: spaCy, for example, used to replace all of them with the placeholder lemma -PRON- by default. This is a classic simplification, but it might end in a loss of information.

For preprocessing raw data, you can try:

  • Stop word removal.
  • Stemming or Lemmatization.
  • Exclude terms that are either too common or too rare.

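To illustrate what stemming does, here is a toy suffix-stripping stemmer; it is only a sketch of the idea, and in practice you would use nltk's PorterStemmer or a proper lemmatizer instead of this hand-rolled rule:

```python
def crude_stem(word):
    # Toy suffix stripping, just to show what a stemmer does;
    # real code should use nltk.stem.PorterStemmer or spaCy's lemmatizer.
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

print([crude_stem(w) for w in ["running", "jumped", "tweets", "classes"]])
# -> ['runn', 'jump', 'tweet', 'class']
```

Even this crude version shows the point: inflected forms collapse onto a shared stem, so the model sees one feature instead of several rare ones.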
Then a second preprocessing step is possible:

  • Construct a TF-IDF matrix.
  • Construct or load pretrained word embeddings (Word2Vec, FastText, ...).

Then you can feed the result of this second step into your model.

These are just the most common methods; many others exist.

I will let you check each of these methods by yourself, but they are a good base.

There are no compulsory steps. For example, it is very common to remove stop words (also called function words) such as "yes", "no", "with". But in one of my pipelines, I skipped this step and the accuracy did not change. NLP is an experimental field, so the most important advice is to build a pipeline that runs as quickly as possible, to define your goal, and to train with different parameters.

Before you move on, you need to make sure your training set is proper. What are you training for? Is your set clean (e.g., the positive class contains only positives)? How do you define accuracy, and why?

Now, the situation you described seems like a case of over-fitting. Why? Because you get 94% accuracy on the training set, but only 82% on the test set.

This problem happens when you have a lot of features but a relatively small training dataset, so the model fits the specific training set very well but fails to generalize.

Now, you did not specify how large your dataset is, so I'm guessing between 50 and 500 tweets, which is too small given an English vocabulary of some 200k words or more. I would try one of the following options: (1) Get more training data (at least 2,000 tweets). (2) Reduce the number of features; for example, you can remove uncommon words and names, anything that appears only a small number of times. (3) Use a better classifier (Naive Bayes is rather weak for NLP); try SVM or deep learning. (4) Try regularization techniques.
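Option (3) combined with (4) can be sketched in a few lines with sklearn. The six labeled tweets are toy stand-ins for your dataset, and C=0.5 is only an example regularization setting (smaller C = stronger regularization):

```python
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Toy labeled tweets: 1 = positive, 0 = negative. Substitute your own data.
tweets = ["i love this", "great product", "this is awful",
          "worst thing ever", "really happy today", "so disappointed"]
labels = [1, 1, 0, 0, 1, 0]

# LinearSVC's C parameter controls regularization strength;
# lowering it can help against the over-fitting described above.
clf = make_pipeline(TfidfVectorizer(), LinearSVC(C=0.5))
clf.fit(tweets, labels)
print(clf.predict(["love this product", "this is the worst"]))
```

On a real dataset you would tune C (and min_df/max_df on the vectorizer) with cross-validation rather than picking values by hand.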
