Text Preprocessing for classification - Machine Learning

What are the important steps to preprocess our Twitter texts in order to classify between binary classes? What I did is remove the hashtag symbol and keep the word without it, and I also used some regular expressions to remove special characters. These are the two functions I used:

import re

def removeusername(tweet):
    # Split on '@' and '_' so handles are broken apart, then rejoin the pieces
    return " ".join(word.strip() for word in re.split('@|_', tweet))

def removingSpecialchar(text):
    # Drop @mentions, URLs, and any character that is not alphanumeric or whitespace
    return ' '.join(re.sub(r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)", " ", text).split())

What other things are there to preprocess text data? I have also used the NLTK stopword corpus to remove all stop words from the tokenized words.

I used the NaiveBayes classifier in TextBlob to train the data, and I am getting 94% accuracy on the training data and 82% on the testing data. I want to know whether there is any other method to get better accuracy. By the way, I am new to this Machine Learning field and have only a limited idea about all of it!
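For context, this is roughly what such a setup looks like with TextBlob's NaiveBayesClassifier; the labelled tweets below are made up for illustration.

```python
from textblob.classifiers import NaiveBayesClassifier

# Hypothetical labelled tweets; replace with your own (text, label) pairs
train = [("I love this phone", "pos"), ("worst service ever", "neg"),
         ("great experience", "pos"), ("never buying again", "neg")]
test = [("so happy with it", "pos"), ("totally useless", "neg")]

cl = NaiveBayesClassifier(train)
print(cl.classify("this is awesome"))  # predicted label for a new tweet
print(cl.accuracy(test))               # accuracy on the held-out pairs
```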

Well, then you can start by playing with the size of your vocabulary. You might exclude some of the words that are too frequent in your data (without being considered stop words), and also do the same with words that appear in only one tweet (misspelled words, for example). Sklearn's CountVectorizer allows you to do this in an easy way; have a look at the min_df and max_df parameters.
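A minimal sketch of that filtering, assuming `tweets` is a small made-up list standing in for your cleaned data:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical cleaned tweets; replace with your own data
tweets = ["love this phone", "worst service ever", "love the price", "great phone"]

# min_df=2: ignore words appearing in fewer than 2 tweets (typos, one-off words)
# max_df=0.9: ignore words appearing in more than 90% of tweets (near-ubiquitous terms)
vectorizer = CountVectorizer(min_df=2, max_df=0.9)
X = vectorizer.fit_transform(tweets)
print(vectorizer.vocabulary_)   # the words that survived the filtering
```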

Since you are working with tweets, you can also think about URL strings. Try to obtain some valuable information from links; there are lots of different options, from simple approaches based on regular expressions that retrieve the domain name of the page, to more complex NLP-based methods that study the link content. Once more, it's up to you!
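For the simple end of that spectrum, here is a minimal sketch that pulls the domain out of a URL using the standard library instead of a hand-written regex; the URLs are made up.

```python
from urllib.parse import urlparse

def extract_domain(url):
    # "https://www.nytimes.com/2020/01/some-article" -> "www.nytimes.com"
    return urlparse(url).netloc

print(extract_domain("https://t.co/abc123"))  # t.co
```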

I would also have a look at how pronouns are handled (spaCy's lemmatizer, for example, replaces all of them with the keyword -PRON- by default). This is a classic solution that simplifies things but might end in a loss of information.

For preprocessing raw data, you can try the following (a short sketch follows the list):

  • Stop word removal.
  • Stemming or Lemmatization.
  • Exclude terms that are either too common or too rare.
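A minimal sketch of the first two items using NLTK; it assumes the "stopwords" and "wordnet" corpora have already been downloaded via nltk.download().

```python
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

stop_words = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def preprocess(tweet):
    # lowercase, keep alphabetic tokens, drop stop words, lemmatize the rest
    tokens = tweet.lower().split()
    return [lemmatizer.lemmatize(t) for t in tokens if t.isalpha() and t not in stop_words]

print(preprocess("The cats are running faster than the dogs"))
# ['cat', 'running', 'faster', 'dog']
```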

Then a second preprocessing step is possible:

  • Construct a TF-IDF matrix.
  • Construct or load pretrained word embeddings (Word2Vec, FastText, ...).

Then you can load the result of this second step into your model; a TF-IDF sketch is shown below.
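A small sketch of building the TF-IDF matrix with scikit-learn, assuming a recent version (get_feature_names_out is available from scikit-learn 1.0) and made-up tweets:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical cleaned tweets; replace with your own data
tweets = ["love this phone", "worst service ever", "great phone great price"]

tfidf = TfidfVectorizer()
X = tfidf.fit_transform(tweets)           # sparse TF-IDF document-term matrix
print(X.shape)                            # (number of tweets, vocabulary size)
print(tfidf.get_feature_names_out())      # the vocabulary behind the columns
```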

These are just the most common methods; many others exist.

I will let you look into each of these methods by yourself, but they are a good base.

There are no compulsory steps. For example, it is very common to remove stop words (also called functional words) such as "yes", "no", "with". But in one of my pipelines, I skipped this step and the accuracy did not change. NLP is an experimental field, so the most important advice is to build a pipeline that runs as quickly as possible, define your goal, and train with different parameters.

Before you move on, you need to make sure your training set is proper. What are you training for? Is your set clean (e.g. does the positive class contain only positives)? How do you define accuracy, and why?

Now, the situation you described seems like a case of overfitting. Why? Because you get 94% accuracy on the training set but only 82% on the test set.

This problem happens when you have a lot of features but a relatively small training dataset, so the model fits the specific training set well but fails to generalize.

Now, you did not specify how large your dataset is, so I'm guessing between 50 and 500 tweets, which is too small given an English vocabulary of some 200k words or more. I would try one of the following options: (1) Get more training data (at least 2000 tweets). (2) Reduce the number of features; for example, you can remove uncommon words and names, any word that appears only a small number of times. (3) Use a better classifier (Naive Bayes is rather weak for NLP); try SVM or deep learning. (4) Try regularization techniques. A sketch of options (2) to (4) follows.
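A minimal scikit-learn sketch combining options (2) to (4): a TfidfVectorizer where min_df/max_df can prune features, and a linear SVM whose C parameter controls regularization. The tweets and labels below are made up; substitute your own preprocessed corpus and binary labels.

```python
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

# Hypothetical data; replace with your full preprocessed corpus and binary labels
tweets = ["love this phone", "worst service ever", "great phone great price",
          "never buying again", "so happy with it", "totally useless product"]
labels = ["pos", "neg", "pos", "neg", "pos", "neg"]

model = make_pipeline(
    TfidfVectorizer(),    # add min_df / max_df here to reduce features (option 2)
    LinearSVC(C=1.0),     # a linear SVM (option 3); C controls regularization strength (option 4)
)

scores = cross_val_score(model, tweets, labels, cv=3)
print(scores.mean())
```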
