
Sentiment classification with supervised learning

I am doing sentiment classification on blog posts from LiveJournal with Python's scikit-learn. I have around 40,000 posts; I use 4/5 of them as the training set and the rest as the test set.

There are 6 sentiments: ['joy', 'sadness', 'anger', 'surprise', 'love', 'fear']

I experimented with several classifiers (including Naive Bayes, SVM, SGD...), but the problem is that the predictions are very inaccurate. In fact the result is nearly trivial: almost every blog in the test set is predicted as 'joy', which is the most frequent sentiment in the training set (45%).
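For illustration, here is roughly how the label skew and the per-class results can be inspected. The X_train/X_test/y_train/y_test names are placeholders for the vectorized data, and class_weight='balanced' is only one common mitigation for this kind of imbalance, not something from my current setup:

from collections import Counter

from sklearn.linear_model import SGDClassifier
from sklearn.metrics import classification_report

# Check how skewed the label distribution is (about 45% 'joy' here).
print(Counter(y_train))

# class_weight='balanced' re-weights errors so the rare classes count more.
clf = SGDClassifier(class_weight='balanced')
clf.fit(X_train, y_train)

# Per-class precision/recall makes the "everything is joy" behaviour obvious.
print(classification_report(y_test, clf.predict(X_test)))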

The feature set basically consists of bag-of-words features (I tried unigrams and bigrams); for unigrams there are 613,822 features in total.
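Roughly, the bag-of-words step looks like this (posts is a placeholder for the list of raw blog texts):

from sklearn.feature_extraction.text import CountVectorizer

# Unigrams plus bigrams; the vocabulary size is what grows to ~600k features.
vectorizer = CountVectorizer(ngram_range=(1, 2))
X_bow = vectorizer.fit_transform(posts)
print(len(vectorizer.vocabulary_))  # number of bag-of-words features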

Besides, I added some lexicon-based features using SentiWordNet scores: for each blog I calculate the sum of positive and negative scores of the nouns, adjectives, adverbs, verbs, and all words. So for each blog there are 613,822 + 5 features.
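The lexicon features are along these lines; this is only a sketch, and the function name and the first-synset shortcut are illustrative rather than my exact code:

from nltk.corpus import sentiwordnet as swn  # needs nltk.download('sentiwordnet') and 'wordnet'

def sentiwordnet_sums(tokens, wn_pos):
    """Sum positive/negative SentiWordNet scores for tokens of one POS ('n', 'a', 'r', 'v')."""
    pos_total, neg_total = 0.0, 0.0
    for word in tokens:
        synsets = list(swn.senti_synsets(word, wn_pos))
        if synsets:  # crude: score only the first (most common) sense
            pos_total += synsets[0].pos_score()
            neg_total += synsets[0].neg_score()
    return pos_total, neg_total

print(sentiwordnet_sums(['happy', 'terrible'], 'a'))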

I also applied some feature selection methods such as chi2 to reduce the number of features, but there was no apparent improvement.
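For reference, the chi2 step looks roughly like this; k=20000 is an arbitrary example value, not the setting I actually used:

from sklearn.feature_selection import SelectKBest, chi2

selector = SelectKBest(chi2, k=20000)       # keep the 20k features most associated with the labels
X_train_sel = selector.fit_transform(X_train, y_train)
X_test_sel = selector.transform(X_test)     # apply the same selection to the test set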

scikit-learn's CountVectorizer and DictVectorizer are used to vectorize the features, and sklearn.pipeline.FeatureUnion is used to concatenate them.
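A minimal sketch of that combination, assuming posts is the list of raw texts, labels the sentiments, and lexicon_feature_dicts a hypothetical helper that returns the 5 lexicon features as one dict per post:

from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import FunctionTransformer

features = FeatureUnion([
    ('bow', CountVectorizer(ngram_range=(1, 2))),
    ('lexicon', Pipeline([
        # lexicon_feature_dicts(posts) -> [{'pos_sum': ..., 'neg_sum': ..., ...}, ...]
        ('to_dicts', FunctionTransformer(lexicon_feature_dicts, validate=False)),
        ('dict_vec', DictVectorizer()),
    ])),
])

model = Pipeline([('features', features), ('clf', MultinomialNB())])
model.fit(posts, labels)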

I guess the poor result is due to the overly large bag-of-words feature set; maybe there are too many misspelled words in the text? (I have already removed stop words and done some lemmatization.) I also suspect the lexicon-based features don't really help because the BOW feature set is so large.

I'm hoping someone can point out an obvious error in my approach, or suggest what I can do to improve the accuracy.

Thanks for any advice!

You are right: the problem is the overly large number of features, and you are over-fitting to it.

Consider the following:

1- Normalize each blog: remove numbers, punctuation, links, and HTML tags, if any.

2- Consider stemming instead of lemmatization. Stemmers are much simpler, smaller, and usually faster than lemmatizers, and for many applications their results are good enough.

http://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html

Stemming usually refers to a crude heuristic process that chops off the ends of words in the hope of achieving this goal correctly most of the time, and often includes the removal of derivational affixes. Lemmatization usually refers to doing things properly with the use of a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only.

3- I worked on a similar problem before. For feature extraction, I took the 500 most frequent words for each of the 6 sentiment classes, then removed the stems they had in common from their union. The resulting list contained about 2000 words, which I then used as the feature list. Then I used a Naive Bayes classifier. A rough sketch of this selection step is shown below.
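A rough sketch of that step, assuming tokenized_posts and labels already hold the tokenized training posts and their sentiments (both placeholders), and reading "common stems" as stems shared by all six classes:

from collections import Counter

from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

stemmer = PorterStemmer()
sentiments = ['joy', 'sadness', 'anger', 'surprise', 'love', 'fear']

# 500 most frequent stems per sentiment class.
top_per_class = {}
for s in sentiments:
    counts = Counter(stemmer.stem(tok)
                     for toks, lab in zip(tokenized_posts, labels) if lab == s
                     for tok in toks)
    top_per_class[s] = {w for w, _ in counts.most_common(500)}

# Drop stems that every class shares; keep the rest as the vocabulary (~2000 stems).
shared = set.intersection(*top_per_class.values())
vocab = sorted(set.union(*top_per_class.values()) - shared)

stemmed_docs = [' '.join(stemmer.stem(tok) for tok in toks) for toks in tokenized_posts]
X = CountVectorizer(vocabulary=vocab).fit_transform(stemmed_docs)
clf = MultinomialNB().fit(X, labels)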
