
Bias towards negative sentiments from Stanford CoreNLP

I'm experimenting with deriving sentiment from Twitter using Stanford's CoreNLP library, following https://www.openshift.com/blogs/day-20-stanford-corenlp-performing-sentiment-analysis-of-twitter-using-java - see that post for the code I'm implementing.
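For context, the core of that pipeline boils down to something like the following. This is only a minimal sketch, written in Python against Stanford's official server client from the stanza package rather than the tutorial's Java; it assumes Java is installed, a local CoreNLP distribution is pointed to by CORENLP_HOME, and the tweet text is a placeholder:

```python
# pip install stanza
# Also requires Java and a CoreNLP distribution, with the CORENLP_HOME
# environment variable pointing at the unpacked distribution.
from stanza.server import CoreNLPClient

tweet = "Just got the new phone. Battery life is amazing!"  # placeholder text

# The 'sentiment' annotator needs the constituency parser, hence 'parse'.
with CoreNLPClient(annotators=["tokenize", "ssplit", "parse", "sentiment"],
                   timeout=30000, memory="4G") as client:
    ann = client.annotate(tweet)
    for sentence in ann.sentence:
        # Labels are one of: Very negative, Negative, Neutral,
        # Positive, Very positive.
        print(sentence.sentiment)
```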

I am getting results, but I've noticed an apparent bias towards 'negative' classifications, both in my target dataset and in another dataset with ground truth labels - the Sanders Analytics Twitter Sentiment Corpus (http://www.sananalytics.com/lab/twitter-sentiment/) - even though the ground truth labels themselves show no such bias.

I'm posting this question on the off chance that someone else has experienced this and/or may know if this is the result of something I've done or some bug in the CoreNLP code.

(edit - sorry it took me so long to respond) I am posting links to plots showing what I mean. I don't have enough reputation to post the images, and can only include two links in this post, so I'll add the links in the comments.

I'd like to suggest this is simply a domain mismatch. The Stanford RNTN is trained on movie review snippets and you are testing on Twitter data. Beyond the topic mismatch, tweets also tend to be ungrammatical and to use abbreviated ("creative") language. If I had to suggest a more concrete reason, I would start with a lexical mismatch. Perhaps negative emotions are expressed in a domain-independent way, e.g. with common adjectives, while positive emotions are more domain-dependent or more subtle.

It's still interesting that you're getting a negative bias. The Pollyanna hypothesis suggests a positive bias, IMHO.

Going beyond your original question, there are several approaches to sentiment analysis designed specifically for microblogging data. See, e.g., "The Good, the Bad and the OMG!" by Kouloumpis et al.

Michael Haas correctly points out that there is a domain mismatch, which Richard Socher also confirms in the comments.

Sentences with many unknown words and imperfect punctuation tend to get flagged as negative.
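One quick way to probe this is to run the same opinion through the model phrased cleanly and phrased tweet-style. A sketch under the same assumptions as above (the stanza CoreNLP client, a local CoreNLP install); the example texts are made up:

```python
from stanza.server import CoreNLPClient

# Two paraphrases of the same positive opinion: one clean, one tweet-style
# with slang, misspellings, and missing punctuation. The hypothesis is that
# the second drifts towards "Negative".
examples = [
    "I really like this phone.",
    "omg dis fone is da best lol",
]

with CoreNLPClient(annotators=["tokenize", "ssplit", "parse", "sentiment"],
                   timeout=30000, memory="4G") as client:
    for text in examples:
        ann = client.annotate(text)
        print(text, "->", ann.sentence[0].sentiment)
```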

If you are using Python, VADER is a great tool for Twitter sentiment analysis. It is a rule-based tool with only ~300 lines of code and a custom-made lexicon for Twitter, containing ~8,000 entries including slang and emoticons.

It is easy to modify the rules as well as the lexicon, without any need for re-training. It is fully free and open source.
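A minimal usage sketch (assumes the vaderSentiment package from PyPI; the example tweet is made up):

```python
# pip install vaderSentiment
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()

# Tweet-style input: VADER's lexicon and rules cover slang, emoticons,
# capitalisation, and punctuation emphasis.
scores = analyzer.polarity_scores("OMG the new camera is sooo good!!! :D")
print(scores)  # {'neg': ..., 'neu': ..., 'pos': ..., 'compound': ...}

# The usual convention on the 'compound' score: >= 0.05 is positive,
# <= -0.05 is negative, and anything in between is neutral.
if scores["compound"] >= 0.05:
    label = "positive"
elif scores["compound"] <= -0.05:
    label = "negative"
else:
    label = "neutral"
print(label)
```

Because everything is lexicon- and rule-driven, adding a new slang term is just another lexicon entry, with no model to retrain.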
