NLTK sentiment analysis is only returning one value

Question

I seriously hate to post a question about an entire chunk of code, but I've been working on this for the past 3 hours and I can't wrap my head around what is happening. I have approximately 600 tweets that I am retrieving from a CSV file, with varying score values (between -2 and 2) reflecting the sentiment towards a presidential candidate.

However, when I run the classifier trained on this sample on any other data, only one value is returned (positive). I have checked whether the scores are being added correctly, and they are. It just doesn't make sense to me that 85,000 tweets would all be rated "positive" from a diverse training set of 600. Does anyone know what is happening here? Thanks!

import nltk
import csv
import ast

tweets = []
with open('romney.csv', 'rb') as csvfile:
    mycsv = csv.reader(csvfile)
    for row in mycsv:
        tweet = row[1]
        try:
            score = ast.literal_eval(row[12])
            if score > 0:
                print score
                print tweet
                tweets.append((tweet,"positive"))

            elif score < 0:
                print score
                print tweet
                tweets.append((tweet,"negative"))
        except ValueError:
            tweet = ""

def get_words_in_tweets(tweets):
    all_words = []
    for (words, sentiment) in tweets:
        all_words.extend(words)
    return all_words

def get_word_features(wordlist):
    wordlist = nltk.FreqDist(wordlist)
    word_features = wordlist.keys()
    return word_features

def extract_features(document):
    document_words = set(document)
    features = {}
    for word in word_features:
        features['contains(%s)' % word] = (word in document_words)
    return features

word_features = get_word_features(get_words_in_tweets(tweets))
training_set = nltk.classify.apply_features(extract_features, tweets)
classifier = nltk.NaiveBayesClassifier.train(training_set)
c = 0
with open('usa.csv', "rU") as csvfile:
    mycsv = csv.reader(csvfile)
    for row in mycsv:
        try:
            tweet = row[0]
            c = c + 1
            print classifier.classify(extract_features(tweet.split()))
        except IndexError:
            tweet = ""

Answer 1

A Naive Bayes classifier usually works best when it evaluates only the words that appear in a document and ignores the words that are absent. Since you use

features['contains(%s)' % word] = (word in document_words)

each document is represented mostly by features whose value is False.
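
To make that concrete, here is a minimal sketch that counts present and absent features for one tweet under the original scheme. The eight-word lexicon and the sample tweet are made up for illustration; a real lexicon built from 600 tweets would be far larger, so the imbalance would be much stronger.

```python
# Hypothetical toy lexicon (assumption); in the question it is built
# from the training tweets by get_word_features().
word_features = ["economy", "jobs", "taxes", "debate",
                 "vote", "great", "awful", "plan"]

def extract_features(document):
    """Original scheme: one boolean feature per lexicon word."""
    document_words = set(document)
    features = {}
    for word in word_features:
        features['contains(%s)' % word] = (word in document_words)
    return features

# Made-up sample tweet, already split into words.
features = extract_features("great plan for jobs".split())

# Count how many features are True versus False.
present = sum(1 for value in features.values() if value)
absent = len(features) - present
print("present: %d, absent: %d" % (present, absent))
```

Even with this tiny lexicon, only 3 of the 8 features are True; the rest only say which words the tweet does not contain.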

Try instead something like:

if word in document_words:
    features['contains(%s)' % word] = True

(You should probably also replace the for loop with something more efficient: rather than looping over every word in the lexicon, loop over the words that occur in the document.)
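
A rough sketch of both suggestions combined might look like the following. The names `word_feature_set` and the toy lexicon are assumptions added for illustration; `extract_features` keeps the single-argument signature from the question so it could still be passed to `nltk.classify.apply_features`.

```python
# Hypothetical toy lexicon (assumption); in the question it comes from
# get_word_features(get_words_in_tweets(tweets)).
word_features = ["economy", "jobs", "taxes", "great", "awful", "plan"]

# Assumed helper: a set makes each membership test cheap.
word_feature_set = set(word_features)

def extract_features(document):
    """Record only the lexicon words that occur in the document."""
    features = {}
    # Loop over the document's words, not over the whole lexicon.
    for word in set(document):
        if word in word_feature_set:
            features['contains(%s)' % word] = True
    return features

# Made-up sample tweet, already split into words.
print(extract_features("great plan for jobs".split()))
```

This assumes every document is a list of word tokens. In the question's code the training tweets are stored as raw strings, so `all_words.extend(words)` and `set(document)` iterate over single characters during training, while classification uses `tweet.split()`. Splitting the training tweets the same way is probably needed as well.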

solution1
2 2013-02-27 20:49:56
