
Processing a corpus so big I'm getting runtime errors

I am trying to process a big corpus of tweets (1,600,000 of them, which can be found here) with the following code, to train a Naive Bayes classifier in order to play around with sentiment analysis.

My problem is that I have never written anything that had to handle much memory or large variables.

At the moment the script runs for a while, and then after a couple of hours I get a runtime error (I'm on a Windows machine). I believe I'm not managing the list objects properly.

The program runs successfully if I limit the for loop, but that means shrinking the training set and most likely getting worse sentiment analysis results.

How can I process the whole corpus? How can I manage those lists better? Are they really what is causing the problem?

These are the imports

import pickle
import re
import os, errno
import csv
import nltk, nltk.classify.util, nltk.metrics
from nltk.classify import NaiveBayesClassifier

Here I load the corpus and create the lists where I will store the tweets and the features I extract from them

inpTweets = csv.reader(open('datasets/training.1600000.processed.noemoticon.csv', 'rb'), delimiter=',', quotechar='"')
tweets = []
featureList = []
n=0

This for loop walks through the corpus and, thanks to processTweet() (a long algorithm of mine), extracts the features from each row of the CSV

for row in inpTweets:
    sentiment = row[0]
    status_text = row[5]
    featureVector = processTweet(status_text.decode('utf-8')) 
    #to know it's doing something
    n = n + 1
    print n
    #we'll need both the featurelist and the tweets variable, carrying tweets and sentiments

Here, still inside the for loop, I extend featureList with the extracted features and append each (features, sentiment) pair to tweets.

    featureList.extend(featureVector)  
    tweets.append((featureVector, sentiment))              
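
For reference, processTweet() itself is not shown in this post. A minimal, hypothetical stand-in (not the real algorithm) that lowercases the text, strips URLs and @mentions, and returns word tokens might look like this:

import re

# Hypothetical stand-in for processTweet(); the real one is a long
# algorithm not shown in the post.
def processTweet(text):
    text = text.lower()
    text = re.sub(r'https?://\S+', '', text)   # drop URLs
    text = re.sub(r'@\w+', '', text)           # drop @mentions
    words = re.findall(r"[a-z']+", text)       # simple word tokens
    return [w for w in words if len(w) > 2]    # skip very short tokens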

When the loop ends I remove the duplicates from featureList and save it to a pickle.

featureList = list(set(featureList))
flist = open('fList.pickle', 'wb')   # binary mode for pickling, especially on Windows
pickle.dump(featureList, flist)
flist.close()
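
Since featureList only exists to be deduplicated at this point, a set can be filled during the loop instead, so duplicate copies are never held in memory. A minimal sketch of that variant (note that the tweets list, which keeps one feature vector per tweet, is likely the bigger memory consumer):

# Sketch: accumulate features in a set while looping; convert to a
# list only once, for pickling.
featureSet = set()
for row in inpTweets:
    featureVector = processTweet(row[5].decode('utf-8'))
    featureSet.update(featureVector)           # no duplicates stored
    tweets.append((featureVector, row[0]))
featureList = list(featureSet)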

I get the features ready for the classifier.

training_set = nltk.classify.util.apply_features(extract_features, tweets)
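
extract_features is referenced but never shown in the post; the usual NLTK bag-of-words version, keyed on the global featureList, looks roughly like this (a hypothetical reconstruction):

# Hypothetical reconstruction of extract_features: mark, for every
# known feature word, whether it occurs in this tweet.
def extract_features(tweet_words):
    tweet_words = set(tweet_words)
    features = {}
    for word in featureList:
        features['contains(%s)' % word] = (word in tweet_words)
    return features

Note that apply_features returns a lazy sequence, so the feature dictionaries are built on demand rather than all materialized at once.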

Then I train the classifier and save it to a pickle.

# Train the Naive Bayes classifier
print "\nTraining the classifier.."
NBClassifier = nltk.NaiveBayesClassifier.train(training_set)
fnbc = open('nb_classifier.pickle', 'wb')   # binary mode for pickling
pickle.dump(NBClassifier, fnbc)
fnbc.close()
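
To use the saved model later, the pickle can be loaded back; a quick sketch, assuming processTweet and extract_features are defined as above:

# Sketch: load the pickled classifier and label a new tweet.
f = open('nb_classifier.pickle', 'rb')
NBClassifier = pickle.load(f)
f.close()

print NBClassifier.classify(extract_features(processTweet(u'this is great')))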

edit: 19:45 GMT+1 - I had forgotten to include n=0 in this post.

edit1: Due to lack of time and computing power limitations, I chose to reduce the corpus like this:

.....
n=0
i=0
for row in inpTweets:
    i = i+1
    if (i==160):         #limiter
        i = 0
        sentiment = row[0]
        status_text = row[5]  
        n = n + 1
.....

In the end the classifier was taking ages to train anyway. About the runtime error, please see the comments. Thanks everyone for the help.
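
For what it's worth, an equivalent but slightly tidier limiter uses a modulo test on a single counter (a sketch, keeping every 160th row as above):

for n, row in enumerate(inpTweets):
    if n % 160 != 0:         # limiter: keep only every 160th row
        continue
    sentiment = row[0]
    status_text = row[5]
    featureVector = processTweet(status_text.decode('utf-8'))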

You could use csv.field_size_limit(int)

For example:

f = open('datasets/training.1600000.processed.noemoticon.csv', 'rb')
csv.field_size_limit(100000)
inpTweets = csv.reader(f, delimiter=',', quotechar='"')

You can experiment with values other than 100,000 to see what works best.

+1 on the comment about Pandas.
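
If you do go the Pandas route, here is a minimal sketch of chunked reading (assuming the same column layout as in the question, with the sentiment in column 0 and the tweet text in column 5):

import pandas as pd

# Read the CSV 10,000 rows at a time instead of holding it all at once.
for chunk in pd.read_csv('datasets/training.1600000.processed.noemoticon.csv',
                         header=None, chunksize=10000):
    for sentiment, text in zip(chunk[0], chunk[5]):
        featureVector = processTweet(text)
        # ...accumulate features and tweets as before...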

Also, you might want to check out cPickle here (up to 1000x faster).
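
A common Python 2 idiom is to fall back gracefully when cPickle is unavailable; binary mode and the highest protocol also make the dumps smaller and faster:

try:
    import cPickle as pickle    # C implementation, much faster
except ImportError:
    import pickle               # pure-Python fallback

with open('nb_classifier.pickle', 'wb') as f:
    pickle.dump(NBClassifier, f, pickle.HIGHEST_PROTOCOL)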


Check out this question/answer too!

Another relevant blog post is here.
