scikit-learn: Classification timing correct?

Question

Hi, I'm classifying tweets into 7 classes. I have about 250,000 training tweets and a separate set of about 250,000 testing tweets. My code is below. training.pkl holds the training tweets and testing.pkl the testing tweets. I also load the corresponding labels, as you can see.

When I run my code, it takes 14.9649999142 seconds to convert the (raw) testing set to a feature space. I also measure how long it takes to classify all the tweets in the testing set, which is 0.131999969482 seconds.

It seems very unlikely to me that this framework can classify about 250,000 tweets in 0.131999969482 seconds. Is this timing correct?

import time
import cPickle

import numpy as np
from sklearn import linear_model
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report


def load_pickle(path):
    # Load one pickled Python object and close the file afterwards
    with open(path, 'rb') as f:
        return cPickle.load(f)


training = load_pickle("training.pkl")
testing = load_pickle("testing.pkl")
ground_truth_testing = load_pickle("ground_truth_testing.pkl")
ground_truth_training = load_pickle("ground_truth_training.pkl")

print 'data loaded'
tweetsTestArray = np.array(testing)
tweetsTrainingArray = np.array(training)
y_train = np.array(ground_truth_training)


# Transform dataset to a design matrix with TFIDF and 1,2 gram
vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5,  ngram_range=(1, 2))

X_train = vectorizer.fit_transform(tweetsTrainingArray)
print "n_samples: %d, n_features: %d" % X_train.shape


print 'COUNT'
_t0 = time.time()
X_test = vectorizer.transform(tweetsTestArray)
print "n_samples: %d, n_features: %d" % X_test.shape
_t1 =  time.time()

print  _t1 - _t0
print 'STOP'

# TRAINING & TESTING

print 'SUPERVISED'
print '----------------------------------------------------------'
print 

print 'SGD'

# Initialize Stochastic Gradient Descent
sgd = linear_model.SGDClassifier(loss='modified_huber', alpha=0.00003, n_iter=25)

#Train
sgd.fit(X_train, ground_truth_training)

#Predict

print "START COUNT"
_t2 = time.time()
target_sgd = sgd.predict(X_test)
_t3 = time.time()

print _t3 -_t2
print "END COUNT"

# Print report
report_sgd = classification_report(ground_truth_testing, target_sgd)
print report_sgd
print

X_train printed

 <248892x213162 sparse matrix of type '<type 'numpy.float64'>'
    with 4346880 stored elements in Compressed Sparse Row format>

X_test printed

 <249993x213162 sparse matrix of type '<type 'numpy.float64'>'
    with 4205309 stored elements in Compressed Sparse Row format>

Answer 1

What is the shape and the number of non-zero features in the extracted X_train and X_test sparse matrices? Do they approximately match the number of words in your corpus?

For linear models, classification is expected to be much faster than feature extraction. Prediction is just a dot product, so its cost is linear in the number of non-zeros (i.e. approximately the number of words in your test set).
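
To make the dot-product argument concrete, here is a minimal pure-Python sketch of how a linear classifier scores sparse rows. It is not scikit-learn's actual implementation (which runs in compiled code on CSR matrices). The names `predict_sparse`, `rows`, `weights` and `biases` are made up for illustration, and each row is assumed to be a dict mapping a feature index to its value. The snippet runs under both Python 2 and Python 3.

```python
def predict_sparse(rows, weights, biases):
    """Return (predicted class per row, number of multiply-adds performed).

    rows    -- list of dicts {feature_index: value}, one per document
    weights -- list of per-class weight lists, indexed as weights[c][j]
    biases  -- list of per-class intercepts
    """
    predictions = []
    multiply_adds = 0
    for row in rows:
        scores = []
        for c in range(len(weights)):
            score = biases[c]
            # Only the stored (non-zero) entries of the row are visited.
            for j, value in row.items():
                score += weights[c][j] * value
                multiply_adds += 1
            scores.append(score)
        # The class with the highest score wins.
        predictions.append(scores.index(max(scores)))
    return predictions, multiply_adds


# Toy data: 3 documents, 5 features, 2 classes (all values are hypothetical).
rows = [{0: 1.0, 3: 0.5}, {1: 2.0}, {2: 1.0, 3: 1.0, 4: 1.0}]
weights = [[1.0, -1.0, 0.0, 0.5, 0.0],
           [-1.0, 1.0, 0.5, 0.0, 0.5]]
biases = [0.0, 0.0]

predictions, multiply_adds = predict_sparse(rows, weights, biases)
nnz = sum(len(row) for row in rows)
print(predictions)
print(multiply_adds == nnz * len(weights))
```

The work is the number of stored elements times the number of classes. For a matrix with roughly 4.2 million stored elements and 7 classes, that is on the order of 30 million multiply-adds, which compiled code can plausibly finish in a fraction of a second.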

Edit: To get stats on the content of the sparse matrices X_train and X_test, just do:

>>> print repr(X_train)
>>> print repr(X_test)

Edit 2: Your numbers look good. Linear model prediction on the numerical features is indeed much faster than feature extraction:

>>> from sklearn.datasets import fetch_20newsgroups
>>> from sklearn.feature_extraction.text import TfidfVectorizer
>>> twenty = fetch_20newsgroups()
>>> %time X = TfidfVectorizer().fit_transform(twenty.data)
CPU times: user 10.74 s, sys: 0.32 s, total: 11.06 s
Wall time: 11.04 s

>>> X
<11314x56436 sparse matrix of type '<type 'numpy.float64'>'
    with 1713894 stored elements in Compressed Sparse Row format>
>>> from sklearn.linear_model import SGDClassifier

>>> %time clf = SGDClassifier().fit(X, twenty.target)
CPU times: user 0.50 s, sys: 0.01 s, total: 0.51 s
Wall time: 0.51 s

>>> %time clf.predict(X)
CPU times: user 0.10 s, sys: 0.00 s, total: 0.11 s
Wall time: 0.11 s
array([7, 4, 4, ..., 3, 1, 8])
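
Outside IPython, where `%time` is not available, the same stage-by-stage comparison can be made with a small timing helper. The sketch below is only an illustration: `timed`, `tokenize` and the toy `docs` list are hypothetical names, and the two stages are cheap stand-ins for `vectorizer.transform` and `clf.predict`. It uses `time.time()` so that it also runs under Python 2.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results):
    """Store the wall-clock seconds spent in the enclosed block in results[label]."""
    start = time.time()
    try:
        yield
    finally:
        results[label] = time.time() - start


def tokenize(docs):
    # Stand-in for the expensive text-to-features step.
    return [doc.lower().split() for doc in docs]


docs = ["Some tweet text", "Another short tweet", "Third one"] * 1000
timings = {}

with timed("extract", timings):
    features = tokenize(docs)

with timed("predict", timings):
    # Stand-in for prediction: a trivial rule on the token count.
    labels = [len(tokens) % 2 for tokens in features]

# Printing happens outside the timed regions so it does not skew the numbers.
for stage in ("extract", "predict"):
    print("%s: %.6f s" % (stage, timings[stage]))
```

Keeping each stage in its own timed block, with no printing inside, makes it easy to see which step dominates the total runtime.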

solution1
2 ACCEPTED 2013-01-09 11:51:53
