scikit-learn: Classification timing correct?
Hi, I am classifying tweets into 7 classes. I have about 250,000 training tweets and another 250,000 different test tweets. My code is shown below: training.pkl contains the training tweets and testing.pkl the test tweets, and as you can see I also have the corresponding labels.
When I run the code, I find that transforming the test set (raw text) into the feature space takes 14.9649999142 seconds. I also measured the time needed to classify all tweets in the test set: 0.131999969482 seconds.
It seems unlikely to me that the framework can classify 250,000 tweets in 0.131999969482 seconds. My question is: is this correct?
import time
import cPickle
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn import linear_model
from sklearn.metrics import classification_report

with open("training.pkl", 'rb') as f:
    training = cPickle.load(f)
with open("testing.pkl", 'rb') as f:
    testing = cPickle.load(f)
with open("ground_truth_testing.pkl", 'rb') as f:
    ground_truth_testing = cPickle.load(f)
with open("ground_truth_training.pkl", 'rb') as f:
    ground_truth_training = cPickle.load(f)
print 'data loaded'
tweetsTestArray = np.array(testing)
tweetsTrainingArray = np.array(training)
y_train = np.array(ground_truth_training)
# Transform dataset to a design matrix with TFIDF and 1,2 gram
vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5, ngram_range=(1, 2))
X_train = vectorizer.fit_transform(tweetsTrainingArray)
print "n_samples: %d, n_features: %d" % X_train.shape
print 'COUNT'
_t0 = time.time()
X_test = vectorizer.transform(tweetsTestArray)
print "n_samples: %d, n_features: %d" % X_test.shape
_t1 = time.time()
print _t1 - _t0
print 'STOP'
# TRAINING & TESTING
print 'SUPERVISED'
print '----------------------------------------------------------'
print
print 'SGD'
# Initialize Stochastic Gradient Descent
sgd = linear_model.SGDClassifier(loss='modified_huber', alpha=0.00003, n_iter=25)
#Train
sgd.fit(X_train, ground_truth_training)
#Predict
print "START COUNT"
_t2 = time.time()
target_sgd = sgd.predict(X_test)
_t3 = time.time()
print _t3 - _t2
print "END COUNT"
# Print report
report_sgd = classification_report(ground_truth_testing, target_sgd)
print report_sgd
print
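As an aside on the measurement itself: a single `time.time()` difference measures one run and can be noisy. A more robust sketch using the standard `timeit` module (the statement below is a hypothetical stand-in, not the actual `sgd.predict` call):

```python
import timeit

# timeit repeats the statement many times; the minimum over several
# repeats is a more stable estimate than a one-shot wall-clock delta.
setup = "import numpy as np; x = np.random.rand(250000)"
stmt = "x.sum()"
best = min(timeit.repeat(stmt, setup=setup, repeat=3, number=100))
print("best of 3 x 100 runs: %f s" % best)
```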
Printing X_train gives:
<248892x213162 sparse matrix of type '<type 'numpy.float64'>'
with 4346880 stored elements in Compressed Sparse Row format>
Printing X_test gives:
<249993x213162 sparse matrix of type '<type 'numpy.float64'>'
with 4205309 stored elements in Compressed Sparse Row format>
What are the shape and the number of non-zero features of the extracted X_train
and X_test
sparse matrices? Are they approximately related to the number of words in your corpus?
Classification is expected to be much faster than feature extraction for linear models. It just computes a dot product, hence it is directly linear in the number of non-zeros (i.e. approximately the number of words in the test set).
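To see why prediction is so cheap, here is a minimal sketch (modern Python, made-up dimensions, plain NumPy/SciPy rather than the actual SGDClassifier) of what predicting with a trained linear model amounts to: one sparse-dense matrix product and a row-wise argmax, with cost linear in the number of stored non-zeros.

```python
import numpy as np
import scipy.sparse as sp

# Made-up dimensions roughly mirroring the question: 7 classes and a
# sparse TF-IDF-like test matrix with very few non-zeros per row.
n_samples, n_features, n_classes = 1000, 5000, 7
rng = np.random.RandomState(0)
X_test = sp.random(n_samples, n_features, density=0.001,
                   format='csr', random_state=rng)

# A fitted linear classifier is just a weight matrix and a bias vector
# (scikit-learn stores these as coef_ and intercept_).
W = rng.randn(n_classes, n_features)
b = rng.randn(n_classes)

# Prediction: one sparse-dense product plus an argmax per row.
# The product only touches X_test.nnz entries, hence the sub-second timing.
scores = np.asarray(X_test.dot(W.T)) + b
predictions = scores.argmax(axis=1)

print(predictions.shape)  # (1000,)
```

This is why transforming raw text into the feature space dominates the runtime: tokenization and hashing touch every character, while prediction only touches the stored non-zeros.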
EDIT: to get statistics on the contents of the sparse matrices X_train
and X_test
:
>>> print repr(X_train)
>>> print repr(X_test)
EDIT 2: your numbers look fine. Prediction with a linear model on numerical features is indeed much faster than feature extraction:
>>> from sklearn.datasets import fetch_20newsgroups
>>> from sklearn.feature_extraction.text import TfidfVectorizer
>>> twenty = fetch_20newsgroups()
>>> %time X = TfidfVectorizer().fit_transform(twenty.data)
CPU times: user 10.74 s, sys: 0.32 s, total: 11.06 s
Wall time: 11.04 s
>>> X
<11314x56436 sparse matrix of type '<type 'numpy.float64'>'
with 1713894 stored elements in Compressed Sparse Row format>
>>> from sklearn.linear_model import SGDClassifier
>>> %time clf = SGDClassifier().fit(X, twenty.target)
CPU times: user 0.50 s, sys: 0.01 s, total: 0.51 s
Wall time: 0.51 s
>>> %time clf.predict(X)
CPU times: user 0.10 s, sys: 0.00 s, total: 0.11 s
Wall time: 0.11 s
array([7, 4, 4, ..., 3, 1, 8])