
Need to better understand Python scikit-learn fit predict looping vs linear result

Here is my Python code (2.7; I learned on Python 3, so I use the __future__ print_function to get the print formatting I am used to). It uses learning code against a revision of scikit-learn that is locked down by company IT policy, and it uses the SVC engine. What I do not understand is that the results I get for the +/- 1 case differ between the first block (using simple_clf) and the second. Structurally, though, I believe they are identical: the first processes the complete data array in one pass, while the second just uses the data from the array one item at a time. Yet the results do not match. The values generated for the mean (average) score should be decimal fractions (0.0 to 1.0). In some cases the differences are tiny, but in others they are large enough to make me ask my question.

from __future__ import print_function
import os
import numpy as np
from numpy import array, loadtxt
from sklearn import cross_validation, datasets, svm, preprocessing, grid_search
from sklearn.cross_validation import train_test_split
from sklearn.metrics import precision_score

GRADES = ['A+', 'A', 'A-', 'B+', 'B', 'B-', 'C+', 'C', 'C-', 'M']

# Initial processing
featurevecs = loadtxt( FEATUREVECFILE )
f = open( SCORESFILE )
scorelines = f.readlines()[ 1: ] # Skip header line
f.close()
scorenums = [ GRADES.index( l.split( '\t' )[ 1 ] ) for l in scorelines ]
scorenums = array( scorenums )

# Need this step to normalize the feature vectors
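# (later scikit-learn releases renamed Scaler to StandardScaler)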
scaler = preprocessing.Scaler()
scaler.fit( featurevecs )
featurevecs = scaler.transform( featurevecs )

# Break up the vector into a training and testing vector
# Need to keep the training set somewhat large to get enough of the
# scarce results in the training set or the learning fails
X_train, X_test, y_train, y_test = train_test_split(
    featurevecs, scorenums, test_size = 0.333, random_state = 0 )

# Define a range of parameters we can use to do a grid search
# for the 'best' ones.
CLFPARAMS = {'gamma':[.0025, .005, 0.09, .01, 0.011, .02, .04],
             'C':[200, 300, 400, 500, 600]}

# do a simple cross validation
simple_clf = svm.SVC()
simple_clf = grid_search.GridSearchCV( simple_clf, CLFPARAMS, cv = 3 )
simple_clf.fit( X_train, y_train )
y_true, y_pred = y_test, simple_clf.predict( X_test )
match = 0
close = 0
count = 0
deviation = []
for i in range( len( y_true ) ):
    count += 1
    delta = np.abs( y_true[ i ] - y_pred[ i ] )
    if( delta == 0 ):
        match += 1
    elif( delta == 1 ):
        close += 1
    deviation = np.append( deviation, 
                           float( np.sum( np.abs( delta ) <= 1 ) ) )
avg = float( match ) / float( count )
close_avg = float( close ) / float( count )
# Note: deviation.mean() == avg + close_avg
print( '{0} Accuracy (+/- 0) {1:0.4f} Accuracy (+/- 1) {2:0.4f} (+/- {3:0.4f}) '.format(
    test_type, avg, deviation.mean(), deviation.std() / 2.0 ), end = "" )

# "Original" code
# do LeaveOneOut item by item
clf = svm.SVC()
clf = grid_search.GridSearchCV( clf, CLFPARAMS, cv = 3 )
toleratePara = 1
thecurrentScoreGraded = []
loo = cross_validation.LeaveOneOut( n = len( featurevecs ) )
for train, test in loo:
    try:
        clf.fit( featurevecs[ train ], scorenums[ train ] )
        rawPredictionResult = clf.predict( featurevecs[ test ] )

        errorVec = scorenums[ test ] - rawPredictionResult
        print( len( errorVec ), errorVec )
        thecurrentScoreGraded = np.append( thecurrentScoreGraded,
            float( np.sum( np.abs( errorVec ) <= toleratePara ) ) / len( errorVec ) )
    except ValueError:
        pass
print( '{0} Accuracy (+/- {1:d}) {2:0.4f} (+/- {3:0.4f})'.format(
    test_type, toleratePara, thecurrentScoreGraded.mean(), thecurrentScoreGraded.std() / 2 ) )
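To be explicit about the metric: both blocks compute the fraction of predictions within one grade step of the truth, so deviation.mean() above should equal avg + close_avg. Factored out as a standalone helper (tolerance_accuracy is just my own name for it, not a scikit-learn function):

def tolerance_accuracy( y_true, y_pred, tol = 1 ):
    # Fraction of predictions within `tol` grade steps of the truth
    diffs = np.abs( np.asarray( y_true ) - np.asarray( y_pred ) )
    return float( np.sum( diffs <= tol ) ) / len( diffs )

With that, the first block's +/- 1 number is tolerance_accuracy( y_true, y_pred ) over a single test split, while the LOO loop above averages the same quantity over one-item test sets.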

Here are my results, and you can see that they do not match. My actual work assignment is to find out whether changing exactly which kind of data is collected for the learning engine helps accuracy, or whether merging the data into larger teaching vectors helps, so you will see I am working through a bunch of combinations. Each pair of lines is for one kind of learning data; the first line of the pair is my result, the second is the result based on the "original" code.

original Accuracy (+/- 0) 0.2771 Accuracy (+/- 1) 0.6024 (+/- 0.2447) 
                        original Accuracy (+/- 1) 0.6185 (+/- 0.2429)
upostancurv Accuracy (+/- 0) 0.2718 Accuracy (+/- 1) 0.6505 (+/- 0.2384) 
                        upostancurv Accuracy (+/- 1) 0.6417 (+/- 0.2398)
npostancurv Accuracy (+/- 0) 0.2718 Accuracy (+/- 1) 0.6505 (+/- 0.2384) 
                        npostancurv Accuracy (+/- 1) 0.6417 (+/- 0.2398)
tancurv Accuracy (+/- 0) 0.2330 Accuracy (+/- 1) 0.5825 (+/- 0.2466) 
                        tancurv Accuracy (+/- 1) 0.5831 (+/- 0.2465)
npostan Accuracy (+/- 0) 0.3398 Accuracy (+/- 1) 0.7379 (+/- 0.2199) 
                        npostan Accuracy (+/- 1) 0.7003 (+/- 0.2291)
nposcurv Accuracy (+/- 0) 0.2621 Accuracy (+/- 1) 0.5825 (+/- 0.2466) 
                        nposcurv Accuracy (+/- 1) 0.5961 (+/- 0.2453)
upostan Accuracy (+/- 0) 0.3398 Accuracy (+/- 1) 0.7379 (+/- 0.2199) 
                        upostan Accuracy (+/- 1) 0.7003 (+/- 0.2291)
uposcurv Accuracy (+/- 0) 0.2621 Accuracy (+/- 1) 0.5825 (+/- 0.2466) 
                        uposcurv Accuracy (+/- 1) 0.5961 (+/- 0.2453)
upos Accuracy (+/- 0) 0.3689 Accuracy (+/- 1) 0.6990 (+/- 0.2293) 
                        upos Accuracy (+/- 1) 0.6450 (+/- 0.2393)
npos Accuracy (+/- 0) 0.3689 Accuracy (+/- 1) 0.6990 (+/- 0.2293) 
                        npos Accuracy (+/- 1) 0.6450 (+/- 0.2393)
curv Accuracy (+/- 0) 0.1553 Accuracy (+/- 1) 0.4854 (+/- 0.2499) 
                        curv Accuracy (+/- 1) 0.5570 (+/- 0.2484)
tan Accuracy (+/- 0) 0.3107 Accuracy (+/- 1) 0.7184 (+/- 0.2249) 
                        tan Accuracy (+/- 1) 0.7231 (+/- 0.2237)

What do you mean by "structurally they are identical"? You use different subsets for training and testing, and they have different sizes. If the training data you use is not exactly the same, I do not see why you would expect the results to be the same.
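To make that concrete, here is a minimal sketch with synthetic data (nothing from your setup; just numpy and svm.SVC, with arbitrary gamma/C values) showing that the same estimator fitted on two different subsets of the same data generally predicts differently:

import numpy as np
from sklearn import svm

rng = np.random.RandomState( 0 )
X = rng.rand( 100, 5 )
y = rng.randint( 0, 3, 100 )

# Same estimator, two different (overlapping) training subsets
clf_a = svm.SVC( gamma = 0.01, C = 300 ).fit( X[ :66 ], y[ :66 ] )
clf_b = svm.SVC( gamma = 0.01, C = 300 ).fit( X[ 33: ], y[ 33: ] )

# The fitted models differ, so the predictions typically differ too
print( ( clf_a.predict( X ) != clf_b.predict( X ) ).sum(), 'predictions differ' )

Your first block trains on one two-thirds split; the LOO loop trains on n different (n-1)-item subsets, so there is no reason the two numbers should coincide exactly.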

By the way, also see the note on LOO in the documentation: LOO can have high variance.
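If that variance is a concern, averaging the same +/- 1 score over k folds instead of n leave-one-out folds is the usual compromise. A minimal sketch, reusing clf, featurevecs, and scorenums from your code, and assuming a scikit-learn 0.14+ release where cross_validation.KFold takes (n, n_folds, ...) (newer releases moved it to sklearn.model_selection):

kf = cross_validation.KFold( n = len( featurevecs ), n_folds = 10,
                             shuffle = True, random_state = 0 )
fold_scores = []
for train, test in kf:
    clf.fit( featurevecs[ train ], scorenums[ train ] )
    errorVec = scorenums[ test ] - clf.predict( featurevecs[ test ] )
    fold_scores.append( float( np.sum( np.abs( errorVec ) <= 1 ) ) / len( errorVec ) )
fold_scores = np.array( fold_scores )
print( '10-fold Accuracy (+/- 1) {0:0.4f} (+/- {1:0.4f})'.format(
    fold_scores.mean(), fold_scores.std() / 2 ) )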
