
Python, sklearn: Order of Pipeline operation with MinMaxScaler and SVC

I have a dataset that I want to run sklearn's SVM classifier (SVC) on. Some of the feature values have magnitudes in the range [0, 1e+7]. When I run SVC without preprocessing, I either get unacceptably long compute times or zero true positive predictions, so I am trying to add a preprocessing step, specifically MinMaxScaler.
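For reference, a minimal standalone sketch (toy numbers, not my actual data) of what MinMaxScaler does to a column in that magnitude range:

import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy column spanning roughly the same magnitude range as my largest feature.
X_toy = np.array([[0.0], [3.5e3], [2.0e6], [1.0e7]])

# Each value is mapped to (x - col_min) / (col_max - col_min),
# so the whole column ends up inside [0, 1].
print(MinMaxScaler(feature_range=(0, 1)).fit_transform(X_toy))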

My code so far:

# Imports for the pre-0.18 sklearn API used below (y / n_iter in StratifiedShuffleSplit)
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC
from sklearn.cross_validation import StratifiedShuffleSplit
from sklearn.grid_search import GridSearchCV

# Feature extraction: PCA components and the k best univariate features, concatenated.
selection_KBest = SelectKBest()
selection_PCA = PCA()
combined_features = FeatureUnion([("pca", selection_PCA),
                                  ("univ_select", selection_KBest)])
param_grid = dict(features__pca__n_components=range(feature_min, feature_max),
                  features__univ_select__k=range(feature_min, feature_max))
svm = SVC()
pipeline = Pipeline([("features", combined_features),
                     ("scale", MinMaxScaler(feature_range=(0, 1))),
                     ("svm", svm)])
param_grid["svm__C"] = [0.1, 1, 10]
cv = StratifiedShuffleSplit(y=labels_train,
                            n_iter=10,
                            test_size=0.1,
                            random_state=42)
grid_search = GridSearchCV(pipeline,
                           param_grid=param_grid,
                           verbose=1,
                           cv=cv)
grid_search.fit(features_train, labels_train)
print("(grid_search.best_estimator_): ", grid_search.best_estimator_)

My question is specific to this line:

pipeline = Pipeline([("features", combined_features), 
                     ("scale", MinMaxScaler(feature_range=(0, 1))),
                     ("svm", svm)])

I would like to know what the best logic is for my program, and therefore the best order of features, scale, and svm within pipeline. Specifically, I cannot decide whether features and scale should be swapped from the order shown above.

Note 1: I would like to use grid_search.best_estimator_ as my Classifier model going forward for predictions.

Note 2: My concern is how to formulate pipeline correctly so that, at the prediction step, the features are selected and scaled in exactly the same way as during training.

Note 3: I notice that svm doesn't appear in my grid_search.best_estimator_ result. Does this mean it is not being invoked correctly?

Below are some results that indicate that order may matter:

pipeline = Pipeline([("scale", MinMaxScaler(feature_range=(0, 1))),
                     ("features", combined_features), 
                     ("svm", svm)]):

Pipeline(steps=[('scale', MinMaxScaler(copy=True, feature_range=(0, 1)))
('features', FeatureUnion(n_jobs=1, transformer_list=[('pca', PCA(copy=True, 
n_components=11, whiten=False)), ('univ_select', SelectKBest(k=2, 
score_func=<function f_classif at 0x000000001ED61208>))], 
transformer_weights=...f', max_iter=-1, probability=False, 
random_state=None, shrinking=True, tol=0.001, verbose=False))])

Accuracy: 0.86247   Precision: 0.38947  Recall: 0.05550 
F1: 0.09716 F2: 0.06699 Total predictions: 15000    
True positives:  111    False positives:  174   
False negatives: 1889   True negatives: 12826


pipeline = Pipeline([("features", combined_features),
                     ("scale", MinMaxScaler(feature_range=(0, 1))), 
                     ("svm", svm)]):

Pipeline(steps=[('features', FeatureUnion(n_jobs=1,
transformer_list=[('pca', PCA(copy=True, n_components=1, whiten=False)), 
('univ_select', SelectKBest(k=1, score_func=<function f_classif at   
0x000000001ED61208>))],
transformer_weights=None)), ('scale', MinMaxScaler(copy=True, feature_range=
(0,...f', max_iter=-1, probability=False, random_state=None,
shrinking=True, tol=0.001, verbose=False))])

Accuracy: 0.86680   Precision: 0.50463  Recall: 0.05450 
F1: 0.09838 F2: 0.06633 Total predictions: 15000    
True positives:  109    False positives:  107   
False negatives: 1891   True negatives: 12893

EDIT 1 16041310: Note 3 resolved. Use grid_search.best_estimator_.steps to see the full list of steps.


There is a refit parameter in GridSearchCV (which defaults to True), which means that the best estimator is refit on the full dataset; you can then access this estimator via best_estimator_, or simply call methods such as predict directly on the fitted GridSearchCV object.

The best_estimator_ will be the full pipeline; if you call predict on it, you get exactly the same preprocessing steps as in your training stage.
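As a quick sanity check (a sketch only, assuming features_test holds your held-out samples), both calls below run the identical fitted pipeline, i.e. FeatureUnion -> MinMaxScaler -> SVC in the training order:

pred_a = grid_search.best_estimator_.predict(features_test)
pred_b = grid_search.predict(features_test)  # with refit=True this delegates to best_estimator_
assert (pred_a == pred_b).all()              # features_test is assumed, not defined in your snippet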

If you want to print out all the steps, you could do

print(grid_search.best_estimator_.steps)

or

for name, step in grid_search.best_estimator_.steps:
    # steps is a list of (name, estimator) tuples, so unpack each one
    print(name, type(step))
    print(step.get_params())
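If you specifically want to confirm that the svm step is there (your Note 3), you can also look up individual steps by name through the pipeline's named_steps mapping; a small sketch:

best_pipe = grid_search.best_estimator_
print(best_pipe.named_steps["svm"])                    # the fitted SVC
print(best_pipe.named_steps["svm"].get_params()["C"])  # the C value the grid search selected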
