
Pipeline and GridSearchCV, and Multi-Class challenge for XGBoost and RandomForest

I am working on workflows using Pipeline and GridSearchCV.

A minimal working example (MWE) for RandomForest is below:

#################################################################
# Libraries
#################################################################
import time
import pandas as pd
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

#################################################################
# Data loading and Symlinks
#################################################################
train = pd.read_csv("data_train.csv")
test = pd.read_csv("data_test.csv")

#################################################################
# Train Test Split
#################################################################
# Selected features - Training data
X = train.drop(columns='fault_severity')

# Training data
y = train.fault_severity

# Test data
x = test

# Break off validation set from training data
X_train, X_valid, y_train, y_valid = train_test_split(X, y, train_size=0.8, test_size=0.2, random_state=0)

#################################################################
# Pipeline
#################################################################
pipe_rf = Pipeline([
    ('clf', RandomForestClassifier(random_state=0))
    ])

parameters_rf = {
        'clf__n_estimators':[30,40], 
        'clf__criterion':['entropy'], 
        'clf__min_samples_split':[15,20], 
        'clf__min_samples_leaf':[3,4]
    }

grid_rf = GridSearchCV(pipe_rf,
    param_grid=parameters_rf,
    scoring='neg_mean_absolute_error',
    cv=5,
    refit=True) 

#################################################################
# Modeling
#################################################################
start_time = time.time()

grid_rf.fit(X_train, y_train)

#Calculate the score once and use when needed
mae = grid_rf.score(X_valid,y_valid)

print("Best params                        : %s" % grid_rf.best_params_)
print("Best training data MAE score       : %s" % grid_rf.best_score_)    
print("Best validation data MAE score (*) : %s" % mae)
print("Modeling time                      : %s" % time.strftime("%H:%M:%S", time.gmtime(time.time() - start_time)))

#################################################################
# Prediction
#################################################################
#Predict using the test data with selected features
y_pred = grid_rf.predict(x)

# Transform numpy array to dataframe
y_pred = pd.DataFrame(y_pred)

# Rearrange dataframe
y_pred.columns = ['prediction']
y_pred.insert(0, 'id', x['id'])

# Save to CSV
y_pred.to_csv("data_predict.csv", index = False, header=True)
#Output
# id,prediction
# 11066,0
# 18000,2
# 16964,0
# ...., ....

I also have an MWE for XGBoost, below:

#################################################################
# Libraries
#################################################################
import time
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
import xgboost as xgb
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

#################################################################
# Data loading and Symlinks
#################################################################
train = pd.read_csv("data_train.csv")
test = pd.read_csv("data_test.csv")

#################################################################
# Train Test Split
#################################################################

# Selected features - Training data
X = train.drop(columns='fault_severity')

# Training data
y = train.fault_severity

# Test data
x = test

# Break off validation set from training data
X_train, X_valid, y_train, y_valid = train_test_split(X, y, train_size=0.8, test_size=0.2, random_state=0)

#################################################################
# DMatrix
#################################################################
dtrain = xgb.DMatrix(data=X_train, label=y_train)
dtest = xgb.DMatrix(data=test)

params = {
    'max_depth': 6,
    'objective': 'multi:softprob',  # error evaluation for multiclass training
    'num_class': 3,
    'n_gpus': 0
}

#################################################################
# Modeling
#################################################################
start_time = time.time()
bst = xgb.train(params, dtrain)

#################################################################
# Prediction
#################################################################
#Predict using the test data with selected features
y_pred = bst.predict(dtest)

# Transform numpy array to dataframe
y_pred = pd.DataFrame(y_pred)

# Rearrange dataframe
y_pred.columns = ['prediction_0', 'prediction_1', 'prediction_2']
y_pred.insert(0, 'id', x['id'])

# Save to CSV
y_pred.to_csv("data_predict_xgb.csv", index = False, header=True)
# Expected Output:
# id,prediction_0,prediction_1,prediction_2
# 11066,0.4674369,0.46609518,0.06646795
# 18000,0.7578633,0.19379888,0.048337903
# 16964,0.9296321,0.04505246,0.025315404
# ...., ...., ...., ....

Questions:

  1. How can the XGBoost MWE be converted to use Pipeline and GridSearchCV, as in the RandomForest MWE? I need to set 'num_class', which XGBRegressor() does not support.

  2. How can RandomForest produce multi-class prediction output like XGBoost (i.e. prediction_0, prediction_1, prediction_2)? Sample outputs are given in the MWEs above. I found that num_class is not supported by RandomForestClassifier.

I have spent several days on this and am still blocked. I would appreciate some pointers to move forward.

Data:

  1. data_train: https://www.dropbox.com/s/bnomyoidkcgyb2y/data_train.csv
  2. data_test: https://www.dropbox.com/s/kn1bgde3hsf6ngy/data_test.csv

I presume that in your first question you did not mean to refer to XGBRegressor.

In order to allow an XGBClassifier to run in the pipeline, you simply need to change the initial definition of the pipeline:

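params = {
    'max_depth': 6,
    'objective': 'multi:softprob',  # multiclass objective that outputs class probabilities
    # num_class is omitted: XGBClassifier infers it from the target during fit
}
pipe_xgb = Pipeline([
    ('clf', xgb.XGBClassifier(**params))
])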

(Note: I've changed the pipeline name to pipe_xgb, so you would need to change this in the rest of your code.)

As the answer to this question explains, XGBClassifier automatically switches to multiclass classification when the target variable has more than two classes, so you neither can nor need to specify num_class yourself.

You should also switch to a classification metric: each of your examples uses MAE, which is a regression metric.
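As a rough sketch of what scoring with classification metrics can look like, the example below evaluates several of scikit-learn's built-in scorers in one grid search. The synthetic data and the choice of 'f1_macro' and 'neg_log_loss' alongside 'accuracy' are assumptions for illustration; the right metric depends on how the predictions will be judged.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Synthetic 3-class data standing in for the real training set (assumption)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=5,
    n_classes=3, random_state=0)

pipe = Pipeline([('clf', RandomForestClassifier(random_state=0))])

# Several classification scorers at once; refit names the one used to
# choose best_params_ and best_score_
grid = GridSearchCV(
    pipe,
    param_grid={'clf__n_estimators': [10, 20]},
    scoring={'accuracy': 'accuracy',
             'f1_macro': 'f1_macro',
             'neg_log_loss': 'neg_log_loss'},
    refit='accuracy',
    cv=3)
grid.fit(X_demo, y_demo)

print("Best params       :", grid.best_params_)
print("Mean CV accuracy  :", grid.cv_results_['mean_test_accuracy'])
print("Mean CV macro F1  :", grid.cv_results_['mean_test_f1_macro'])
print("Mean CV -log loss :", grid.cv_results_['mean_test_neg_log_loss'])
```

If the goal is well-calibrated class probabilities rather than hard labels, 'neg_log_loss' is likely the more relevant of the three.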

Here's a complete example of your code, using XGBClassifier with accuracy as the metric:

#################################################################
# Libraries
#################################################################
import time
import pandas as pd
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

import xgboost as xgb

#################################################################
# Data loading and Symlinks
#################################################################
train = pd.read_csv("https://dl.dropbox.com/s/bnomyoidkcgyb2y/data_train.csv?dl=0")
test = pd.read_csv("https://dl.dropbox.com/s/kn1bgde3hsf6ngy/data_test.csv?dl=0")

#################################################################
# Train Test Split
#################################################################
# Selected features - Training data
X = train.drop(columns='fault_severity')

# Training data
y = train.fault_severity

# Test data
x = test

# Break off validation set from training data
X_train, X_valid, y_train, y_valid = train_test_split(X, y, train_size=0.8, test_size=0.2, random_state=0)


#################################################################
# Pipeline
#################################################################
params = {
    'max_depth': 6,
    'objective': 'multi:softprob',  # multiclass objective; num_class is inferred from y
}
pipe_xgb = Pipeline([
    ('clf', xgb.XGBClassifier(**params))
    ])

# criterion and min_samples_* are RandomForest parameters and do not apply
# to XGBoost, so the grid searches XGBoost's own settings (illustrative values)
parameters_xgb = {
        'clf__n_estimators':[30,40],
        'clf__learning_rate':[0.1,0.3],
        'clf__min_child_weight':[1,3]
    }

grid_xgb = GridSearchCV(pipe_xgb,
    param_grid=parameters_xgb,
    scoring='accuracy',
    cv=5,
    refit=True)

#################################################################
# Modeling
#################################################################
start_time = time.time()

grid_xgb.fit(X_train, y_train)

#Calculate the score once and use when needed
acc = grid_xgb.score(X_valid,y_valid)

print("Best params                        : %s" % grid_xgb.best_params_)
print("Best training data accuracy        : %s" % grid_xgb.best_score_)    
print("Best validation data accuracy (*)  : %s" % acc)
print("Modeling time                      : %s" % time.strftime("%H:%M:%S", time.gmtime(time.time() - start_time)))

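#################################################################
# Prediction
#################################################################
# Predict on the validation data so the result can be scored against y_valid
y_pred = grid_xgb.predict(X_valid)

# Transform numpy array to dataframe
y_pred = pd.DataFrame(y_pred)

# Rearrange dataframe. The ids come from X_valid (assuming the training data
# has the same 'id' column as the test data); .values avoids aligning
# X_valid's shuffled index against y_pred's fresh 0..n-1 index
y_pred.columns = ['prediction']
y_pred.insert(0, 'id', X_valid['id'].values)
print("Validation accuracy                : %s" % accuracy_score(y_valid, y_pred.prediction))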

Edit to address additional question in a comment.

You can use the predict_proba method of xgb's sklearn API to get probabilities for each class:

# Predict on the test data x so that the rows line up with x['id']
y_pred = pd.DataFrame(grid_xgb.predict_proba(x),
                      columns=['prediction_0', 'prediction_1', 'prediction_2'])
y_pred.insert(0, 'id', x['id'])

With the above code, y_pred has the following format (the exact probabilities depend on the fitted model):

      id  prediction_0  prediction_1  prediction_2
0  11066      0.490955      0.436085      0.072961
1  18000      0.718351      0.236274      0.045375
2  16964      0.920252      0.052558      0.027190
3   4795      0.958216      0.021558      0.020226
4   3392      0.306204      0.155550      0.538246
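The same approach should carry over to the RandomForest pipeline from the question, since RandomForestClassifier also exposes predict_proba and needs no num_class. The sketch below uses synthetic data and made-up ids (X_demo, y_demo and ids are hypothetical stand-ins), and takes the column names from the fitted classifier's classes_ rather than hard-coding them.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline

# Synthetic 3-class data and hypothetical ids standing in for the real files
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=5,
    n_classes=3, random_state=0)
ids = pd.Series(range(1000, 1300), name='id')

X_tr, X_te, y_tr, y_te, id_tr, id_te = train_test_split(
    X_demo, y_demo, ids, test_size=0.2, random_state=0)

pipe_rf = Pipeline([('clf', RandomForestClassifier(random_state=0))])
grid_rf = GridSearchCV(
    pipe_rf,
    param_grid={'clf__n_estimators': [30, 40]},
    scoring='accuracy',
    cv=3,
    refit=True)
grid_rf.fit(X_tr, y_tr)

# One probability column per class, named after the labels seen during fit
classes = grid_rf.best_estimator_.named_steps['clf'].classes_
cols = ['prediction_%s' % c for c in classes]
y_pred = pd.DataFrame(grid_rf.predict_proba(X_te), columns=cols)

# to_numpy() sidesteps index alignment between id_te and y_pred
y_pred.insert(0, 'id', id_te.to_numpy())
print(y_pred.head())
```

Each row of the probability columns sums to 1, and calling grid_rf.predict on the same rows returns the label with the highest probability.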
