scikit-learn: does the column order of the fit and predict inputs matter?

I'm just getting started with this library and am having some issues with RandomForestClassifier (I've read the docs but didn't get clarity).

My question is pretty simple. Say I have a training data set like

ABC

1 2 3

Where A is the dependent variable (y) and B and C are the independent variables (x). Let's say the test set looks the same, except that the column order is

BAC

1 2 3

When I call forest.fit(train_data[0:,1:], train_data[0:,0]), do I then need to reorder the test set to match this order before running the prediction? (Ignore the fact that I need to remove the already-known y value (A); let's just say B and C are out of order.)

Yes, you need to reorder them. Imagine a simpler case, linear regression. The algorithm calculates a weight for each feature, so if, for example, feature 1 is unimportant, it gets a weight close to 0.

If the order is different at prediction time, an important feature will be multiplied by this near-zero weight, and the prediction will be totally off.
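To make the linear-regression argument concrete, here is a minimal sketch in plain Python. The weights are hand-picked for illustration (an assumption, not the output of a fitted model), and `predict` is a hypothetical helper that only computes the weighted sum a linear model would compute at prediction time.

```python
# Hypothetical, hand-picked weights for a linear model with two features.
# Column order used at training time: [feature_1, feature_2].
weights = [0.01, 5.0]   # feature_1 is nearly irrelevant, feature_2 matters
intercept = 1.0

def predict(row):
    """Linear prediction: intercept + sum(weight_i * value_i), purely positional."""
    return intercept + sum(w * v for w, v in zip(weights, row))

correct_order = [100.0, 2.0]   # [feature_1, feature_2], same order as training
swapped_order = [2.0, 100.0]   # [feature_2, feature_1], columns swapped

good = predict(correct_order)  # 1 + 0.01 * 100 + 5 * 2
bad = predict(swapped_order)   # 1 + 0.01 * 2 + 5 * 100
print(good, bad)
```

The model never sees column names here, only positions, so the swapped row is scored as if the important value belonged to the unimportant feature.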

elyase is correct. scikit-learn will simply take the data in whatever order you give it. Hence, you'll have to ensure that the data is in the same order during training and prediction time.

Here's a simple illustrating example:

Training time:

import pandas as pd
from sklearn.tree import DecisionTreeClassifier

model = DecisionTreeClassifier()
x = pd.DataFrame({
    'feature_1': [0, 0, 1, 1],
    'feature_2': [0, 1, 0, 1]
})
y = [0, 0, 1, 1]
model.fit(x, y) 
# we now have a model that 
# (i)  predicts 0 when x = [0, 0] or [0, 1], and 
# (ii) predicts 1 when x = [1, 0] or [1, 1]

Prediction time:

# positive example
http_request_payload = {
    'feature_1': 0,
    'feature_2': 1
}
input_features = pd.DataFrame([http_request_payload])
model.predict(input_features) # this returns 0, as expected


# negative example
http_request_payload = {
    'feature_2': 1,    # notice that the order is jumbled up
    'feature_1': 0
}
input_features = pd.DataFrame([http_request_payload])
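model.predict(input_features) # older scikit-learn versions return 1 here, when it should have been 0.
# those versions don't care about the key-value mapping of the features;
# they simply vectorize the dataframe in whatever order it comes in.
# scikit-learn 1.2 and later raise a ValueError here instead, because the column names
# are not in the same order as during fit. plain lists and NumPy arrays are still not checked.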
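The root cause can be shown without scikit-learn at all. The sketch below uses only plain Python, and the variable names are illustrative. Two dicts with the same key-value pairs compare equal, yet turning each into a positional list of values (roughly what happens when the data is converted to an array) gives two different vectors.

```python
payload_a = {'feature_1': 0, 'feature_2': 1}
payload_b = {'feature_2': 1, 'feature_1': 0}   # same content, different key order

same_mapping = payload_a == payload_b   # dict equality ignores insertion order
row_a = list(payload_a.values())        # values in insertion order of payload_a
row_b = list(payload_b.values())        # values in insertion order of payload_b
print(same_mapping, row_a, row_b)
```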

This is how I cache the column order during training so that I can use it during prediction time.

# training
x = pd.DataFrame([...])
column_order = x.columns
model = SomeModel().fit(x, y) # train model

# save the things that we need at prediction time. you can also use pickle if you don't want to pip install joblib
import joblib  

joblib.dump(model, 'my_model.joblib') 
joblib.dump(column_order, 'column_order.txt') 

# load the artifacts from disk
model = joblib.load('my_model.joblib') # same file name that was used in joblib.dump above
column_order = joblib.load('column_order.txt') 

# imaginary http request payload
request_payload = { 'feature_1': ..., 'feature_2': ... }

# build a one-row dataframe from the payload, then force the training column order (using column_order)
# columns that are missing from the payload are created and filled with NaN
input_features = pd.DataFrame([request_payload]).reindex(columns=column_order)
input_features = input_features.fillna(0) # handle any missing data however you like

model.predict(input_features.values.tolist())
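If pandas is not available at prediction time, the same idea can be sketched with the standard library alone. This is only an illustration under stated assumptions: `to_row` is a hypothetical helper, the feature names are the ones from the example above, the column order is stored as JSON text rather than with joblib, and the default of 0 for missing keys mirrors the fillna(0) call.

```python
import json

def to_row(payload, column_order, default=0):
    """Return the payload's values in training order; missing keys get `default`."""
    return [payload.get(name, default) for name in column_order]

# At training time: persist the column order (as JSON text, by assumption).
column_order = ['feature_1', 'feature_2']
saved = json.dumps(column_order)

# At prediction time: restore the order and align the incoming payload to it.
restored_order = json.loads(saved)
row = to_row({'feature_2': 1, 'feature_1': 0}, restored_order)   # jumbled keys
partial = to_row({'feature_1': 1}, restored_order)               # feature_2 missing
print(row, partial)
```

Wrapping the result as `[row]` gives a single-sample 2-D input of the kind that predict accepts.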
