scikit-learn - Predicting new input with a trained model

Question

I have a dataset that looks like this:

| "Consignor Code" | "Consignee Code" | "Origin" | "Destination" | "Carrier Code" | 
|------------------|------------------|----------|---------------|----------------| 
| "6402106844"     | "66903717"       | "DKCPH"  | "CNPVG"       | "6402746387"   | 
| "6402106844"     | "66903717"       | "DKCPH"  | "CNPVG"       | "6402746387"   | 
| "6402106844"     | "6404814143"     | "DKCPH"  | "CNPVG"       | "6402746387"   | 
| "6402107662"     | "66974631"       | "DKCPH"  | "VNSGN"       | "6402746393"   | 
| "6402107662"     | "6404518090"     | "DKCPH"  | "THBKK"       | "6402746393"   | 
| "6402107662"     | "6404518090"     | "DKBLL"  | "THBKK"       | "6402746393"   | 
| "6408507648"     | "6403601344"     | "DKCPH"  | "USTPA"       | "66565231"     |

I am trying to build my first ML model on it, using scikit-learn. This is my code:

#Import the dependencies
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import make_scorer, accuracy_score
from sklearn.model_selection import cross_val_score, train_test_split
import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23
from sklearn import preprocessing
import pandas as pd

#Import the dataset (A CSV file)
dataset = pd.read_csv('shipments.csv', header=0, skip_blank_lines=True)
#Drop any rows containing NaN values
dataset.dropna(subset=['Consignor Code', 'Consignee Code',
                       'Origin', 'Destination', 'Carrier Code'], inplace=True)

#Convert the numeric-only code columns to integers
dataset['Consignor Code'] = dataset['Consignor Code'].astype('int64')
dataset['Consignee Code'] = dataset['Consignee Code'].astype('int64')
dataset['Carrier Code'] = dataset['Carrier Code'].astype('int64')

#Define our target (What we want to be able to predict)
target = dataset.pop('Destination')

#Convert all our data to numeric values, so we can use the .fit function.
#For that, we use LabelEncoder
le = preprocessing.LabelEncoder()
target = le.fit_transform(list(target))
dataset['Origin'] = le.fit_transform(list(dataset['Origin']))
dataset['Consignor Code'] = le.fit_transform(list(dataset['Consignor Code']))
dataset['Consignee Code'] = le.fit_transform(list(dataset['Consignee Code']))
dataset['Carrier Code'] = le.fit_transform(list(dataset['Carrier Code']))

#Prepare the dataset.
X_train, X_test, y_train, y_test = train_test_split(
    dataset, target, test_size=0.3, random_state=0)


#Prepare the model and .fit it.
model = RandomForestClassifier()
model.fit(X_train, y_train)

#Make a prediction on the test set.
predictions = model.predict(X_test)

#Print the accuracy score.
print("Accuracy score: {}".format(accuracy_score(y_test, predictions)))

The code above returns:

Accuracy score: 0.7172413793103448

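My question may be a silly one, but how do I use my model to actually show me its prediction for new data?

Consider the new input below, for which I want it to predict the Destination: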

"6408507648","6403601344","DKCPH","","66565231"

How do I query my model with this data and get the predicted Destination?

Answer 1

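Here is a complete working example that includes the prediction. The most important part is to define a separate label encoder for each feature, so that you can encode new data with the same mapping that was used for training. Otherwise you will run into errors (they may not show up right away, but you will notice them when computing the accuracy):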

import numpy as np
import pandas as pd
from sklearn import preprocessing
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

#Build the dataset inline (the same rows as the table in the question)
dataset = pd.DataFrame({'Consignor Code':["6402106844","6402106844","6402106844","6402107662","6402107662","6402107662","6408507648"],
                   'Consignee Code': ["66903717","66903717","6404814143","66974631","6404518090","6404518090","6403601344"],
                   'Origin':["DKCPH","DKCPH","DKCPH","DKCPH","DKCPH","DKBLL","DKCPH"],
                   'Destination':["CNPVG","CNPVG","CNPVG","VNSGN","THBKK","THBKK","USTPA"],
                   'Carrier Code':["6402746387","6402746387","6402746387","6402746393","6402746393","6402746393","66565231"]})

#Drop any rows containing NaN values
dataset.dropna(subset=['Consignor Code', 'Consignee Code',
                       'Origin', 'Destination', 'Carrier Code'], inplace=True)


#Define our target (What we want to be able to predict)
target = dataset.pop('Destination')

#Convert all our data to numeric values, so we can use the .fit function.
#For that, we use LabelEncoder
le_origin = preprocessing.LabelEncoder()
le_consignor = preprocessing.LabelEncoder()
le_consignee = preprocessing.LabelEncoder()
le_carrier = preprocessing.LabelEncoder()
le_target = preprocessing.LabelEncoder()
target = le_target.fit_transform(list(target))
dataset['Origin'] = le_origin.fit_transform(list(dataset['Origin']))
dataset['Consignor Code'] = le_consignor.fit_transform(list(dataset['Consignor Code']))
dataset['Consignee Code'] = le_consignee.fit_transform(list(dataset['Consignee Code']))
dataset['Carrier Code'] = le_carrier.fit_transform(list(dataset['Carrier Code']))

#Prepare the dataset.
X_train, X_test, y_train, y_test = train_test_split(
    dataset, target, test_size=0.3, random_state=42)


#Prepare the model and .fit it.
model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)

#Make a prediction on the test set.
predictions = model.predict(X_test)

#Print the accuracy score.
print("Accuracy score: {}".format(accuracy_score(y_test, predictions)))

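#Encode the new input with the fitted per-feature encoders.
#The order must match the training columns: consignor, consignee, origin, carrier.
new_input = ["6408507648","6403601344","DKCPH","66565231"]
fitted_new_input = np.array([le_consignor.transform([new_input[0]])[0],
                                le_consignee.transform([new_input[1]])[0],
                                le_origin.transform([new_input[2]])[0],
                                le_carrier.transform([new_input[3]])[0]])
#The model expects a 2D array: one row per sample.
new_predictions = model.predict(fitted_new_input.reshape(1,-1))

#Map the predicted integer label back to the original destination code.
print(le_target.inverse_transform(new_predictions))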

Finally, your model predicts:

['THBKK']
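
To see why the question's single shared `le` object cannot be reused for new data, here is a minimal sketch on two made-up toy lists (`origins` and `carriers` are illustrative values taken from the table, not the full dataset). Each call to `fit_transform` replaces the encoder's mapping, so a shared encoder only remembers the last column it saw:

```python
from sklearn.preprocessing import LabelEncoder

# Toy columns for illustration only.
origins = ["DKCPH", "DKCPH", "DKBLL"]
carriers = ["6402746387", "6402746393", "66565231"]

# One shared encoder: each fit_transform call overwrites the previous mapping.
shared = LabelEncoder()
shared.fit_transform(origins)
shared.fit_transform(carriers)
shared_classes = list(shared.classes_)  # only the carrier codes remain

try:
    shared.transform(["DKCPH"])
    shared_failed = False
except ValueError:
    # "DKCPH" is unknown here because the encoder was last fitted on carriers.
    shared_failed = True

# One encoder per column: each mapping survives and can be applied to new rows.
le_origin = LabelEncoder().fit(origins)
le_carrier = LabelEncoder().fit(carriers)
origin_code = le_origin.transform(["DKCPH"])[0]
carrier_code = le_carrier.transform(["66565231"])[0]

print(shared_classes, shared_failed, origin_code, carrier_code)
```

With per-column encoders, the integer codes used at prediction time are guaranteed to be the ones the model saw during training.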

Answer 2

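Here is something quick to illustrate the point. I wouldn't do it this way in practice, and it may contain some errors. For example, I think it will fail if the test set contains classes that were not seen during training.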

#Prepare the dataset.
X_train, X_test, y_train, y_test = train_test_split(
    dataset, target, test_size=0.3, random_state=0)

#Convert all our data to numeric values, so we can use the .fit function.
#For that, we use LabelEncoder
le_target = preprocessing.LabelEncoder()
y_train = le_target.fit_transform(y_train)
y_test = le_target.transform(y_test)

# Now create a separate encoder for each of your features:
encoders = {}
for feature in ["Origin", "Consignor Code", "Consignee Code", "Carrier Code"]:
    # NOTE: The LabelEncoder docs state clearly at the start that you shouldn't
    # be using it on your inputs. I'm not going to get into that here, but just
    # be aware that it's not a good encoding.
    encoders[feature] = preprocessing.LabelEncoder()
    X_train[feature] = encoders[feature].fit_transform(X_train[feature])
    X_test[feature] = encoders[feature].transform(X_test[feature])

#Prepare the model and .fit it.
model = RandomForestClassifier()
model.fit(X_train, y_train)

#Make a prediction on the test set.
predictions = model.predict(X_test)

le_target.inverse_transform(predictions)

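The key concept here is to use a separate encoder for each of your features, because these encoder objects remember how that feature was encoded. That happens during the fit stage. You then need to call transform on any new data so that it is encoded consistently.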
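
As a possible alternative to managing the encoders by hand, the sketch below swaps in `OrdinalEncoder` (which is intended for feature columns) inside a `Pipeline`, so the same fitted encoding is applied automatically at prediction time. This is not the method used in either answer. It assumes scikit-learn 0.24 or later for the `handle_unknown="use_encoded_value"` option, and the second row of `new_rows` uses a made-up consignor code to show that unseen categories no longer raise an error:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder

features = ["Consignor Code", "Consignee Code", "Origin", "Carrier Code"]

# The same rows as the table in the question, kept as strings.
data = pd.DataFrame({
    "Consignor Code": ["6402106844", "6402106844", "6402106844", "6402107662",
                       "6402107662", "6402107662", "6408507648"],
    "Consignee Code": ["66903717", "66903717", "6404814143", "66974631",
                       "6404518090", "6404518090", "6403601344"],
    "Origin": ["DKCPH", "DKCPH", "DKCPH", "DKCPH", "DKCPH", "DKBLL", "DKCPH"],
    "Destination": ["CNPVG", "CNPVG", "CNPVG", "VNSGN", "THBKK", "THBKK", "USTPA"],
    "Carrier Code": ["6402746387", "6402746387", "6402746387", "6402746393",
                     "6402746393", "6402746393", "66565231"],
})
X = data[features]
y = data["Destination"]

# The encoder is fitted together with the model; categories not seen during
# fit are mapped to -1 instead of raising a ValueError.
pipeline = make_pipeline(
    OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1),
    RandomForestClassifier(random_state=0),
)
pipeline.fit(X, y)

# New inputs use the raw string values; no manual encoding is needed.
new_rows = pd.DataFrame(
    [["6408507648", "6403601344", "DKCPH", "66565231"],
     ["0000000000", "6403601344", "DKCPH", "66565231"]],  # hypothetical unseen consignor
    columns=features,
)
predicted = pipeline.predict(new_rows)
print(predicted)
```

Because the classifier is fitted on the string labels directly, `predicted` already contains destination codes and no `inverse_transform` step is required.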

2 answers

Answer 1: 2, accepted, 2020-02-11 14:49:46
Answer 2: 1, 2020-02-11 14:38:20
