Ensemble of machine learning models in scikit-learn
group feature_1 feature_2 year dependent_variable
group_a 12 19 2010 0.4
group_a 11 13 2011 0.9
group_a 10 5 2012 1.2
group_a 16 9 2013 3.2
group_b 8 29 2010 0.6
group_b 9 33 2011 0.1
group_b 111 15 2012 2.1
group_b 16 19 2013 12.2
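In the dataframe above, I want to use feature_1 and feature_2 to predict dependent_variable. To do this, I want to build two models. In the first, I want to fit a separate model for each group. In the second, I want to use all available data. In both cases, data from 2010 to 2012 will be used for training and 2013 for testing.
How can I build an ensemble model from these two models? The data is a toy dataset, but the real dataset will have many more groups, years, and features. In particular, I am interested in an approach that works with scikit-learn compatible models.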
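Creating the ensemble model takes several steps.
First, create the two models separately. For the first model, split the data by group, train two separate models, and then join the two models in one function. For the second model, the data can be kept whole (apart from removing the test data). Then create another function that joins the two models into an ensemble model.
To demonstrate, I will first import the necessary modules and load the dataframe: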
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
data_str = """group,feature_1,feature_2,year,dependent_variable
group_a,12,19,2010,0.4
group_a,11,13,2011,0.9
group_a,10,5,2012,1.2
group_a,16,9,2013,3.2
group_b,8,29,2010,0.6
group_b,9,33,2011,0.1
group_b,111,15,2012,2.1
group_b,16,19,2013,12.2"""
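data_list = [row.split(",") for row in data_str.split("\n")]
data = pd.DataFrame(data_list[1:], columns=data_list[0])
# Every value was parsed as a string; convert the numeric columns.
num_cols = ["feature_1", "feature_2", "year", "dependent_variable"]
data[num_cols] = data[num_cols].apply(pd.to_numeric)
train = data.loc[data["year"] != 2013]
test = data.loc[data["year"] == 2013]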
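This uses RandomForestRegressor, but any regression model can be used. Also note that the dataframe used here differs from the one given: its rows are indexed from 0 rather than by group, and group is a column of the dataframe.
Building the first model:
The first two steps are done as follows: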
# Splitting Data
train_a = train.loc[train["group"] == "group_a"]
train_b = train.loc[train["group"] == "group_b"]
test_a = test.loc[test["group"] == "group_a"]
test_b = test.loc[test["group"] == "group_b"]
# Training Two Models
model_a = RandomForestRegressor()
model_a.fit(train_a.drop(["dependent_variable", "year", "group"], axis = "columns"), train_a.dependent_variable)
model_b = RandomForestRegressor()
model_b.fit(train_b.drop(["dependent_variable", "year", "group"], axis = "columns"), train_b.dependent_variable)
Their predict methods can then be combined:
def individual_predictor(group, feature_1, feature_2):
    if group == "group_a":
        return model_a.predict([[feature_1, feature_2]])[0]
    elif group == "group_b":
        return model_b.predict([[feature_1, feature_2]])[0]
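This takes a group and the two features and returns a prediction. It can be adapted to whatever input and output types are needed.
To create the second model, keep the data whole and train a single model, which also removes the need to join models: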
model = RandomForestRegressor()
model.fit(train.drop(["dependent_variable", "year", "group"], axis = "columns"), train.dependent_variable)
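Finally, you can join the models into an ensemble by averaging the results of the predict methods:
def ensemble_predict(group, feature_1, feature_2):
    return (individual_predictor(group, feature_1, feature_2) + model.predict([[feature_1, feature_2]])[0]) / 2
Again, this takes a group and two features and returns the result. It may need adapting to another format, such as taking a list of inputs and returning a list of predictions.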
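One way to get a batch-oriented, scikit-learn compatible version of the averaging described above is to wrap both models in a single estimator. The following is a minimal sketch, not part of the original answer: the class name GroupAverageEnsemble and its parameters base_estimator and group_col are hypothetical, and it assumes X is a pandas DataFrame with a unique index that contains the group column.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, RegressorMixin, clone
from sklearn.ensemble import RandomForestRegressor


class GroupAverageEnsemble(RegressorMixin, BaseEstimator):
    """Hypothetical helper: averages per-group models with one pooled model."""

    def __init__(self, base_estimator=None, group_col="group"):
        self.base_estimator = base_estimator
        self.group_col = group_col

    def fit(self, X, y):
        base = self.base_estimator
        if base is None:
            base = RandomForestRegressor(random_state=0)
        # Align y with X by position so label-based selection works below.
        y = pd.Series(np.asarray(y), index=X.index)
        features = X.drop(columns=[self.group_col])
        # Model 2: a single model fitted on all rows.
        self.pooled_model_ = clone(base).fit(features, y)
        # Model 1: one model per group.
        self.group_models_ = {
            g: clone(base).fit(features.loc[idx], y.loc[idx])
            for g, idx in X.groupby(self.group_col).groups.items()
        }
        return self

    def predict(self, X):
        features = X.drop(columns=[self.group_col])
        pooled = self.pooled_model_.predict(features)
        per_group = np.empty(len(X))
        # .indices maps each group to the integer positions of its rows.
        for g, pos in X.groupby(self.group_col).indices.items():
            per_group[pos] = self.group_models_[g].predict(features.iloc[pos])
        # Simple unweighted average of the two models.
        return (per_group + pooled) / 2


# Usage on the toy data from the question.
data = pd.DataFrame({
    "group": ["group_a"] * 4 + ["group_b"] * 4,
    "feature_1": [12, 11, 10, 16, 8, 9, 111, 16],
    "feature_2": [19, 13, 5, 9, 29, 33, 15, 19],
    "year": [2010, 2011, 2012, 2013] * 2,
    "dependent_variable": [0.4, 0.9, 1.2, 3.2, 0.6, 0.1, 2.1, 12.2],
})
cols = ["group", "feature_1", "feature_2"]
train, test = data[data.year != 2013], data[data.year == 2013]
ens = GroupAverageEnsemble().fit(train[cols], train["dependent_variable"])
pred = ens.predict(test[cols])
print(pred)
```

A group that was not seen during fit raises a KeyError in predict in this sketch; falling back to the pooled model alone would be one possible way to handle that case.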
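This one uses two regressors, RandomForestRegressor and GradientBoostingRegressor.
I added more 2013 rows for the r2_score computation, which needs more than one test sample. I also added rows for the other years. Copy the text and save it to a txt file.
First we process the data file and separate training and test data with dataframe operations. Then, for each regressor, we create models 1.1 and 1.2 for groups "a" and "b" respectively, and then model 2 on all the data. After a model is created, we save it to disk for later processing.
After creating each model, we predict on all the test data and on a single row. The r2_score and MAE metrics are also printed.
The last part tests a saved model by loading the model file and letting it predict on test input. Predictions from the model in memory and the model on disk should be the same. There are also sample input types and how to use them with the custom prediction function.
See also the docstrings and comments in the code for how it works.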
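The claim that the in-memory model and the reloaded model predict the same values can be checked directly. The following is a minimal sketch, assuming a pickle round trip through bytes rather than a file on disk; the training rows are the 2010 to 2012 group_a rows of this answer's data file.

```python
import pickle

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# group_a rows for 2010-2012: [feature_1, feature_2] and dependent_variable.
X = np.array([[12, 19], [7, 15], [11, 13], [8, 8], [10, 5], [11, 9]])
y = np.array([0.4, 1.5, 0.9, 2.1, 1.2, 2.6])

model = GradientBoostingRegressor(random_state=0).fit(X, y)

# Serialize to bytes and restore, mimicking pickle.dump / pickle.load on a file.
restored = pickle.loads(pickle.dumps(model))

# group_a rows for 2013.
X_test = np.array([[16, 9], [8, 10]])
same = np.allclose(model.predict(X_test), restored.predict(X_test))
print(same)
```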
data.txt
group feature_1 feature_2 year dependent_variable
group_a 12 19 2010 0.4
group_a 7 15 2010 1.5
group_a 11 13 2011 0.9
group_a 8 8 2011 2.1
group_a 10 5 2012 1.2
group_a 11 9 2012 2.6
group_a 16 9 2013 3.2
group_a 8 10 2013 2.6
group_b 8 29 2010 0.6
group_b 11 18 2010 1.5
group_b 9 33 2011 0.1
group_b 20 15 2011 2.8
group_b 111 15 2012 2.1
group_b 99 10 2012 3.6
group_b 16 19 2013 12.2
group_b 4 8 2013 5.1
myensemble.py
"""sklearn ensemble modeling.

Dependencies:
    * sklearn
    * pandas
    * numpy

References:
    * https://scikit-learn.org/stable/modules/classes.html?highlight=ensemble#module-sklearn.ensemble
    * https://pandas.pydata.org/docs/user_guide/indexing.html
"""
from typing import List, Union, Optional
import pickle # for saving file to disk
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import r2_score
from sklearn.metrics import mean_absolute_error
import pandas as pd
import numpy as np
def make_model(regressor, regname: str, modelfn: str, dfX: pd.DataFrame, dfy: pd.Series):
    """Creates a model and saves it to disk.

    Args:
        regressor: Can be RandomForestRegressor or GradientBoostingRegressor.
        regname: Regressor name.
        modelfn: The filename where model will be saved to disk.
        dfX: The features in pandas dataframe.
        dfy: The target in pandas series.

    Returns:
        Model
    """
    X = dfX.to_numpy()
    y = dfy.to_numpy()

    model = regressor(random_state=0)
    model.fit(X, y)

    # Save model.
    with open(f'{regname}_{modelfn}', 'wb') as f:
        pickle.dump(model, f)

    return model
def get_prediction(model, test: Union[List, pd.DataFrame, np.ndarray]) -> Optional[np.ndarray]:
    """Returns prediction based on model and test input or None.

    A list or 1-D array is treated as a single sample.
    """
    if isinstance(test, (list, np.ndarray)):
        return model.predict([test])
    if isinstance(test, pd.DataFrame):
        return model.predict(np.array(test))
    return None
def model_and_prediction(df: pd.DataFrame, regressor, regname: str, modelfn: str):
    """Build model and show prediction and metrics.

    To build a model we need a training data X with features
    and data y with target or dependent values.

    Args:
        df: A dataframe.
        regressor: Can be RandomForestRegressor or GradientBoostingRegressor.
        regname: The regressor name.
        modelfn: The filename where model will be saved to disk.

    Returns:
        None
    """
    features = ['feature_1', 'feature_2']

    # 1. Get the train dataframe
    train = df.loc[df.year != 2013]  # exclude 2013 in training data
    train_feature = train[features]  # select the features column
    train_target = train.dependent_variable  # select the dependent column

    model = make_model(regressor, regname, modelfn, train_feature, train_target)

    # 2. Get the test dataframe
    test = df.loc[df.year == 2013]  # only include 2013 in test data
    test_feature = test[features]
    test_target = test.dependent_variable

    # 3. Get the prediction from all rows in test feature. See step 5
    # for single data prediction.
    prediction: np.ndarray = model.predict(np.array(test_feature))
    print(f'test feature:\n{np.array(test_feature)}')
    print(f'test prediction: {prediction}')  # prediction[0] ...
    print(f'test target: {np.array(test_target)}')

    # 4. metrics
    print(f'r2_score: {r2_score(test_target, prediction)}')
    print(f'mean_absolute_error: {mean_absolute_error(test_target, prediction)}\n')

    # 5. Get prediction from the first row of test features.
    prediction_1: np.ndarray = model.predict(np.array(test_feature.iloc[[0]]))
    print(f'1st row test:\n{test_feature.iloc[[0]]}')
    print(f'1st row test prediction array: {prediction_1}')
    print(f'1st row test prediction value: {prediction_1[0]}\n')  # get the element value
def main():
    datafn = 'data.txt'
    df = pd.read_fwf(datafn)
    print(df.to_string(index=False))

    # A. Create models for each type of regressor.
    regressors = [(RandomForestRegressor, 'RandomForrest'),
                  (GradientBoostingRegressor, 'GradientBoosting')]

    for (r, name) in regressors:
        print(f'::: Regressor: {name} :::\n')

        # Model 1 using group_a
        print(':: MODEL 1.1 ::')
        grp = 'group_a'
        modelfn = f'{grp}.pkl'  # filename of model to be save to disk
        dfa = df.loc[df.group == grp]  # select group
        model_and_prediction(dfa, r, name, modelfn)

        # Model 1 using group_b
        print(':: MODEL 1.2 ::')
        grp = 'group_b'
        modelfn = f'{grp}.pkl'
        dfb = df.loc[df.group == grp]
        model_and_prediction(dfb, r, name, modelfn)

        # Model 2 using group a and b
        print(':: MODEL 2 ::')
        grp = 'group_ab'
        modelfn = f'{grp}.pkl'
        dfab = df.loc[(df.group == 'group_a') | (df.group == 'group_b')]
        model_and_prediction(dfab, r, name, modelfn)

    # B. Test saved model file prediction.
    print('::: Prediction from loaded model :::')
    mfn = 'GradientBoosting_group_ab.pkl'
    print(f'model: gradient boosting model 2, {mfn}')
    with open(mfn, 'rb') as f:
        loaded_model = pickle.load(f)

    # test: group_b 4 8 2013 5.1
    test = [4, 8]
    prediction = loaded_model.predict([test])
    print(f'test: {test}')
    print(f'prediction: {prediction[0]}\n')

    # C. Use get_prediction().
    # input from list
    test = [4, 8]
    prediction = get_prediction(loaded_model, test)
    print(f'test from list input:\n{test}')
    print(f'prediction from get_prediction() with list input: {prediction}\n')

    # input from dataframe
    testdata = {
        'feature_1': [8, 12],
        'feature_2': [19, 15],
    }
    testdf = pd.DataFrame(testdata)

    testrow = testdf.iloc[[0]]  # first row [8, 19]
    prediction = get_prediction(loaded_model, testrow)
    print(f'test from df input:\n{testrow}')
    print(f'prediction from get_prediction() with df input: {prediction}\n')

    testrow = testdf.iloc[[1]]  # second row [12, 15]
    prediction = get_prediction(loaded_model, testrow)
    print(f'test from df input:\n{testrow}')
    print(f'prediction from get_prediction() with df input: {prediction}\n')

    # input from numpy
    test = [8, 9]
    testnp = np.array(test)
    prediction = get_prediction(loaded_model, testnp)
    print(f'test from numpy input:\n{testnp}')
    print(f'prediction from get_prediction() with numpy input: {prediction}\n')


if __name__ == '__main__':
    main()
Output
group feature_1 feature_2 year dependent_variable
group_a 12 19 2010 0.4
group_a 7 15 2010 1.5
group_a 11 13 2011 0.9
group_a 8 8 2011 2.1
group_a 10 5 2012 1.2
group_a 11 9 2012 2.6
group_a 16 9 2013 3.2
group_a 8 10 2013 2.6
group_b 8 29 2010 0.6
group_b 11 18 2010 1.5
group_b 9 33 2011 0.1
group_b 20 15 2011 2.8
group_b 111 15 2012 2.1
group_b 99 10 2012 3.6
group_b 16 19 2013 12.2
group_b 4 8 2013 5.1
::: Regressor: RandomForrest :::
:: MODEL 1.1 ::
test feature:
[[16 9]
[ 8 10]]
test prediction: [1.811 2.186]
test target: [3.2 2.6]
r2_score: -10.67065000000004
mean_absolute_error: 0.9015000000000026
1st row test:
feature_1 feature_2
6 16 9
1st row test prediction array: [1.811]
1st row test prediction value: 1.8109999999999986
:: MODEL 1.2 ::
test feature:
[[16 19]
[ 4 8]]
test prediction: [2.116 2.408]
test target: [12.2 5.1]
r2_score: -3.3219170799444546
mean_absolute_error: 6.388
1st row test:
feature_1 feature_2
14 16 19
1st row test prediction array: [2.116]
1st row test prediction value: 2.116000000000001
:: MODEL 2 ::
test feature:
[[16 9]
[ 8 10]
[16 19]
[ 4 8]]
test prediction: [2.425 2.145 1.01 1.958]
test target: [ 3.2 2.6 12.2 5.1]
r2_score: -1.3250936994738867
mean_absolute_error: 3.8905000000000016
1st row test:
feature_1 feature_2
6 16 9
1st row test prediction array: [2.425]
1st row test prediction value: 2.4249999999999985
::: Regressor: GradientBoosting :::
:: MODEL 1.1 ::
test feature:
[[16 9]
[ 8 10]]
test prediction: [2.59996945 2.21271005]
test target: [3.2 2.6]
r2_score: -1.8335008778823685
mean_absolute_error: 0.4936602458577084
1st row test:
feature_1 feature_2
6 16 9
1st row test prediction array: [2.59996945]
1st row test prediction value: 2.59996945439128
:: MODEL 1.2 ::
test feature:
[[16 19]
[ 4 8]]
test prediction: [1.99807124 2.63511811]
test target: [12.2 5.1]
r2_score: -3.3703627491779713
mean_absolute_error: 6.333405322236132
1st row test:
feature_1 feature_2
14 16 19
1st row test prediction array: [1.99807124]
1st row test prediction value: 1.9980712422931164
:: MODEL 2 ::
test feature:
[[16 9]
[ 8 10]
[16 19]
[ 4 8]]
test prediction: [3.60257456 2.26208935 0.402739 2.10950224]
test target: [ 3.2 2.6 12.2 5.1]
r2_score: -1.538939968014979
mean_absolute_error: 3.882060991360607
1st row test:
feature_1 feature_2
6 16 9
1st row test prediction array: [3.60257456]
1st row test prediction value: 3.6025745572622014
::: Prediction from loaded model :::
model: gradient boosting model 2, GradientBoosting_group_ab.pkl
test: [4, 8]
prediction: 2.1095022367629728
test from list input:
[4, 8]
prediction from get_prediction() with list input: [2.10950224]
test from df input:
feature_1 feature_2
0 8 19
prediction from get_prediction() with df input: [0.50307204]
test from df input:
feature_1 feature_2
1 12 15
prediction from get_prediction() with df input: [1.46058714]
test from numpy input:
[8 9]
prediction from get_prediction() with numpy input: [2.30007317]
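First, create models using time-series algorithms (using only the date variable and the dependent variable), fbprophet (using features + date + dependent variable), and tree-based regression algorithms such as CatBoost/XGBoost/LightGBM (using features + date + dependent variable).
Create a model for each group with each of the algorithms mentioned (a bottom-up approach). Different models will perform well for different groups. Take a weighted average based on model performance. Say the group_a predictions are best with CatBoost, fbprophet, and an exponential moving average; then use weights proportional to the accuracy obtained from those models.
You can aggregate the results of the group-level models to get an aggregate result. You can also create a separate model on the aggregated data (summarized by year).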
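The performance-weighted average described above could look like the following minimal sketch. It assumes that "accuracy" is measured as a validation error (for example MAE) and that weights are taken proportional to the inverse of that error, which is only one possible choice. The function names and all numbers are hypothetical.

```python
import numpy as np


def inverse_error_weights(errors):
    """Weights proportional to 1 / error, normalized to sum to 1."""
    inv = 1.0 / np.asarray(errors, dtype=float)
    return inv / inv.sum()


def weighted_prediction(predictions, weights):
    """Weighted average over models; predictions has one row per model."""
    return np.average(np.asarray(predictions, dtype=float), axis=0, weights=weights)


# Hypothetical validation MAEs of three models fitted on one group.
val_mae = [0.5, 1.0, 2.0]
weights = inverse_error_weights(val_mae)

# Hypothetical predictions: one row per model, one column per test row.
preds = [[3.0, 2.0],
         [4.0, 3.0],
         [5.0, 1.0]]
combined = weighted_prediction(preds, weights)
print(weights, combined)
```

The weights should be computed on validation data held out from training, separately for each group, so that a model that overfits one group does not dominate its average.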
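If I understood the last line of your question correctly, in the context of the years, you want model 1 to capture the trend within a given calendar year and model 2 to capture the trend across years. Model 2 may be a problem, since you mentioned scikit-learn compatible models.
So I will try to explain the approach I would take.
Model 1 is quite simple. It is a regression problem, so choosing the best regression model should not be a problem. You can find it by looking at the results for a given calendar year.
Model 2 is where you want to capture the time-series character, something like YoY. Although no model in scikit-learn captures the time dimension directly the way ARIMA or an RNN does, there are ways to forecast with scikit-learn models. Many of them rely on feature engineering. You can take features 1 and 2, sort them, shift them, and then difference them to create new features, say 1a and 2a, which can then be used with any regression model. These new features capture the temporal nature. I could write a very long post here, but I think you will find this link explains it better.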
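The sort, shift, and difference steps can be sketched with pandas as follows. This is a minimal illustration on the question's toy data, not the answerer's code: the helper add_lag_features and the column suffixes _lag1 and _diff1 are hypothetical names, and a lag of one year is an assumption.

```python
import pandas as pd

df = pd.DataFrame({
    "group": ["group_a"] * 4 + ["group_b"] * 4,
    "year": [2010, 2011, 2012, 2013] * 2,
    "feature_1": [12, 11, 10, 16, 8, 9, 111, 16],
    "feature_2": [19, 13, 5, 9, 29, 33, 15, 19],
})


def add_lag_features(frame, cols, group_col="group", time_col="year"):
    """Add previous-year value and year-over-year change for each column."""
    # Sort so that shift/diff follow time order within each group.
    out = frame.sort_values([group_col, time_col]).copy()
    for col in cols:
        by_group = out.groupby(group_col)[col]
        out[f"{col}_lag1"] = by_group.shift(1)   # value from the previous year
        out[f"{col}_diff1"] = by_group.diff()    # change from the previous year
    return out


lagged = add_lag_features(df, ["feature_1", "feature_2"])
print(lagged)
```

The first row of each group has no previous year, so its new columns are NaN. Those rows need to be dropped or imputed before fitting estimators that do not accept missing values.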
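Now combine the two models. Since this is a regression problem, I think the best approach is to assign weights to the outputs of the two models, say alpha for model 1 and beta for model 2. Treat alpha and beta as hyperparameters and tune them using the data.
This should blend in well with scikit-learn models.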
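Tuning the two weights can be sketched as a small grid search. The sketch below makes a simplifying assumption that is not in the answer: beta is fixed to 1 - alpha, so only one value is searched. The function name tune_alpha and the validation numbers are hypothetical.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error


def tune_alpha(pred_1, pred_2, y_true, grid=None):
    """Return the alpha (and its MAE) minimizing the error of the blend
    alpha * pred_1 + (1 - alpha) * pred_2 on held-out data."""
    grid = np.linspace(0.0, 1.0, 11) if grid is None else np.asarray(grid)
    pred_1 = np.asarray(pred_1, dtype=float)
    pred_2 = np.asarray(pred_2, dtype=float)
    scores = [
        mean_absolute_error(y_true, a * pred_1 + (1 - a) * pred_2)
        for a in grid
    ]
    best = int(np.argmin(scores))
    return float(grid[best]), float(scores[best])


# Hypothetical validation targets and predictions from model 1 and model 2.
y_val = [1.0, 2.0, 3.0]
p1 = [1.0, 2.0, 3.0]
p2 = [2.0, 3.0, 4.0]
alpha, mae = tune_alpha(p1, p2, y_val)
print(alpha, mae)
```

The predictions passed in should come from data that neither model was trained on, otherwise the search tends to favor whichever model overfits more.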