
Ensemble of machine learning models in scikit-learn

group      feature_1    feature_2    year    dependent_variable
group_a       12           19        2010           0.4
group_a       11           13        2011           0.9
group_a       10            5        2012           1.2
group_a       16            9        2013           3.2
group_b        8           29        2010           0.6
group_b        9           33        2011           0.1
group_b      111           15        2012           2.1
group_b       16           19        2013          12.2

In the dataframe above, I want to use feature_1 and feature_2 to predict dependent_variable. To do this, I want to construct two models: in the first, I build a separate model for each group; in the second, I use all the available data. In both cases, data from 2010 to 2012 will be used for training and 2013 will be used for testing.

How can I construct an ensemble model using the two models outlined above? The data shown is a toy dataset, but the real dataset will have many more groups, years, and features. In particular, I am interested in an approach that works with scikit-learn compatible models.

Creating an ensemble model here involves multiple steps.

Start by creating the two models individually. For the first model, split the data by group and train two individual models, then join the two models together in a function. For the second model, the data can be left whole (aside from removing the testing data). Finally, create another function that joins the two models into one ensemble model.

To demonstrate, I'll start by importing the necessary modules and loading the dataframe:

import io

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

data_str = """group,feature_1,feature_2,year,dependent_variable
group_a,12,19,2010,0.4
group_a,11,13,2011,0.9
group_a,10,5,2012,1.2
group_a,16,9,2013,3.2
group_b,8,29,2010,0.6
group_b,9,33,2011,0.1
group_b,111,15,2012,2.1
group_b,16,19,2013,12.2"""

# read_csv parses the numeric columns as numbers rather than strings
data = pd.read_csv(io.StringIO(data_str))

train = data.loc[data["year"] != 2013]
test = data.loc[data["year"] == 2013]

This example uses a RandomForestRegressor, but any regression model can be used. Note also that the dataframe used here differs from the one given in the question: its rows are indexed from 0 rather than by group, and group is instead a column within the dataframe.

To construct the first model:

  1. split the data into data for group a and for group b
  2. train two independent models
  3. join the models

The first two steps are done below:

# Splitting Data
train_a = train.loc[train["group"] == "group_a"]
train_b = train.loc[train["group"] == "group_b"]
test_a = test.loc[test["group"] == "group_a"]
test_b = test.loc[test["group"] == "group_b"]

# Training Two Models
model_a = RandomForestRegressor()
model_a.fit(train_a.drop(["dependent_variable", "year", "group"], axis="columns"), train_a.dependent_variable)
model_b = RandomForestRegressor()
model_b.fit(train_b.drop(["dependent_variable", "year", "group"], axis="columns"), train_b.dependent_variable)

Then their predict methods can be joined together in a single function:

def individual_predictor(group, feature_1, feature_2):
    if group == "group_a": return model_a.predict([[feature_1, feature_2]])[0]
    elif group == "group_b": return model_b.predict([[feature_1, feature_2]])[0]

This takes a group and two feature values individually and returns a single prediction. It can be adapted to whatever input and output types are necessary.
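If you need predictions for many rows at once, here is a minimal batched sketch under the same setup; the helper name predict_by_group is mine, not part of the original answer, and np comes from the numpy import above:

def predict_by_group(df):
    # route each row of a dataframe with "group", "feature_1" and
    # "feature_2" columns to the model trained on that row's group
    features = ["feature_1", "feature_2"]
    out = np.empty(len(df), dtype=float)
    for grp, grp_model in [("group_a", model_a), ("group_b", model_b)]:
        mask = (df["group"] == grp).to_numpy()
        if mask.any():
            out[mask] = grp_model.predict(df.loc[mask, features])
    return out

# e.g. predict_by_group(test) returns one prediction per test row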

To create the second model, leave the data whole and train only one model, which also removes the need to join models:

model = RandomForestRegressor()
model.fit(train.drop(["dependent_variable", "year", "group"], axis="columns"), train.dependent_variable)

Finally, you can join the models together into an ensemble model by averaging the results of their predict methods:

def ensemble_predict(group, feature_1, feature_2):
    return (individual_predictor(group, feature_1, feature_2) + model.predict([[feature_1, feature_2]])[0]) / 2

Again, this takes a group and two features and returns the result. It will likely need to be adapted to another format, such as taking a list of lists of inputs and outputting a list of predictions.
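As a sketch of one such adaptation, the hypothetical batched helper above can be combined with the all-data model to average predictions for every row of a dataframe (ensemble_predict_batch is likewise a name I am assuming, not from the answer):

def ensemble_predict_batch(df):
    # average the per-group prediction and the all-data prediction per row
    per_group = predict_by_group(df)
    overall = model.predict(df[["feature_1", "feature_2"]])
    return (per_group + overall) / 2

# e.g. ensemble_predict_batch(test) -> one averaged prediction per test row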

This answer uses two regressors, RandomForestRegressor and GradientBoostingRegressor.

I added 2013 data so that r2_score can be computed; it needs more than one test sample. I also added data from other years. Copy the text below and save it to a txt file.

First we process the data file and separate train and test via dataframe manipulation. We then create a model for each regressor: models 1.1 and 1.2 for groups "a" and "b" respectively, and model 2 for all the data. After creating each model we save it to disk for later processing.

After the models are created, we make predictions using all the test data and also a single row. The r2_score and MAE metrics are printed as well.

The last part tests the saved model file by loading it and letting it predict on a test input. Predictions from the model in memory and the model loaded from disk should be identical. There are also sample input types and an example of how to use them in a custom prediction function.

See also the docstring and comments in the code for how this works.

data.txt

group        feature_1        feature_2       year            dependent_variable
group_a         12               19           2010               0.4
group_a          7               15           2010               1.5
group_a         11               13           2011               0.9
group_a          8               8            2011               2.1
group_a         10               5            2012               1.2
group_a         11               9            2012               2.6
group_a         16               9            2013               3.2
group_a         8               10            2013               2.6
group_b         8               29            2010               0.6
group_b         11              18            2010               1.5
group_b         9               33            2011               0.1 
group_b         20              15            2011               2.8 
group_b         111             15            2012               2.1 
group_b         99              10            2012               3.6
group_b         16              19            2013               12.2
group_b         4                8            2013               5.1

Code

myensemble.py

"""sklearn ensemble modeling.

Dependencies:
    * sklearn
    * pandas
    * numpy

References:
    * https://scikit-learn.org/stable/modules/classes.html?highlight=ensemble#module-sklearn.ensemble
    * https://pandas.pydata.org/docs/user_guide/indexing.html
"""


from typing import List, Union, Optional
import pickle  # for saving file to disk

from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import r2_score
from sklearn.metrics import mean_absolute_error
import pandas as pd
import numpy as np


def make_model(regressor, regname: str, modelfn: str, dfX: pd.DataFrame, dfy: pd.DataFrame):
    """Creates a model.

    Args:
        regressor: Can be RandomForestRegressor or GradientBoostingRegressor.
        regname: Regressor name.
        modelfn: Filename where the model will be saved to disk.
        dfX: The features in pandas dataframe.
        dfy: The target in pandas dataframe.

    Returns:
        Model
    """
    X = dfX.to_numpy()
    y = dfy.to_numpy()
    model = regressor(random_state=0)
    model.fit(X, y)

    # Save model.
    with open(f'{regname}_{modelfn}', 'wb') as f:
        pickle.dump(model, f)

    return model


def get_prediction(model, test: Union[List, pd.DataFrame, np.ndarray]) -> Optional[np.ndarray]:
    """Returns prediction based on model and test input or None.
    """
    if isinstance(test, (list, np.ndarray)):
        return model.predict([test])
    if isinstance(test, pd.DataFrame):
        return model.predict(np.array(test))
    return None


def model_and_prediction(df: pd.DataFrame, regressor, regname: str, modelfn: str):
    """Build model and show prediction and metrics.

    To build a model we need a training data X with features
    and data y with target or dependent values.

    Args:
        df: A dataframe.
        regressor: Can be RandomForestRegressor or GradientBoostingRegressor.
        regname: The regressor name.
        modelfn: The filename where model will be saved to disk.

    Returns:
        None
    """
    features = ['feature_1', 'feature_2']

    # 1. Get the train dataframe
    train = df.loc[df.year != 2013]  # exclude 2013 in training data
    train_feature = train[features]  # select the features column
    train_target = train.dependent_variable  # select the dependent column

    model = make_model(regressor, regname, modelfn, train_feature, train_target)

    # 2. Get the test dataframe
    test = df.loc[df.year == 2013]  # only include 2013 in test data
    test_feature = test[features]
    test_target = test.dependent_variable

    # 3. Get the prediction from all rows in test feature. See step 5
    # for single data prediction.
    prediction: np.ndarray = model.predict(np.array(test_feature))

    print(f'test feature:\n{np.array(test_feature)}')
    print(f'test prediction: {prediction}')  # prediction[0] ...
    print(f'test target: {np.array(test_target)}')

    # 4. metrics
    print(f'r2_score: {r2_score(test_target, prediction)}')
    print(f'mean_absolute_error: {mean_absolute_error(test_target, prediction)}\n')

    # 5. Get prediction from the first row of test features.
    prediction_1: np.ndarray = model.predict(np.array(test_feature.iloc[[0]]))
    print(f'1st row test:\n{test_feature.iloc[[0]]}')
    print(f'1st row test prediction array: {prediction_1}')
    print(f'1st row test prediction value: {prediction_1[0]}\n')  # get the element value


def main():
    datafn = 'data.txt'
    df = pd.read_fwf(datafn)
    print(df.to_string(index=False))

    # A. Create models for each type of regressor.
    regressors = [(RandomForestRegressor, 'RandomForest'),
                  (GradientBoostingRegressor, 'GradientBoosting')]

    for (r, name) in regressors:
        print(f'::: Regressor: {name} :::\n')

        # Model 1 using group_a
        print(':: MODEL 1.1 ::')
        grp = 'group_a'
        modelfn = f'{grp}.pkl'  # filename of model to be save to disk
        dfa = df.loc[df.group == grp]  # select group
        model_and_prediction(dfa, r, name, modelfn)

        # Model 1 using group_b
        print(':: MODEL 1.2 ::')
        grp = 'group_b'
        modelfn = f'{grp}.pkl'
        dfb = df.loc[df.group == grp]
        model_and_prediction(dfb, r, name, modelfn)

        # Model 2 using group a and b
        print(':: MODEL 2 ::')
        grp = 'group_ab'
        modelfn = f'{grp}.pkl'
        dfab = df.loc[(df.group == 'group_a') | (df.group == 'group_b')]
        model_and_prediction(dfab, r, name, modelfn)

    # B. Test saved model file prediction.
    print('::: Prediction from loaded model :::')
    mfn = 'GradientBoosting_group_ab.pkl'
    print(f'model: gradient boosting model 2, {mfn}')

    with open(mfn, 'rb') as f:
        loaded_model = pickle.load(f)

    # test: group_b  4  8  2013  5.1    
    test = [4, 8]
    prediction = loaded_model.predict([test])
    print(f'test: {test}')
    print(f'prediction: {prediction[0]}\n')

    # C. Use get_prediction().

    # input from list
    test = [4, 8]
    prediction = get_prediction(loaded_model, test)
    print(f'test from list input:\n{test}')
    print(f'prediction from get_prediction() with list input: {prediction}\n')

    # input from dataframe
    testdata = {
        'feature_1': [8, 12],
        'feature_2': [19, 15],
    }
    testdf = pd.DataFrame(testdata)
    testrow = testdf.iloc[[0]]  # first row [8, 19]
    prediction = get_prediction(loaded_model, testrow)
    print(f'test from df input:\n{testrow}')
    print(f'prediction from get_prediction() with df input: {prediction}\n')

    testrow = testdf.iloc[[1]]  # second row [12, 15]
    prediction = get_prediction(loaded_model, testrow)
    print(f'test from df input:\n{testrow}')
    print(f'prediction from get_prediction() with df input: {prediction}\n')

    # input from numpy
    test = [8, 9]
    testnp = np.array(test)
    prediction = get_prediction(loaded_model, testnp)
    print(f'test from numpy input:\n{testnp}')
    print(f'prediction from get_prediction() with numpy input: {prediction}\n')


if __name__ == '__main__':
    main()

Output

  group  feature_1  feature_2  year  dependent_variable
group_a         12         19  2010                 0.4
group_a          7         15  2010                 1.5
group_a         11         13  2011                 0.9
group_a          8          8  2011                 2.1
group_a         10          5  2012                 1.2
group_a         11          9  2012                 2.6
group_a         16          9  2013                 3.2
group_a          8         10  2013                 2.6
group_b          8         29  2010                 0.6
group_b         11         18  2010                 1.5
group_b          9         33  2011                 0.1
group_b         20         15  2011                 2.8
group_b        111         15  2012                 2.1
group_b         99         10  2012                 3.6
group_b         16         19  2013                12.2
group_b          4          8  2013                 5.1
::: Regressor: RandomForest :::

:: MODEL 1.1 ::
test feature:
[[16  9]
 [ 8 10]]
test prediction: [1.811 2.186]
test target: [3.2 2.6]
r2_score: -10.67065000000004
mean_absolute_error: 0.9015000000000026

1st row test:
   feature_1  feature_2
6         16          9
1st row test prediction array: [1.811]
1st row test prediction value: 1.8109999999999986

:: MODEL 1.2 ::
test feature:
[[16 19]
 [ 4  8]]
test prediction: [2.116 2.408]
test target: [12.2  5.1]
r2_score: -3.3219170799444546
mean_absolute_error: 6.388

1st row test:
    feature_1  feature_2
14         16         19
1st row test prediction array: [2.116]
1st row test prediction value: 2.116000000000001

:: MODEL 2 ::
test feature:
[[16  9]
 [ 8 10]
 [16 19]
 [ 4  8]]
test prediction: [2.425 2.145 1.01  1.958]
test target: [ 3.2  2.6 12.2  5.1]
r2_score: -1.3250936994738867
mean_absolute_error: 3.8905000000000016

1st row test:
   feature_1  feature_2
6         16          9
1st row test prediction array: [2.425]
1st row test prediction value: 2.4249999999999985

::: Regressor: GradientBoosting :::

:: MODEL 1.1 ::
test feature:
[[16  9]
 [ 8 10]]
test prediction: [2.59996945 2.21271005]
test target: [3.2 2.6]
r2_score: -1.8335008778823685
mean_absolute_error: 0.4936602458577084

1st row test:
   feature_1  feature_2
6         16          9
1st row test prediction array: [2.59996945]
1st row test prediction value: 2.59996945439128

:: MODEL 1.2 ::
test feature:
[[16 19]
 [ 4  8]]
test prediction: [1.99807124 2.63511811]
test target: [12.2  5.1]
r2_score: -3.3703627491779713
mean_absolute_error: 6.333405322236132

1st row test:
    feature_1  feature_2
14         16         19
1st row test prediction array: [1.99807124]
1st row test prediction value: 1.9980712422931164

:: MODEL 2 ::
test feature:
[[16  9]
 [ 8 10]
 [16 19]
 [ 4  8]]
test prediction: [3.60257456 2.26208935 0.402739   2.10950224]
test target: [ 3.2  2.6 12.2  5.1]
r2_score: -1.538939968014979
mean_absolute_error: 3.882060991360607

1st row test:
   feature_1  feature_2
6         16          9
1st row test prediction array: [3.60257456]
1st row test prediction value: 3.6025745572622014

::: Prediction from loaded model :::
model: gradient boosting model 2, GradientBoosting_group_ab.pkl
test: [4, 8]
prediction: 2.1095022367629728

test from list input:
[4, 8]
prediction from get_prediction() with list input: [2.10950224]

test from df input:
   feature_1  feature_2
0          8         19
prediction from get_prediction() with df input: [0.50307204]

test from df input:
   feature_1  feature_2
1         12         15
prediction from get_prediction() with df input: [1.46058714]

test from numpy input:
[8 9]
prediction from get_prediction() with numpy input: [2.30007317]

First, create models using time-series algorithms (which use only the date variable and the dependent variable), fbprophet (which uses features + date + dependent variable), and tree-based regression algorithms like CatBoost/XGBoost/LightGBM (which use features + date + dependent variable).

Using each of the mentioned algorithms, create models for each group (a bottom-up approach). Different models will perform well for different groups. Take a weighted mean based on the models' performance: if, say, group_a predictions perform best with CatBoost, then with fbprophet, and then with an exponential moving average, use weights proportional to the accuracies obtained from these models.
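As a minimal sketch of that weighting (the score values and the prediction arrays pred_catboost, pred_prophet, pred_ema are placeholders, not results from this post):

import numpy as np

# assumed validation accuracies for three models on group_a (placeholders)
scores = np.array([0.82, 0.74, 0.61])    # CatBoost, fbprophet, EMA
weights = scores / scores.sum()          # weights proportional to accuracy

# per-model prediction arrays for the same test rows (hypothetical)
preds = np.vstack([pred_catboost, pred_prophet, pred_ema])
ensemble = weights @ preds               # weighted mean, one value per test row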

You can aggregate the results of the group-level models to get aggregated results. You can also create separate models on aggregated data (summing over groups within each year).
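For the aggregated-data variant, a short pandas sketch (assuming the question's dataframe, loaded as data):

# sum features and target across groups within each year
yearly = data.groupby("year")[["feature_1", "feature_2", "dependent_variable"]].sum().reset_index()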

If I understand the last line of your question correctly, in the context of the years, you are looking to capture the trend within a given calendar year via model 1 and the trend across multiple years via model 2. Model 2 is where there could be an issue, because you mentioned scikit-learn compatible models.

So I'll try to explain the approach I would take.

Model 1 is pretty straightforward: it is a regression problem, so selecting the best regression model should not be an issue. You can find it by comparing results within a given calendar year.
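A minimal sketch of that comparison, using cross-validation on one calendar year; the candidate list is illustrative, and the toy data has too few rows per year for this to be meaningful, so treat it as a template for the real dataset:

from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

candidates = [RandomForestRegressor(), GradientBoostingRegressor(), Ridge()]
year_df = data.loc[data["year"] == 2012]   # one calendar year
X = year_df[["feature_1", "feature_2"]]
y = year_df["dependent_variable"]

for est in candidates:
    scores = cross_val_score(est, X, y, cv=2, scoring="neg_mean_absolute_error")
    print(type(est).__name__, scores.mean())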

Model 2 is where you would like to capture time-series features, YoY effects and the like. While there isn't any model in SKLearn that directly captures the time parameter the way ARIMA or RNNs do, there are ways to use SKLearn models for forecasting, and a lot of it depends on feature engineering. You could take features 1 and 2, sort them, shift them, and then take a diff to create new features, say 1a and 2a, which could then be used with any regression model. These new features would capture the time element. I could write a lengthy post on that here, but I think you'll find this link much better written.
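A minimal sketch of that feature engineering on the question's dataframe (I'm naming the new columns feature_1a and feature_2a to match the answer's 1a/2a):

# sort within each group by year, then take year-over-year differences
data = data.sort_values(["group", "year"])
data["feature_1a"] = data.groupby("group")["feature_1"].diff()
data["feature_2a"] = data.groupby("group")["feature_2"].diff()
# the first year of each group has no previous row, so it becomes NaN;
# drop or impute those rows before fitting a regression model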

Now, coming to ensembling the two models together: as this is a regression problem, the best way, I feel, would be to assign weights to the outputs of both models, say alpha for model 1 and beta for model 2. Treat alpha and beta as hyperparameters and tune them using the data.
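A minimal sketch of that tuning, constraining beta = 1 - alpha and scanning alpha on held-out data (pred_1, pred_2 and y_val are hypothetical validation arrays, not from this post):

import numpy as np
from sklearn.metrics import mean_absolute_error

# pred_1, pred_2: validation predictions from model 1 and model 2 (assumed)
# y_val: the matching true target values
best_alpha, best_mae = None, np.inf
for alpha in np.linspace(0.0, 1.0, 101):
    mae = mean_absolute_error(y_val, alpha * pred_1 + (1 - alpha) * pred_2)
    if mae < best_mae:
        best_alpha, best_mae = alpha, mae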

This should make a pretty good ensemble with SKLearn models.
