How to correctly predict target variables with sklearn regressor in Python?

I want to predict future prices from marketing time-series data. I use sklearn for this task because it is more flexible than statsmodels and fbprophet. For preprocessing, I tried to remove seasonality from the time series by taking logarithms of both the selected features and the target variable, and then used the log values and lag values to make predictions. What I don't understand is how each individual feature (with its lag and log values) contributes to predicting the target variable. In a typical prediction problem, we first normalize and preprocess the features, then select features by their importance to reduce the dimensionality of the training data, then train the model and get the corresponding predictions.

new update

In a time-series setting, however, we need to tackle seasonality first, then use the log and lag values of the features to make predictions. In my attempt, I simplified the process by not using many features (I didn't use feature importance): I selected just two features and tried to predict the target variable, where each feature has its log and lag values in order to remove seasonality. Why is my way of predicting the target variable not effective? What would be a better approach? Can anyone point out any possible suggestions or a coding remedy?

Thanks to @smci, who encouraged me to make the question specific and focus on only one problem in my post. I have specified the data source link and used the time-series data as follows:

The time-series data was taken from http://statistics.mla.com.au/Report/List, which is a market information statistical database. I shared the reproducible data in this link and my full coding attempt in this gist.

my attempt

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.ensemble import AdaBoostRegressor

url = "https://gist.githubusercontent.com/adamFlyn/f71e2e0e66303df23dfc2f37ec98e8c7/raw/ba9e871e90201eb504e30127e99cf6179c3e3b18/tradedf.csv"

df = pd.read_csv(url, parse_dates=['dates'])
df.drop(columns=['Unnamed: 0'], inplace=True)

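df['log_eyci'] = np.log(df.eyci)  # log of the target price series
df['log_aus_avg_rain'] = np.log(df['aus_avg_rain'])  # log of average rainfall

# lagged rainfall (1 to 3 periods back) and its log
for i in range(3):
    df[f'avgRain_lag_{i+1}'] = df['aus_avg_rain'].shift(i+1)
    df.dropna(inplace=True)  # drop the leading rows that have no lag value
    df[f'log_avgRain_lag_{i+1}'] = np.log(df[f'avgRain_lag_{i+1}'])

# lagged price, its log, and the log difference against the current price
for i in range(3):
    df[f'eyci_lag_{i+1}'] = df.eyci.shift(i+1)
    df.dropna(inplace=True)  # drop the leading rows that have no lag value
    df[f'log_eyci_lag_{i+1}'] = np.log(df[f'eyci_lag_{i+1}'])
    df[f'log_difference_{i+1}'] = df.log_eyci - df[f'log_eyci_lag_{i+1}']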

X, Y = df[['log_difference_2', 'log_difference_3', 'aus_avg_rain', 'aus_slg_fmCatl']], df['log_difference_1']
# shuffle=False keeps chronological order: first 80% for training, last 20% for testing
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, shuffle=False)

fit the model with AdaBoost Regressor

mdl_adaboost = AdaBoostRegressor(n_estimators=100, learning_rate=0.01)
mdl_adaboost.fit(X_train, Y_train)   # Fit the data
pred = mdl_adaboost.predict(X_test)  # make predictions

To plot the prediction output, I tried the following:

## make plot
test_size = X_test.shape[0]
plt.plot(list(range(test_size)), np.exp(df.tail(test_size).log_eyci_lag_1  + pred), label='predicted', color='red')
plt.plot(list(range(test_size)), df.tail(test_size).eyci, label='real', color='blue')
plt.legend(loc='best')
plt.title('Predicted vs Real with log difference values')
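
The plot above turns each predicted log difference back into a price by adding it to the previous period's actual log price and exponentiating, so it shows one-step-ahead forecasts rather than a multi-step forecast. A minimal sketch of that back-transform on made-up numbers (the function name `prices_from_log_diffs` and the sample values are assumptions for illustration, not part of the original post):

```python
import numpy as np

def prices_from_log_diffs(prev_prices, predicted_log_diffs):
    """Back-transform one-step log-difference predictions to price levels.

    If d_t = log(p_t) - log(p_{t-1}), then p_t = p_{t-1} * exp(d_t).
    """
    prev_prices = np.asarray(prev_prices, dtype=float)
    predicted_log_diffs = np.asarray(predicted_log_diffs, dtype=float)
    return prev_prices * np.exp(predicted_log_diffs)

# Made-up prices: each period is 10% above the previous one.
prev = np.array([100.0, 110.0, 121.0])
true_next = np.array([110.0, 121.0, 133.1])

# A perfect log-difference prediction recovers the true next prices.
log_diffs = np.log(true_next) - np.log(prev)
recovered = prices_from_log_diffs(prev, log_diffs)
print(recovered)
```

Because the previous actual price is needed at every step, this only works for dates where the lag-1 price is already known.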

@smci pointed out that using train, test = X[0:size], X[size:len(X)] is not a good idea. I am wondering how I should correct this limitation of my approach.
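
One possible alternative to a single head/tail split is scikit-learn's `TimeSeriesSplit`, which produces several (train, test) slices where the training rows always come before the test rows. The sketch below uses synthetic data, and the feature matrix, coefficients, and hyperparameters are assumptions chosen only to make it runnable:

```python
import numpy as np
from sklearn.ensemble import AdaBoostRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import TimeSeriesSplit

# Synthetic stand-in for time-ordered features and target (assumption).
rng = np.random.default_rng(0)
n = 120
X = rng.normal(size=(n, 3))
y = 0.5 * X[:, 0] + rng.normal(scale=0.1, size=n)

tscv = TimeSeriesSplit(n_splits=5)
fold_mae = []
for train_idx, test_idx in tscv.split(X):
    # Every training index precedes every test index, so no future rows
    # are used to fit the model for a given fold.
    assert train_idx.max() < test_idx.min()
    model = AdaBoostRegressor(n_estimators=50, learning_rate=0.1, random_state=0)
    model.fit(X[train_idx], y[train_idx])
    pred = model.predict(X[test_idx])
    fold_mae.append(mean_absolute_error(y[test_idx], pred))

print(fold_mae)  # one error value per fold
```

Averaging the per-fold errors gives a less split-dependent estimate than one fixed 80/20 cut, though whether it suits this dataset is a judgment call.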

The one problem I am asking about in this question is how to predict target variables from time-series data that might have seasonality. I did use log and lag values for the features and target variables. Now I am a little lost as to how to use them for prediction, and how they might or might not contribute to predicting the target variables.
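
One way to check how much each lagged feature contributes is to compare the fitted model's `feature_importances_` with `permutation_importance` from `sklearn.inspection` on held-out rows. This is a minimal sketch on synthetic data. The column names `driver` and `noise`, and the rule that the target depends only on `driver_lag_1`, are assumptions made up for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import AdaBoostRegressor
from sklearn.inspection import permutation_importance

# Synthetic series: the target depends on the lag-1 value of "driver" only.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({"driver": rng.normal(size=n), "noise": rng.normal(size=n)})
df["driver_lag_1"] = df["driver"].shift(1)
df["noise_lag_1"] = df["noise"].shift(1)
df["target"] = 2.0 * df["driver_lag_1"] + rng.normal(scale=0.1, size=n)
df = df.dropna()  # the first row has no lag value

features = ["driver_lag_1", "noise_lag_1"]
split = int(len(df) * 0.8)  # chronological split, no shuffling
train, test = df.iloc[:split], df.iloc[split:]

model = AdaBoostRegressor(n_estimators=100, random_state=0)
model.fit(train[features], train["target"])

# Importance derived from the fitted trees (computed on training data).
impurity = dict(zip(features, model.feature_importances_))

# Drop in test score when one feature column is shuffled.
perm = permutation_importance(
    model, test[features], test["target"], n_repeats=10, random_state=0
)
permutation = dict(zip(features, perm.importances_mean))

print(impurity)
print(permutation)
```

Here both measures should rank `driver_lag_1` well above `noise_lag_1`. On real data the two can disagree, which is itself a useful signal.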

intuition behind this

I developed my intuition for predicting commodity prices from this site; so far, my way of modeling this task remains problematic. I thank @smci for bringing up this source as well. Can anyone suggest a possible coding remedy or the right way to make this type of prediction in scikit-learn? Any ideas?

new update: objective

I used the Australian market information database; what I am trying to do is predict the Australian beef price, as this site shows. The historical market price data is from the Australian marketing information database, and I am going to forecast the Australian beef price using simple features (such as cattle slaughter numbers, cattle production, and so on). Since I am using monthly data, I think modeling monthly seasonality would be fine. Again, many thanks to @smci for pushing me to clarify my post and for his helpful feedback.

This is mostly offtopic to SO and very broad; you're asking multiple questions spanning DataScience.SE and CrossValidated: how to use detrending, which type of model to use, how to use a rolling-window technique on a single time-series dataset to generate multiple (train, test) slices, and where to get monthly datasets for the extrinsic variables below:

  • Your dataset (please add citation) is monthly (wholesale) USDA beef prices over 2015-01... 2020-08. Are these prices from Australia ( https://www.agric.wa.gov.au/newsletters/wabc/western-australian-beef-commentary-issue-13?page=0%2C2 ), or the US? (Please add citation, data dictionary to explain columns, etc.). It's good to develop an intuition for what you're trying to model, not just throw more data and more complex models at it.

  • and you want to predict future prices for 12-18 months: 2020-09.. 2022-02

  • So I expect there will be both:

    • annual seasonality
    • longer-term economic supply-and-demand fluctuations
      • dependence on US(?)/Aus economy
      • dependence on whichever foreign economies US(?)/Aus exports each particular type of beef to (China, Japan, Korea et al.)
    • other extrinsic events (recessions, weather crises, tariffs, subsidies, US soybean trade wars, etc.) which simply can't be predicted from the historical beef price values (and if you throw more historical datasets at it, or go further back in time, you'll only clog up your model without adding predictive power for the future).
  • so if you want more accuracy, you really want a macromodel of all these extrinsic things - not just the raw historical dataset values themselves.
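
For the annual seasonality mentioned in the list above, one common option on monthly data is a seasonal difference at lag 12, optionally combined with a month-of-year feature. The sketch below uses a synthetic log-price series (the trend and seasonal amplitude are assumed values, not taken from the dataset) to show that differencing at lag 12 cancels an exactly repeating yearly pattern, which taking logarithms alone does not do:

```python
import numpy as np
import pandas as pd

# Synthetic monthly log prices for 2015-01 .. 2020-08 (assumed shape):
# a linear trend plus a seasonal cycle that repeats every 12 months.
dates = pd.date_range("2015-01-01", periods=68, freq="MS")
t = np.arange(len(dates))
log_price = 6.0 + 0.01 * t + 0.1 * np.sin(2 * np.pi * t / 12)
df = pd.DataFrame({"log_price": log_price}, index=dates)

# Month-of-year as a candidate seasonal feature (1 = January).
df["month"] = df.index.month

# Seasonal difference: this month's value minus the same month last year.
# The repeating 12-month cycle cancels and only the yearly trend remains.
seasonal_diff = df["log_price"].diff(12)

print(seasonal_diff.dropna().head())
```

The first 12 rows have no value a year earlier, so they are lost to the difference. With only 68 monthly observations that cost is worth weighing.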
