
How to correctly predict target variables with a sklearn regressor in Python?

I want to predict future prices from marketing time-series data. I use sklearn for this task because it is more flexible than statsmodels and fbprophet. For preprocessing, I removed seasonality from the time-series data by taking logarithmic values of both the selected features and the target variable, then used the log values and lag values to make predictions. What I don't understand is how each individual feature (each has a lag value and a log value) contributes to predicting the target variable. In a typical prediction problem, we first normalize and preprocess the features, then select features by their feature importance to reduce the dimensionality of the training data, then train the model and get the corresponding prediction.
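The importance-based feature selection step described above can be sketched with sklearn's `SelectFromModel` wrapper. This is a minimal illustration on synthetic data, not the question's actual dataset; all names here are placeholders:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import AdaBoostRegressor
from sklearn.feature_selection import SelectFromModel

# Synthetic stand-in for a preprocessed feature matrix
X, y = make_regression(n_samples=200, n_features=10, n_informative=3, random_state=42)

# Fit once to obtain feature importances, then keep only the features
# whose importance exceeds the mean importance
base = AdaBoostRegressor(n_estimators=50, random_state=42).fit(X, y)
selector = SelectFromModel(base, prefit=True, threshold="mean")
X_reduced = selector.transform(X)

print(X.shape, "->", X_reduced.shape)
```

The same pattern works with any estimator exposing `feature_importances_` (tree ensembles) or `coef_` (linear models).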

new update

In a time-series setting, however, we need to tackle seasonality first, then use the log values and lag values of the features to make predictions. In my attempt, I simplified the process by not using many features (I didn't use feature importance): I selected just two features and tried to predict the target variable (where each feature has its log values and lag values in order to remove seasonality). Why is my way of predicting the target variable not effective? What would be a better approach? Can anyone point out possible suggestions or a coding remedy?
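For monthly data, the usual way to handle seasonality before modeling is a seasonal difference of the log series at lag 12 (annual pattern), combined with a lag-1 log difference for trend. A minimal sketch on a toy monthly series (the series and its values are invented for illustration):

```python
import numpy as np
import pandas as pd

# Toy monthly series with an annual seasonal pattern plus a trend
idx = pd.date_range("2015-01-01", periods=60, freq="MS")
price = pd.Series(100 + 10 * np.sin(2 * np.pi * idx.month / 12)
                  + 0.5 * np.arange(60), index=idx)

log_price = np.log(price)
# Seasonal difference at lag 12 removes a stable annual pattern
seasonal_diff = log_price.diff(12)
# Lag-1 difference of the log gives (approximate) monthly returns, removing trend
log_return = log_price.diff(1)

print(seasonal_diff.dropna().shape, log_return.dropna().shape)
```

After modeling on the differenced scale, predictions must be inverted (add back the lagged log value and exponentiate) to recover prices, which is what the plotting code further below attempts.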

Thanks to @smci, who encouraged me to specify the question and focus on one problem only in my post. I specified the data source link and used time-series data as follows:

The time-series data was taken from http://statistics.mla.com.au/Report/List , which is a market information statistical database. I shared the reproducible data at this link, and my full coding attempt in this gist.

my attempt

import numpy as np   # needed for np.log below
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import AdaBoostRegressor

url = "https://gist.githubusercontent.com/adamFlyn/f71e2e0e66303df23dfc2f37ec98e8c7/raw/ba9e871e90201eb504e30127e99cf6179c3e3b18/tradedf.csv"

df = pd.read_csv(url, parse_dates=['dates'])
df.drop(columns=['Unnamed: 0'], inplace=True)

df['log_eyci'] = np.log(df.eyci)  ### Log value
df['log_aus_avg_rain'] = np.log(df['aus_avg_rain'])  ### Log value

for i in range(3):
    df[f'avgRain_lag_{i+1}'] = df['aus_avg_rain'].shift(i+1)
    df[f'log_avgRain_lag_{i+1}'] = np.log(df[f'avgRain_lag_{i+1}'])

for i in range(3):
    df[f'eyci_lag_{i+1}'] = df.eyci.shift(i+1)
    df[f'log_eyci_lag_{i+1}'] = np.log(df[f'eyci_lag_{i+1}'])
    df[f'log_difference_{i+1}'] = df.log_eyci - df[f'log_eyci_lag_{i+1}']

df.dropna(inplace=True)  # drop the leading NaN rows once, after all lags are built

X, Y = df[['log_difference_2', 'log_difference_3', 'aus_avg_rain', 'aus_slg_fmCatl']], df['log_difference_1']
# shuffle=False keeps the chronological order; random_state has no effect when shuffle is off
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, shuffle=False)

Fit the model with AdaBoostRegressor:

mdl_adaboost = AdaBoostRegressor(n_estimators=100, learning_rate=0.01)
mdl_adaboost.fit(X_train, Y_train)   # Fit the data
pred = mdl_adaboost.predict(X_test)  # make predictions

When I tried to plot the prediction output, I did the following:

import matplotlib.pyplot as plt

## make plot
test_size = X_test.shape[0]
plt.plot(range(test_size), np.exp(df.tail(test_size).log_eyci_lag_1 + pred), label='predicted', color='red')
plt.plot(range(test_size), df.tail(test_size).eyci, label='real', color='blue')
plt.legend(loc='best')
plt.title('Predicted vs Real with log difference values')
plt.show()

@smci pointed out that using train, test = X[0:size], X[size:len(X)] is not a good idea. I am wondering how I should correct this limitation of my approach.
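One standard remedy for the single chronological split is sklearn's `TimeSeriesSplit`, which generates several (train, test) slices where training data always precedes test data. A minimal sketch (24 stands in for the number of monthly observations):

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(24).reshape(-1, 1)  # 24 ordered monthly observations

# Each split trains on an expanding window of the past and tests on the
# following block, so no future data leaks into training
tscv = TimeSeriesSplit(n_splits=4)
for train_idx, test_idx in tscv.split(X):
    print(f"train: 0..{train_idx[-1]}, test: {test_idx[0]}..{test_idx[-1]}")
```

Averaging a model's error over all splits gives a much more honest estimate of forecast accuracy than a single tail split.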

The one problem I am asking about in this question is how to predict target variables from time-series data that might have seasonality. I did use log and lag values for the features and the target variable. Now I am a little lost as to how to use those for prediction, and how they might or might not contribute to predicting the target variable.

intuition behind this

I developed my intuition for predicting commodity prices from this site; so far, my way of modeling this task remains problematic. I thank @smci for bringing up this source as well. Can anyone suggest a possible coding remedy or the right way to make this type of prediction in scikit-learn? Any idea?

new update: objective

I used the Australian market information database. What I am trying to do is predict the Australian beef price, as this site shows. The historical market price data is from the Australian market information database, and I am going to forecast the Australian beef price using simple features (such as cattle slaughter numbers, cattle production, and so on). Since I am using monthly data, I think accounting for monthly seasonality would be fine. Again, thanks a lot to @smci for pushing me to clarify my post and for his helpful feedback.

This is mostly off-topic for SO and very broad: you're asking multiple questions spanning DataScience.SE and CrossValidated, including how to use detrending, which type of model to use, how to use a rolling-window technique on a single timeseries dataset to generate multiple (train, test) slices, and where to get monthly datasets for the extrinsic variables below:

  • Your dataset (please add citation) is monthly (wholesale) USDA beef prices over 2015-01... 2020-08. Are these prices from Australia ( https://www.agric.wa.gov.au/newsletters/wabc/western-australian-beef-commentary-issue-13?page=0%2C2 ), or the US? (Please add citation, data dictionary to explain columns, etc.). It's good to develop an intuition for what you're trying to model, not just throw more data and more complex models at it.

  • and you want to predict future prices for 12-18 months: 2020-09 .. 2022-02

  • So I expect there will be both:

    • annual seasonality
    • longer-term economic supply-and-demand fluctuations
      • dependence on US(?)/Aus economy
      • dependence on whichever foreign economies US(?)/Aus exports each particular type of beef to (China, Japan, Korea et al.)
    • other extrinsic events (recessions, weather crises, tariffs, subsidies, US soybean trade wars, etc.) which simply can't be predicted from the historical beef price values (and if you throw more historical datasets at it, or go further back in time, you'll only clog up your model without adding predictive power for the future).
  • so if you want more accuracy, you really want a macromodel of all these extrinsic things, not just the raw historical dataset values themselves.
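The rolling-window evaluation mentioned above can be sketched as a walk-forward loop: refit on everything up to time t, predict t+1, and compare against a naive persistence baseline. This is an illustrative toy on a synthetic monthly series, not the questioner's data, and all names are placeholders:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import AdaBoostRegressor

rng = np.random.default_rng(0)
t = np.arange(120)
# Toy price series: trend + annual seasonality + noise (illustrative only)
y = pd.Series(100 + 0.3 * t + 5 * np.sin(2 * np.pi * t / 12) + rng.normal(0, 1, 120))

# Lag features: recent values plus the same month last year
feats = pd.concat({f"lag_{k}": y.shift(k) for k in (1, 2, 12)}, axis=1).dropna()
target = y.loc[feats.index]

errors = []
# Walk-forward evaluation: refit on all data up to each point, predict the next
for split in range(90, len(feats)):
    model = AdaBoostRegressor(n_estimators=50, random_state=0)
    model.fit(feats.iloc[:split], target.iloc[:split])
    pred = model.predict(feats.iloc[split:split + 1])
    errors.append(abs(pred[0] - target.iloc[split]))

# Persistence baseline: predict "same as last month" over the same window
naive = target.diff().abs().iloc[90:].mean()
print(f"model MAE={np.mean(errors):.2f}  naive MAE={naive:.2f}")
```

If the model does not beat the persistence baseline over the walk-forward window, extra features or a more complex model are not adding predictive power, which is the answer's core point about extrinsic drivers.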
