Using Linear Regression for Yearly distributed Time Series Data to get predictions after -N- years

Question

I am stuck on an unusual problem. I have time series data covering the years 2009 to 2018, and I have to answer an odd question using it.

The data sheet contains the energy generation statistics of each Australian state/territory in GWh (gigawatt hours) for the years 2009 to 2018.

It has the following fields:

State: Names of the different Australian states.
Fuel_Type: The type of fuel which is consumed.
Category: Whether a fuel is considered renewable or non-renewable.
Years: The years for which the energy consumption is recorded.

Problem:

How can I use a linear regression model to predict what percentage of the energy generation of a state X (say Victoria) will come from a source Y (say renewable energy sources) in a year Z (say 2100)?

How am I supposed to use a linear regression model to solve this? The problem is beyond my reach.

Data is from this link

Answer 1

I think you first need to decide what your model should look like in the end. You probably want something that relates the dependent variable y (the fraction of renewable energy) to your input features. One of those features should be the year, since you are interested in predicting how y changes as you vary this quantity. So a very basic linear model could be y = beta1 * x + beta0, with x being the year, beta1 and beta0 being the parameters you want to fit, and y being the fraction of renewable energy. This of course ignores the state component, but I think a simple start could be to fit such a model to the state you are interested in. The code for such an approach could look like this:

import matplotlib
matplotlib.use("agg")
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sbn
from scipy.stats import linregress
import numpy as np

def fracRenewable(df):
    # share of the total amount that comes from renewable fuels
    renewable = df.loc[df["Category"] == "Renewable fuels", "amount"].sum()
    return renewable / df["amount"].sum()


# load in data

data = pd.read_csv("./energy_data.csv")

# convert data to tidy format and rename columns
# (the chained calls are wrapped in parentheses so they can span several lines)
molten = (
    pd.melt(data, id_vars=["State", "Fuel_Type", "Category"])
    .rename(columns={"variable": "year", "value": "amount"})
)

# calculate fraction of renewable fuel per year
grouped = (
    molten.groupby(["year"])
    .apply(fracRenewable)
    .reset_index()
    .rename(columns={0: "amount"})
)
grouped["year"] = grouped["year"].astype(int)

# >>> grouped
#    year    amount
# 0  2009  0.029338
# 1  2010  0.029207
# 2  2011  0.032219
# 3  2012  0.053738
# 4  2013  0.061332
# 5  2014  0.066198
# 6  2015  0.069404
# 7  2016  0.066531
# 8  2017  0.074625
# 9  2018  0.077445

# fit linear model
slope, intercept, r_value, p_value, std_err = linregress(grouped["year"], grouped["amount"])

# plot result
f, ax = plt.subplots()
sbn.scatterplot(x="year", y="amount", ax=ax, data=grouped)
ax.plot(range(2009, 2030), [i*slope + intercept for i in range(2009, 2030)], color="red")
ax.set_title("Renewable fuels (simple prediction)")
ax.set(ylabel="Fraction renewable fuel")
f.savefig("test11.png", bbox_inches="tight")

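This gives you a (very simple) model to predict the fraction of renewable fuels in a given year. Note that the code above groups by year only, so it pools all states; filter molten down to a single state first if you want a per-state prediction.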
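To answer the original question for a specific year Z, the fitted line only needs to be evaluated at that year. Here is a minimal sketch using the yearly fractions printed in the comment above. It uses numpy.polyfit with degree 1 in place of linregress, which fits the same least-squares line. The helper name predict_fraction and the clipping to [0, 1] are my own additions, since a straight line extrapolated far enough will eventually leave the valid range of a fraction.

```python
import numpy as np

# yearly renewable fractions copied from the `grouped` output above
years = np.arange(2009, 2019)
fractions = np.array([
    0.029338, 0.029207, 0.032219, 0.053738, 0.061332,
    0.066198, 0.069404, 0.066531, 0.074625, 0.077445,
])

# least-squares line: fraction = slope * year + intercept
slope, intercept = np.polyfit(years, fractions, deg=1)


def predict_fraction(year):
    """Evaluate the fitted line at `year`, clipped to the valid range [0, 1]."""
    return float(np.clip(slope * year + intercept, 0.0, 1.0))


pred_2100 = predict_fraction(2100)
print(f"slope per year: {slope:.6f}")
print(f"predicted renewable fraction in 2100: {pred_2100:.3f}")
```

With only ten data points, an extrapolation more than 80 years out is extremely uncertain, so treat the result as an illustration of the mechanics and not as a forecast.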

If you want to refine the model further, I think a good start could be to group states together based on how similar they are (either based on prior knowledge or a clustering approach) and then do the predictions on those groups.
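Before grouping states, the simplest per-state variant is to filter the tidy table to one state and repeat the same fit. The sketch below assumes the tidy columns State, Category, year and amount from the code above. The numbers in toy are made up purely so the example runs, the label "Non-renewable fuels" is a guess at how the data names that category, and state_trend is a hypothetical helper.

```python
import numpy as np
import pandas as pd


def state_trend(tidy, state, category="Renewable fuels"):
    """Fit fraction-of-`category` vs. year for one state; return (slope, intercept)."""
    sub = tidy[tidy["State"] == state]
    total = sub.groupby("year")["amount"].sum()
    part = sub[sub["Category"] == category].groupby("year")["amount"].sum()
    # years with no rows for `category` count as a fraction of 0
    frac = part.reindex(total.index, fill_value=0.0) / total
    slope, intercept = np.polyfit(
        frac.index.to_numpy(dtype=float), frac.to_numpy(), deg=1
    )
    return slope, intercept


# made-up numbers, NOT the real dataset
toy = pd.DataFrame({
    "State": ["Victoria"] * 8 + ["Tasmania"] * 8,
    "Category": (["Renewable fuels"] * 4 + ["Non-renewable fuels"] * 4) * 2,
    "year": [2009, 2010, 2011, 2012] * 4,
    "amount": [10, 20, 30, 40, 90, 80, 70, 60,
               80, 82, 84, 86, 20, 18, 16, 14],
})

slope_vic, intercept_vic = state_trend(toy, "Victoria")
pred_vic_2014 = slope_vic * 2014 + intercept_vic
print(f"toy Victoria prediction for 2014: {pred_vic_2014:.3f}")
```

Passing a different category string gives the trend for any other source, as long as it matches a value that actually occurs in the Category column.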

Answer 2

Yes, you can use linear regression for forecasting. There are several ways to do it. You can:

1. fit a line to the training data and extrapolate that fitted line into the future (this is sometimes also called the drift method);
2. reduce the problem to a tabular regression problem, splitting the time series into fixed-length windows, stacking them on top of each other, and then using linear regression;
3. use other common trend methods.

Here's what (1) and (2) look like with sktime (disclaimer: I'm one of the developers):

import numpy as np
from sktime.datasets import load_airline
from sktime.forecasting.model_selection import temporal_train_test_split
from sktime.performance_metrics.forecasting import smape_loss
from sktime.forecasting.trend import PolynomialTrendForecaster
from sktime.utils.plotting.forecasting import plot_ys
from sktime.forecasting.compose import ReducedRegressionForecaster
from sklearn.linear_model import LinearRegression

y = load_airline()  # load 1-dimensional time series
y_train, y_test = temporal_train_test_split(y)  

# here I forecast all observations of the test series, 
# in your case you could only select the years you're interested in
fh = np.arange(1, len(y_test) + 1)  

# option 1
forecaster = PolynomialTrendForecaster(degree=1)
forecaster.fit(y_train)
y_pred_1 = forecaster.predict(fh)

# option 2
forecaster = ReducedRegressionForecaster(LinearRegression(), window_length=10)
forecaster.fit(y_train)
y_pred_2 = forecaster.predict(fh)
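For intuition, the window reduction of option 2 can also be sketched without sktime. The snippet below is my own illustration and not what ReducedRegressionForecaster does internally. It slices the series into fixed-length windows, stacks them as rows of a feature matrix, fits ordinary least squares with numpy.linalg.lstsq, and forecasts recursively by feeding each prediction back into the window. The function names and the toy series are hypothetical.

```python
import numpy as np


def make_windows(y, window_length):
    """Stack sliding windows as rows of X; the value after each window is the target."""
    X = np.array([y[i:i + window_length] for i in range(len(y) - window_length)])
    target = y[window_length:]
    return X, target


def fit_and_forecast(y, window_length, steps):
    """Fit a linear model on the windows and forecast `steps` values recursively."""
    y = np.asarray(y, dtype=float)
    X, target = make_windows(y, window_length)
    # append a column of ones so the model has an intercept
    X1 = np.column_stack([X, np.ones(len(X))])
    coef, *_ = np.linalg.lstsq(X1, target, rcond=None)

    window = list(y[-window_length:])
    preds = []
    for _ in range(steps):
        nxt = float(np.dot(coef[:-1], window) + coef[-1])
        preds.append(nxt)
        # slide the window forward, using the prediction as the newest value
        window = window[1:] + [nxt]
    return preds


# toy series with a perfectly linear trend (made up for illustration)
toy_series = [2.0 * t + 1.0 for t in range(12)]
forecast = fit_and_forecast(toy_series, window_length=3, steps=2)
print(forecast)
```

With only ten yearly observations, as in the question, the window has to be very short for this to leave any training rows at all, which is one reason the plain trend fit of option 1 is the more natural choice there.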

2 answers

solution1
1 ACCEPTED 2020-06-10 14:13:51

solution2
1 2020-06-10 15:09:49