Using linear regression on yearly time series data to predict N years ahead

Question

I am stuck on an unusual problem. I have time series data covering the years 2009 to 2018, and I have to answer a rather odd question using it.

The data sheet contains the energy generation statistics of each Australian state/territory in GWh (gigawatt hours) for the years 2009 to 2018.

It has the following fields:

State: Names of the different Australian states.
Fuel_Type: The type of fuel which is consumed.
Category: Determines whether a fuel is considered renewable or non-renewable.
Years: The years for which the energy consumption is recorded.

Problem:

How can I use a linear regression model to predict what percentage of state X's (say, Victoria's) energy generation will come from source Y (say, renewable energy sources) in year Z (say, 2100)?

How am I supposed to use a linear regression model to solve this problem? It is beyond my reach.

The data is from this link.

Answer 1

I think you first need to think about what your model should look like in the end: you probably want something that relates the dependent variable y (the fraction of renewable energy) to your input features. One of those features should probably be the year, since you are interested in predicting how y changes as the year varies. So a very basic linear model could be y = beta1 * x + beta0, with x being the year, beta1 and beta0 being the parameters you want to fit, and y being the fraction of renewable energy. This of course ignores the state component, but I think a simple start could be to fit such a model to the state you are interested in. The code for such an approach could look like this:

import matplotlib
matplotlib.use("agg")
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sbn
from scipy.stats import linregress
import numpy as np

def fracRenewable(df):
    # share of the total amount that comes from renewable fuels
    renewable = df.loc[df["Category"] == "Renewable fuels", "amount"].sum()
    return renewable / df["amount"].sum()


# load in data

data = pd.read_csv("./energy_data.csv")

# convert data to tidy format and rename columns
# (the parentheses let the method chain span several lines)
molten = (
    pd.melt(data, id_vars=["State", "Fuel_Type", "Category"])
    .rename(columns={"variable": "year", "value": "amount"})
)

# calculate fraction of renewable fuel per year
# (all states combined; filter `molten` on "State" first to fit a single state)
grouped = (
    molten.groupby(["year"])
    .apply(fracRenewable)
    .reset_index()
    .rename(columns={0: "amount"})
)
grouped["year"] = grouped["year"].astype(int)

# >>> grouped
#    year    amount
# 0  2009  0.029338
# 1  2010  0.029207
# 2  2011  0.032219
# 3  2012  0.053738
# 4  2013  0.061332
# 5  2014  0.066198
# 6  2015  0.069404
# 7  2016  0.066531
# 8  2017  0.074625
# 9  2018  0.077445

# fit linear model
slope, intercept, r_value, p_value, std_err = linregress(grouped["year"], grouped["amount"])

# plot result
f, ax = plt.subplots()
sbn.scatterplot(x="year", y="amount", ax=ax, data=grouped)
ax.plot(range(2009, 2030), [i*slope + intercept for i in range(2009, 2030)], color="red")
ax.set_title("Renewable fuels (simple prediction)")
ax.set(ylabel="Fraction renewable fuel")
f.savefig("test11.png", bbox_inches="tight")

This gives you a (very simple) model to predict the fraction of renewable fuels in a given year.
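To answer the "year Z" part of the question, the fitted line only needs to be evaluated at that year. Below is a minimal sketch that reuses the yearly fractions printed in the `grouped` table above (all states combined) and swaps `linregress` for `np.polyfit`, which yields the same least-squares line. The helper name `predict_fraction` and the clipping to [0, 1] are my own additions: a straight line extrapolated over 80 years can leave the valid range for a fraction, so the result should be read as a rough trend, not a reliable forecast.

```python
import numpy as np

# Yearly renewable fractions copied from the `grouped` table above
# (all states combined, since that code does not filter by state).
years = np.arange(2009, 2019)
fraction = np.array([0.029338, 0.029207, 0.032219, 0.053738, 0.061332,
                     0.066198, 0.069404, 0.066531, 0.074625, 0.077445])

# Degree-1 least-squares fit; polyfit returns the highest power first.
slope, intercept = np.polyfit(years, fraction, deg=1)


def predict_fraction(year):
    """Extrapolate the fitted line and clip it to the valid range [0, 1].

    Hypothetical helper, not part of the original answer.
    """
    return float(np.clip(slope * year + intercept, 0.0, 1.0))


target_year = 2100  # "year Z" from the question
print(f"Predicted renewable fraction in {target_year}: "
      f"{predict_fraction(target_year):.1%}")
```

Multiplying the returned fraction by 100 gives the percentage asked for in the question.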

If you want to refine the model further, I think a good start could be to group states together based on how similar they are (either based on prior knowledge or a clustering approach) and then do the predictions on those groups.
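As a rough sketch of that refinement, the same straight-line fit can be run once per state (or once per group of states, by grouping on a group label instead). The data frame below is made up purely for illustration; the `"Non-renewable fuels"` label and the helper names `renewable_fraction` and `fit_per_state` are assumptions, and only the `"Renewable fuels"` label and the column names come from the code above.

```python
import numpy as np
import pandas as pd

# Hypothetical tidy data (made-up numbers): one row per state, year and category.
tidy = pd.DataFrame({
    "State": ["A"] * 6 + ["B"] * 6,
    "year": [2016, 2016, 2017, 2017, 2018, 2018] * 2,
    "Category": ["Renewable fuels", "Non-renewable fuels"] * 6,
    "amount": [10, 90, 20, 80, 30, 70,   # state A: 10%, 20%, 30% renewable
               50, 50, 50, 50, 50, 50],  # state B: constant 50% renewable
})


def renewable_fraction(df):
    """Share of renewable generation for every (State, year) pair."""
    total = df.groupby(["State", "year"])["amount"].sum()
    renewable = (df[df["Category"] == "Renewable fuels"]
                 .groupby(["State", "year"])["amount"].sum())
    # Pairs without any renewable rows become NaN after division; treat as 0.
    return (renewable / total).fillna(0.0).rename("fraction").reset_index()


def fit_per_state(fractions):
    """Fit one straight line per state; returns {state: (slope, intercept)}."""
    fits = {}
    for state, grp in fractions.groupby("State"):
        slope, intercept = np.polyfit(grp["year"], grp["fraction"], deg=1)
        fits[state] = (slope, intercept)
    return fits


fits = fit_per_state(renewable_fraction(tidy))
for state, (slope, intercept) in fits.items():
    print(state, round(slope, 4), round(slope * 2020 + intercept, 4))
```

To fit groups of similar states instead, one could add a column that maps each state to its group and use that column in place of `"State"` in both helpers.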

Answer 2

Yes, you can use linear regression for forecasting. There are different ways to do so. You can:

1. fit a line to the training data and extrapolate that fitted line into the future (this is sometimes also called the drift method);
2. reduce the problem to a tabular regression problem, splitting the time series into fixed-length windows, stacking them on top of each other, and then using linear regression;
3. use other common trend methods.

Here's what (1) and (2) look like with sktime (disclaimer: I'm one of the developers):

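# Note: these imports follow the sktime API as it was when this answer was
# written (mid-2020); later releases renamed or moved several of them.
import numpy as np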
from sktime.datasets import load_airline
from sktime.forecasting.model_selection import temporal_train_test_split
from sktime.performance_metrics.forecasting import smape_loss
from sktime.forecasting.trend import PolynomialTrendForecaster
from sktime.utils.plotting.forecasting import plot_ys
from sktime.forecasting.compose import ReducedRegressionForecaster
from sklearn.linear_model import LinearRegression

y = load_airline()  # load 1-dimensional time series
y_train, y_test = temporal_train_test_split(y)  

# here I forecast all observations of the test series, 
# in your case you could only select the years you're interested in
fh = np.arange(1, len(y_test) + 1)  

# option 1
forecaster = PolynomialTrendForecaster(degree=1)
forecaster.fit(y_train)
y_pred_1 = forecaster.predict(fh)

# option 2
forecaster = ReducedRegressionForecaster(LinearRegression(), window_length=10)
forecaster.fit(y_train)
y_pred_2 = forecaster.predict(fh)
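To make the reduction in option 2 more concrete, here is a hand-rolled sketch using only NumPy. It is not how sktime implements it; the function names (`make_windows`, `fit_linear`, `forecast`) and the toy series are assumptions made for illustration. The idea is to turn the series into a table whose rows are sliding windows, regress the next value on each window, and then forecast recursively by feeding predictions back in as the newest lags.

```python
import numpy as np


def make_windows(y, window_length):
    """Stack sliding windows as rows of X; the value after each window is its target."""
    X = np.array([y[i:i + window_length] for i in range(len(y) - window_length)])
    targets = y[window_length:]
    return X, targets


def fit_linear(X, targets):
    """Ordinary least squares with an intercept column appended to X."""
    A = np.column_stack([X, np.ones(len(X))])
    coef, *_ = np.linalg.lstsq(A, targets, rcond=None)
    return coef  # lag weights followed by the intercept


def forecast(y, coef, steps):
    """Recursive forecasting: each prediction becomes the newest lag."""
    window_length = len(coef) - 1
    history = list(y[-window_length:])
    preds = []
    for _ in range(steps):
        window = history[-window_length:]
        nxt = float(np.dot(coef[:-1], window) + coef[-1])
        preds.append(nxt)
        history.append(nxt)
    return preds


# Toy series (made-up values): a straight line, which a linear model on lags
# can reproduce.
y = np.arange(10, dtype=float) * 2.0 + 1.0  # 1, 3, 5, ..., 19
X, targets = make_windows(y, window_length=3)
coef = fit_linear(X, targets)
print(forecast(y, coef, steps=3))
```

With only ten yearly observations, as in the question, the window has to be short (here 3), which leaves very few rows to fit on; the plain trend line of option 1 is probably the more robust choice for such a short series.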

Answer 1
1 Accepted 2020-06-10 14:13:51

Answer 2
1 2020-06-10 15:09:49
