Using linear regression to forecast yearly time series data N years ahead

Question

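I have run into a rather unusual problem. I have time series data covering 2009 to 2018, and I need to use it to answer an unusual question.

The data sheet contains energy generation statistics for each Australian state/territory from 2009 to 2018, in GWh (gigawatt hours).

It has the following fields: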


State: Names of the different Australian states.
Fuel_Type: The type of fuel consumed.
Category: Whether a fuel is considered renewable or non-renewable.
Years: The years in which the energy consumption was recorded.

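The question:

How can I use a linear regression model to predict what percentage of the energy generated in state X (say, Victoria) will come from source Y (say, renewable energy) in year Z (say, 2100)?

How should I go about using a linear regression model to solve this? This problem is beyond my abilities.

The data comes from this link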

Answer 1

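I think you first need to consider what your model should look like in the end: you probably want something that relates a dependent variable y (the fraction of renewable energy) to your input features. One of those features should probably be the year, since you are interested in predicting how y changes as you vary this quantity. A very basic linear model would therefore be y = beta1 * x + beta0, where x is the year, beta1 and beta0 are the parameters you want to fit, and y is the fraction of renewable energy. This of course ignores the state component, but I think a simple start would be to fit such a model to the state you are interested in. The code could look like this: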

import matplotlib
matplotlib.use("agg")
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sbn
from scipy.stats import linregress
import numpy as np

def fracRenewable(df):
    """Return the renewable share of the total amount in df."""
    renewable = df.loc[df["Category"] == "Renewable fuels", "amount"].sum()
    return renewable / df["amount"].sum()


# load in data

data = pd.read_csv("./energy_data.csv")

# convert data to tidy format and rename columns
# (this pools all states; filter data on the "State" column first to model one state)
molten = (
    pd.melt(data, id_vars=["State", "Fuel_Type", "Category"])
    .rename(columns={"variable": "year", "value": "amount"})
)

# calculate fraction of renewable fuel per year
grouped = (
    molten.groupby(["year"])
    .apply(fracRenewable)
    .reset_index()
    .rename(columns={0: "amount"})
)
grouped["year"] = grouped["year"].astype(int)

# >>> grouped
#    year    amount
# 0  2009  0.029338
# 1  2010  0.029207
# 2  2011  0.032219
# 3  2012  0.053738
# 4  2013  0.061332
# 5  2014  0.066198
# 6  2015  0.069404
# 7  2016  0.066531
# 8  2017  0.074625
# 9  2018  0.077445

# fit linear model
slope, intercept, r_value, p_value, std_err = linregress(grouped["year"], grouped["amount"])

# plot result
f, ax = plt.subplots()
sbn.scatterplot(x="year", y="amount", ax=ax, data=grouped)
ax.plot(range(2009, 2030), [i*slope + intercept for i in range(2009, 2030)], color="red")
ax.set_title("Renewable fuels (simple prediction)")
ax.set(ylabel="Fraction renewable fuel")
f.savefig("test11.png", bbox_inches="tight")

This gives you a (very simple) model for predicting the fraction of renewable fuels in a given year.
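To connect this back to the question about a year as distant as 2100, here is a minimal sketch that fits the same straight line to the yearly fractions printed in the code comment above and evaluates it at a future year. It uses numpy.polyfit rather than linregress (both are ordinary least-squares fits for a straight line). The helper name predict_fraction and the clipping to the range [0, 1] are assumptions added here, because a straight line eventually leaves the valid range of a fraction.

```python
import numpy as np

# yearly renewable fractions, copied from the "grouped" table shown earlier
years = np.arange(2009, 2019)
fraction = np.array([0.029338, 0.029207, 0.032219, 0.053738, 0.061332,
                     0.066198, 0.069404, 0.066531, 0.074625, 0.077445])

# centre the years before fitting to keep the least-squares problem well scaled
year_mean = years.mean()
slope, intercept_centered = np.polyfit(years - year_mean, fraction, deg=1)


def predict_fraction(year):
    """Evaluate the fitted line at `year` (hypothetical helper).

    The result is clipped to [0, 1]; this is an assumption, not part of
    linear regression itself.
    """
    raw = slope * (year - year_mean) + intercept_centered
    return float(np.clip(raw, 0.0, 1.0))


prediction_2100 = predict_fraction(2100)
print(f"slope per year: {slope:.5f}")
print(f"predicted renewable fraction in 2100: {prediction_2100:.3f}")
```

With only ten data points, a value 80 years out is best read as what the trend line implies rather than as a reliable forecast.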

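If you want to refine the model further, I think a good start would be to group states together by how similar they are (based either on prior knowledge or on a clustering approach) and then make predictions for those groups.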
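One rough way to picture the grouping idea is sketched below: fit a line per state, put states with similar slopes into the same group, and then fit one line per group. Everything in it is an assumption for illustration: the state names and fractions are made up, and rounding the slope is only a crude stand-in for a proper clustering method or for prior knowledge.

```python
import numpy as np

years = np.arange(2009, 2019)

# made-up renewable fractions per state, for illustration only
fractions = {
    "State A": np.linspace(0.05, 0.14, 10),
    "State B": np.linspace(0.10, 0.37, 10),
    "State C": np.linspace(0.06, 0.15, 10),
}


def fit_line(x, y):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    x_mean, y_mean = x.mean(), y.mean()
    slope = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
    return slope, y_mean - slope * x_mean


# step 1: one fit per state
state_fits = {state: fit_line(years, frac) for state, frac in fractions.items()}

# step 2: crude grouping, states whose slope rounds to the same value share a group
groups = {}
for state, (slope, _) in state_fits.items():
    groups.setdefault(round(float(slope), 2), []).append(state)

# step 3: one fit per group, using the mean fraction of the group's members
group_fits = {
    key: fit_line(years, np.mean([fractions[s] for s in members], axis=0))
    for key, members in groups.items()
}

for key, members in groups.items():
    slope, intercept = group_fits[key]
    print(members, "predicted fraction in 2030:", round(slope * 2030 + intercept, 3))
```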

Answer 2

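Yes, you can use linear regression for forecasting. There are different ways of doing so. You can

(1) fit a line to the training data and extrapolate that fitted line into the future, which is sometimes also called the drift method;
(2) reduce the problem to a tabular regression problem by splitting the time series into fixed-length windows and stacking them on top of each other, then use linear regression;
(3) use other common trend methods.

Here is what (1) and (2) look like with sktime (disclaimer: I'm one of the developers):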

# written against sktime as of mid-2020; in later releases
# ReducedRegressionForecaster was replaced by make_reduction
import numpy as np
from sktime.datasets import load_airline
from sktime.forecasting.model_selection import temporal_train_test_split
from sktime.forecasting.trend import PolynomialTrendForecaster
from sktime.forecasting.compose import ReducedRegressionForecaster
from sklearn.linear_model import LinearRegression

y = load_airline()  # load 1-dimensional time series
y_train, y_test = temporal_train_test_split(y)  

# here I forecast all observations of the test series, 
# in your case you could only select the years you're interested in
fh = np.arange(1, len(y_test) + 1)  

# option 1
forecaster = PolynomialTrendForecaster(degree=1)
forecaster.fit(y_train)
y_pred_1 = forecaster.predict(fh)

# option 2
forecaster = ReducedRegressionForecaster(LinearRegression(), window_length=10)
forecaster.fit(y_train)
y_pred_2 = forecaster.predict(fh)
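
To make the windowing idea behind option 2 more concrete, here is a plain NumPy sketch of the reduction: slide a fixed-length window over the series, use each window as a feature row with the next value as its target, fit a linear model, and forecast recursively by feeding predictions back in. This is not sktime's implementation; the function names and the toy series are assumptions made for illustration.

```python
import numpy as np


def make_windows(y, window_length):
    """Stack sliding windows as rows of X; the target is the value after each window."""
    X = np.array([y[i:i + window_length] for i in range(len(y) - window_length)])
    return X, y[window_length:]


def fit_linear(X, target):
    """Least-squares fit of target on the window values plus an intercept."""
    A = np.column_stack([X, np.ones(len(X))])
    coef, *_ = np.linalg.lstsq(A, target, rcond=None)
    return coef


def forecast_recursive(y, coef, window_length, steps):
    """Predict one step at a time, appending each prediction to the history."""
    history = list(y)
    preds = []
    for _ in range(steps):
        window = np.array(history[-window_length:])
        next_value = float(window @ coef[:-1] + coef[-1])
        preds.append(next_value)
        history.append(next_value)
    return np.array(preds)


# toy series with an exact linear trend: 1, 3, 5, ..., 39
y = np.arange(20, dtype=float) * 2.0 + 1.0
X, target = make_windows(y, window_length=3)
coef = fit_linear(X, target)
preds = forecast_recursive(y, coef, window_length=3, steps=5)
print(preds)
```

On a series this short, such as ten yearly values, the window length has to be small, since each window uses up that many observations before the first training row exists.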

2 answers

Answer 1
1 accepted 2020-06-10 14:13:51

Answer 2
1 2020-06-10 15:09:49
