Using linear regression on yearly time series data to forecast N years ahead

Question

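I have run into a rather unusual problem. I have time series data covering 2009 to 2018, and I need to use it to answer a rather odd question.

The data sheet contains electricity generation statistics for each Australian state/territory from 2009 to 2018, in GWh (gigawatt hours).

It has the following fields: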


State: Names of the different Australian states.
Fuel_Type: The type of fuel consumed.
Category: Whether a fuel is considered renewable or non-renewable.
Years: The years for which the energy consumption is recorded.

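Question:

How can I use a linear regression model to predict what percentage of the energy generation of state X, say Victoria, will come from source Y, say renewable energy, in year Z, say 2100?

How should I use a linear regression model to solve this? The problem is beyond my reach.

The data comes from this link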

Answer 1

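I think you first need to consider what your model should look like in the end: you probably want something that relates the dependent variable y (the fraction of renewable energy) to your input features. One of those features should probably be the year, since you are interested in predicting how y changes as you vary this quantity. A very basic linear model could therefore be y = beta1 * x + beta0, where x is the year, beta1 and beta0 are the parameters you want to fit, and y is the fraction of renewables. This of course ignores the state component, but I think a simple start would be to fit such a model to the state you are interested in. The code could look like this: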

import matplotlib
matplotlib.use("agg")
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sbn
from scipy.stats import linregress
import numpy as np

def fracRenewable(df):
    # fraction of the total amount that comes from renewable fuels
    renewable = df.loc[df["Category"] == "Renewable fuels", "amount"].sum()
    return renewable / df["amount"].sum()


# load in data

data = pd.read_csv("./energy_data.csv")

# convert data to tidy format and rename columns
molten = (
    pd.melt(data, id_vars=["State", "Fuel_Type", "Category"])
    .rename(columns={"variable": "year", "value": "amount"})
)

# calculate fraction of renewable fuel per year
grouped = (
    molten.groupby(["year"])
    .apply(fracRenewable)
    .reset_index()
    .rename(columns={0: "amount"})
)
grouped["year"] = grouped["year"].astype(int)

# >>> grouped
#    year    amount
# 0  2009  0.029338
# 1  2010  0.029207
# 2  2011  0.032219
# 3  2012  0.053738
# 4  2013  0.061332
# 5  2014  0.066198
# 6  2015  0.069404
# 7  2016  0.066531
# 8  2017  0.074625
# 9  2018  0.077445

# fit linear model
slope, intercept, r_value, p_value, std_err = linregress(grouped["year"], grouped["amount"])

# plot result
f, ax = plt.subplots()
sbn.scatterplot(x="year", y="amount", ax=ax, data=grouped)
ax.plot(range(2009, 2030), [i*slope + intercept for i in range(2009, 2030)], color="red")
ax.set_title("Renewable fuels (simple prediction)")
ax.set(ylabel="Fraction renewable fuel")
f.savefig("test11.png", bbox_inches="tight")

This gives you a (very simple) model to predict the fraction of renewable fuels in a given year.

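If you want to refine the model further, I think a good start would be to group the states by how similar they are (based either on prior knowledge or on a clustering approach) and then make the predictions for those groups.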
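A simpler step in that direction, short of clustering, is to fit one line per state. The sketch below assumes a long-format frame with the same column names as `molten` in the code above. The numbers, the two state names used as rows, and the "Non-renewable fuels" label are placeholders made up for illustration and are not taken from the real data set. `np.polyfit` stands in for `linregress` here, and `predict` is a hypothetical helper.

```python
import numpy as np
import pandas as pd

# Placeholder long-format data shaped like `molten`; the values are invented.
molten = pd.DataFrame({
    "State": ["Victoria"] * 6 + ["Tasmania"] * 6,
    "Category": ["Renewable fuels", "Non-renewable fuels"] * 6,
    "year": [2016, 2016, 2017, 2017, 2018, 2018] * 2,
    "amount": [10.0, 90.0, 15.0, 85.0, 20.0, 80.0,
               80.0, 20.0, 85.0, 15.0, 90.0, 10.0],
})

# Renewable amount and total amount per (State, year).
renewable = (
    molten[molten["Category"] == "Renewable fuels"]
    .groupby(["State", "year"])["amount"]
    .sum()
)
total = molten.groupby(["State", "year"])["amount"].sum()

# Fraction of renewables per state and year; state-years without any
# renewable rows become 0 instead of NaN.
fraction = (renewable / total).fillna(0.0).rename("fraction").reset_index()

# Fit fraction = slope * year + intercept separately for every state.
fits = {}
for state, group in fraction.groupby("State"):
    slope, intercept = np.polyfit(group["year"], group["fraction"], 1)
    fits[state] = (slope, intercept)


def predict(state, year):
    """Evaluate the fitted line of one state for a given year."""
    slope, intercept = fits[state]
    return slope * year + intercept


print(predict("Victoria", 2019))
```

With the real data, the same grouping could be applied to clusters of similar states instead of single states by replacing the "State" column with a cluster label.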

Answer 2

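Yes, you can use linear regression for forecasting. There are different ways to do so. You can

(1) fit a line to the training data and extrapolate that fitted line into the future, which is sometimes also called the drift method;
(2) reduce the problem to a tabular regression problem by splitting the time series into fixed-length windows and stacking them on top of each other, then use linear regression;
(3) use other common trend methods.

Here is what (1) and (2) look like with sktime (disclaimer: I'm one of the developers):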

# written against the sktime release current in mid-2020;
# later releases have renamed or moved some of these names
import numpy as np
from sktime.datasets import load_airline
from sktime.forecasting.model_selection import temporal_train_test_split
from sktime.forecasting.trend import PolynomialTrendForecaster
from sktime.forecasting.compose import ReducedRegressionForecaster
from sklearn.linear_model import LinearRegression

y = load_airline()  # load 1-dimensional time series
y_train, y_test = temporal_train_test_split(y)  

# here I forecast all observations of the test series, 
# in your case you could only select the years you're interested in
fh = np.arange(1, len(y_test) + 1)  

# option 1
forecaster = PolynomialTrendForecaster(degree=1)
forecaster.fit(y_train)
y_pred_1 = forecaster.predict(fh)

# option 2
# (later sktime releases provide make_reduction in place of ReducedRegressionForecaster)
forecaster = ReducedRegressionForecaster(LinearRegression(), window_length=10)
forecaster.fit(y_train)
y_pred_2 = forecaster.predict(fh)
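To make the reduction in option 2 more concrete, here is a minimal NumPy-only sketch of the idea: slide a fixed-length window over the series, stack the windows as rows of a table, regress the next value on each window, and then forecast recursively by feeding predictions back in. This is not how sktime implements it internally. The helper names (`make_windows`, `fit_linear`, `forecast`), the recursive strategy, and the toy series are assumptions chosen for illustration.

```python
import numpy as np


def make_windows(y, window_length):
    """Stack sliding windows of y as rows; the value after each window is its target."""
    X = np.array([y[i:i + window_length] for i in range(len(y) - window_length)])
    target = np.asarray(y[window_length:])
    return X, target


def fit_linear(X, target):
    """Ordinary least squares with an intercept, solved via lstsq."""
    A = np.column_stack([X, np.ones(len(X))])
    coef, *_ = np.linalg.lstsq(A, target, rcond=None)
    return coef  # window weights followed by the intercept


def forecast(y, coef, window_length, steps):
    """Recursive forecast: each prediction is appended and used in the next window."""
    history = list(y)
    preds = []
    for _ in range(steps):
        window = np.array(history[-window_length:])
        nxt = float(window @ coef[:-1] + coef[-1])
        preds.append(nxt)
        history.append(nxt)
    return preds


# Toy series 1, 2, ..., 10 (illustrative only).
y = [float(v) for v in range(1, 11)]
X, target = make_windows(y, window_length=3)
coef = fit_linear(X, target)
preds = forecast(y, coef, window_length=3, steps=2)
print(preds)
```

With only ten yearly observations, as in the question, the window has to be short, and every extra lag leaves fewer rows to fit on, so the plain trend line of option 1 is likely the more practical choice there.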


Answer 1
1 accepted 2020-06-10 14:13:51

Answer 2
1 2020-06-10 15:09:49
