I have a dataframe that spans across 5 years, with ~500 companies and several fundamental stats (eg sales, # employees, ROA). Here's an example of how this could look like. Note, all numbers are just completely randomly picked, apart form the Year, obviously.
Name | Year | Sales | Size | ROA |
---|---|---|---|---|
Firm A | 2020 | 857 | 12000 | 0.45 |
Firm B | 2020 | 112 | 3500 | 0.32 |
Firm C | 2020 | 666 | 7000 | 0.44 |
Firm A | 2019 | 860 | 12000 | 0.47 |
Firm B | 2019 | 150 | 3000 | 0.31 |
Firm C | 2019 | 700 | 6000 | 0.44 |
... | ... | ... | ... | ... |
Firm A | 2015 | 560 | 10000 | 0.47 |
Firm B | 2015 | 100 | 2000 | 0.31 |
Firm C | 2015 | 300 | 4000 | 0.44 |
How would you suggest I try to forecast the 2021 ROA for each firm , taking the span of 5 years (2015 - 2020) into consideration? I tried toying around with the forecast
package. However, I haven't found a way to do a bulk action for all firms. My hope would be to end up with something like this:
Name | Year | predicted ROA |
---|---|---|
Firm A | 2021 | 0.50 |
Firm B | 2021 | 0.35 |
Firm C | 2021 | 0.43 |
I'd be super grateful for any leads!
I like to use mgcv::gam for forecasting.
I used the simplest possible model where ROA only depends on the Name and a smooth function of the Year.
You'll want to increase k, depending on how much data you have (default is 10).
The by
variable is used to split the model by Name.
df <- structure(list(Name = c("Firm A", "Firm B", "Firm C", "Firm A",
"Firm B", "Firm C", "Firm A", "Firm B", "Firm C"),
Year = c(2020L, 2020L, 2020L, 2019L, 2019L, 2019L, 2015L, 2015L, 2015L),
Sales = c(857L, 112L, 666L, 860L, 150L, 700L, 560L, 100L, 300L),
Size = c(12000L, 3500L, 7000L, 12000L, 3000L, 6000L, 10000L, 2000L, 4000L),
ROA = c(0.45, 0.32, 0.44, 0.47, 0.31, 0.44, 0.47, 0.31, 0.44)),
row.names = c(NA, -9L), class = "data.frame")
gamfit <- mgcv::gam(formula = ROA ~ Name + s(Year, k = 3, by = as.factor(Name)), data = df)
summary(gamfit)
predict_df <- data.frame(Name = sort(unique(df$Name)),
Year = 2021L)
predict_df$ROA <- predict(gamfit, newdata = predict_df)
predict_df
Name Year ROA
1 Firm A 2021 0.3969841
2 Firm B 2021 0.4098413
3 Firm C 2021 0.4055556
The fable package was designed for this sort of thing. Here is an artificial example that mimics the data structure in the question.
library(tidyverse)
library(fable)
# Synthetic data
df <- tibble(
Name = rep(paste("Firm",c("A","B","C")),6),
Year = rep(2015:2020, rep(3,6)),
ROA = runif(18)
)
# Turn it into a tsibble object
df_ts <- df %>%
as_tsibble(index=Year, key=Name)
# Forecast each firm
fc <- df_ts %>%
model(ARIMA(ROA)) %>%
forecast(h=1)
fc
#> # A fable: 3 x 5 [1Y]
#> # Key: Name, .model [3]
#> Name .model Year ROA .mean
#> <chr> <chr> <dbl> <dist> <dbl>
#> 1 Firm A ARIMA(ROA) 2021 N(0.52, 0.14) 0.517
#> 2 Firm B ARIMA(ROA) 2021 N(0.59, 0.07) 0.587
#> 3 Firm C ARIMA(ROA) 2021 N(0.52, 0.11) 0.522
Created on 2021-10-26 by the reprex package (v2.0.1)
Here I have used an ARIMA model, but many other models could be used instead. See my textbook at https://OTexts.com/fpp3 for many examples using fable with ARIMA and other models.
Actually there are tons of possibilities how to do this. The following solution of mine might be slightly overkill and not the ideal way to predict your problem, but is a mere representation of a scalable model workflow for time series prediction.
Check out the code below, if it gives you some interesting results and let me know. Once you got used to the tidymodels stack and the modeltime framework, this kind of data will become easy to process.
suppressPackageStartupMessages(library(tidyverse))
suppressPackageStartupMessages(library(tidymodels))
suppressPackageStartupMessages(library(modeltime))
suppressPackageStartupMessages(library(modeltime.ensemble))
#### DATA
h = 1
data <- data.frame(
id = rep(paste("Firm",c("A","B","C")),6),
date = rep(2015:2020, rep(3,6)),
value = runif(18)
)
data <- data %>%
pivot_wider(names_from = id, values_from = value)
data <- reshape2::melt(data, id.var='date')
dates <- ymd("2015-01-01")+ years(0:5)
dates <- rep(dates,3)
data$date <- dates
names(data)[2] = "id"
data <- data %>%
group_by(id) %>%
future_frame(
.length_out = h,
.bind_data = TRUE) %>%
ungroup() %>%
as_tibble()
# training- and test set
data_splits <- time_series_split(data, assess = "1 year", cumulative = TRUE)
#### PREDICT
model_fit_glmnet <- linear_reg(penalty = 1) %>%
set_engine("glmnet") %>%
fit(value ~ ., data = training(data_splits))
model_fit_xgboost <- boost_tree("regression", learn_rate = 0.35) %>%
set_engine("xgboost") %>%
fit(value ~ ., data = training(data_splits))
ensemble <- modeltime_table(
model_fit_glmnet,
model_fit_xgboost
) %>%
ensemble_weighted(loadings = c(4, 6))
model_tbl <- modeltime_table(ensemble)
forecast <-
model_tbl %>%
modeltime_forecast(
new_data = testing(data_splits),
actual_data = data,
keep_data = T
) %>%
group_by(id)
# change layout
forecast <- forecast %>% filter(str_detect(.key, "prediction")
)
forecast <- forecast[,c(4,5,6)]
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.