How do I forecast a time series with multiple companies in R?

Question

I have a dataframe that spans across 5 years, with ~500 companies and several fundamental stats (eg sales, # employees, ROA). Here's an example of how this could look like. Note, all numbers are just completely randomly picked, apart form the Year, obviously.

Name	Year	Sales	Size	ROA
Firm A	2020	857	12000	0.45
Firm B	2020	112	3500	0.32
Firm C	2020	666	7000	0.44
Firm A	2019	860	12000	0.47
Firm B	2019	150	3000	0.31
Firm C	2019	700	6000	0.44
...	...	...	...	...
Firm A	2015	560	10000	0.47
Firm B	2015	100	2000	0.31
Firm C	2015	300	4000	0.44

How would you suggest I try to forecast the 2021 ROA for each firm , taking the span of 5 years (2015 - 2020) into consideration? I tried toying around with the forecast package. However, I haven't found a way to do a bulk action for all firms. My hope would be to end up with something like this:

Name	Year	predicted ROA
Firm A	2021	0.50
Firm B	2021	0.35
Firm C	2021	0.43

I'd be super grateful for any leads!

Answer 1

I like to use mgcv::gam for forecasting.
I used the simplest possible model where ROA only depends on the Name and a smooth function of the Year.
You'll want to increase k, depending on how much data you have (default is 10).
The by variable is used to split the model by Name.

df <- structure(list(Name = c("Firm A", "Firm B", "Firm C", "Firm A", 
                        "Firm B", "Firm C", "Firm A", "Firm B", "Firm C"), 
                     Year = c(2020L, 2020L, 2020L, 2019L, 2019L, 2019L, 2015L, 2015L, 2015L), 
                     Sales = c(857L, 112L, 666L, 860L, 150L, 700L, 560L, 100L, 300L), 
                     Size = c(12000L, 3500L, 7000L, 12000L, 3000L, 6000L, 10000L, 2000L, 4000L), 
                     ROA = c(0.45, 0.32, 0.44, 0.47, 0.31, 0.44, 0.47, 0.31, 0.44)), 
                row.names = c(NA, -9L), class = "data.frame")
gamfit <- mgcv::gam(formula = ROA ~ Name + s(Year, k = 3, by = as.factor(Name)), data = df)
summary(gamfit)
predict_df <- data.frame(Name = sort(unique(df$Name)), 
                         Year = 2021L)
predict_df$ROA <- predict(gamfit, newdata = predict_df)

predict_df
    Name Year       ROA
1 Firm A 2021 0.3969841
2 Firm B 2021 0.4098413
3 Firm C 2021 0.4055556

Answer 2

The fable package was designed for this sort of thing. Here is an artificial example that mimics the data structure in the question.

library(tidyverse)
library(fable)
# Synthetic data
df <- tibble(
  Name = rep(paste("Firm",c("A","B","C")),6),
  Year = rep(2015:2020, rep(3,6)),
  ROA = runif(18)
)
# Turn it into a tsibble object
df_ts <- df %>%
  as_tsibble(index=Year, key=Name)
# Forecast each firm
fc <- df_ts %>%
  model(ARIMA(ROA)) %>%
  forecast(h=1)
fc
#> # A fable: 3 x 5 [1Y]
#> # Key:     Name, .model [3]
#>   Name   .model      Year           ROA .mean
#>   <chr>  <chr>      <dbl>        <dist> <dbl>
#> 1 Firm A ARIMA(ROA)  2021 N(0.52, 0.14) 0.517
#> 2 Firm B ARIMA(ROA)  2021 N(0.59, 0.07) 0.587
#> 3 Firm C ARIMA(ROA)  2021 N(0.52, 0.11) 0.522

^{Created on 2021-10-26 by the reprex package (v2.0.1)}

Here I have used an ARIMA model, but many other models could be used instead. See my textbook at https://OTexts.com/fpp3 for many examples using fable with ARIMA and other models.

Answer 3

Actually there are tons of possibilities how to do this. The following solution of mine might be slightly overkill and not the ideal way to predict your problem, but is a mere representation of a scalable model workflow for time series prediction.

Check out the code below, if it gives you some interesting results and let me know. Once you got used to the tidymodels stack and the modeltime framework, this kind of data will become easy to process.

suppressPackageStartupMessages(library(tidyverse))
suppressPackageStartupMessages(library(tidymodels))
suppressPackageStartupMessages(library(modeltime))
suppressPackageStartupMessages(library(modeltime.ensemble))


#### DATA

h = 1

data <- data.frame(
  id = rep(paste("Firm",c("A","B","C")),6),
  date = rep(2015:2020, rep(3,6)),
  value = runif(18)
)

data <- data %>% 
  pivot_wider(names_from = id, values_from = value)

data <- reshape2::melt(data, id.var='date')
dates <- ymd("2015-01-01")+ years(0:5)
dates <- rep(dates,3)
data$date <- dates
names(data)[2] = "id"

data <- data %>%
  group_by(id) %>%
  future_frame(
    .length_out = h,
    .bind_data  = TRUE) %>%
  ungroup() %>% 
  as_tibble() 

# training- and test set
data_splits <- time_series_split(data, assess = "1 year", cumulative = TRUE)



#### PREDICT

model_fit_glmnet <- linear_reg(penalty = 1) %>%
  set_engine("glmnet") %>%
  fit(value ~ ., data = training(data_splits))

model_fit_xgboost <- boost_tree("regression",  learn_rate = 0.35) %>%
  set_engine("xgboost") %>%
  fit(value ~ ., data = training(data_splits))

ensemble <- modeltime_table(
  model_fit_glmnet,
  model_fit_xgboost
) %>%
  ensemble_weighted(loadings = c(4, 6)) 

model_tbl <- modeltime_table(ensemble)

forecast <-
  model_tbl %>%
  modeltime_forecast(
    new_data    = testing(data_splits),
    actual_data = data,
    keep_data = T
  ) %>%
  group_by(id)  


# change layout
forecast <- forecast %>% filter(str_detect(.key,  "prediction")
)
forecast <- forecast[,c(4,5,6)]

How do I forecast a time series with multiple companies in R?

Question

3 answers

solution1
0 2021-10-23 23:32:23

solution2
0 2021-10-25 23:25:17

solution3
0 2021-11-03 17:02:50

How do I forecast a time series with multiple companies in R?

Question

3 answers

solution1 0 2021-10-23 23:32:23

solution2 0 2021-10-25 23:25:17

solution3 0 2021-11-03 17:02:50

solution1
0 2021-10-23 23:32:23

solution2
0 2021-10-25 23:25:17

solution3
0 2021-11-03 17:02:50