简体   繁体   中英

How do I forecast a time series with multiple companies in R?

I have a dataframe that spans across 5 years, with ~500 companies and several fundamental stats (eg sales, # employees, ROA). Here's an example of how this could look like. Note, all numbers are just completely randomly picked, apart form the Year, obviously.

Name Year Sales Size ROA
Firm A 2020 857 12000 0.45
Firm B 2020 112 3500 0.32
Firm C 2020 666 7000 0.44
Firm A 2019 860 12000 0.47
Firm B 2019 150 3000 0.31
Firm C 2019 700 6000 0.44
... ... ... ... ...
Firm A 2015 560 10000 0.47
Firm B 2015 100 2000 0.31
Firm C 2015 300 4000 0.44

How would you suggest I try to forecast the 2021 ROA for each firm , taking the span of 5 years (2015 - 2020) into consideration? I tried toying around with the forecast package. However, I haven't found a way to do a bulk action for all firms. My hope would be to end up with something like this:

Name Year predicted ROA
Firm A 2021 0.50
Firm B 2021 0.35
Firm C 2021 0.43

I'd be super grateful for any leads!

I like to use mgcv::gam for forecasting.
I used the simplest possible model where ROA only depends on the Name and a smooth function of the Year.
You'll want to increase k, depending on how much data you have (default is 10).
The by variable is used to split the model by Name.

df <- structure(list(Name = c("Firm A", "Firm B", "Firm C", "Firm A", 
                        "Firm B", "Firm C", "Firm A", "Firm B", "Firm C"), 
                     Year = c(2020L, 2020L, 2020L, 2019L, 2019L, 2019L, 2015L, 2015L, 2015L), 
                     Sales = c(857L, 112L, 666L, 860L, 150L, 700L, 560L, 100L, 300L), 
                     Size = c(12000L, 3500L, 7000L, 12000L, 3000L, 6000L, 10000L, 2000L, 4000L), 
                     ROA = c(0.45, 0.32, 0.44, 0.47, 0.31, 0.44, 0.47, 0.31, 0.44)), 
                row.names = c(NA, -9L), class = "data.frame")
gamfit <- mgcv::gam(formula = ROA ~ Name + s(Year, k = 3, by = as.factor(Name)), data = df)
summary(gamfit)
predict_df <- data.frame(Name = sort(unique(df$Name)), 
                         Year = 2021L)
predict_df$ROA <- predict(gamfit, newdata = predict_df)
predict_df
    Name Year       ROA
1 Firm A 2021 0.3969841
2 Firm B 2021 0.4098413
3 Firm C 2021 0.4055556

The fable package was designed for this sort of thing. Here is an artificial example that mimics the data structure in the question.

library(tidyverse)
library(fable)
# Synthetic data
df <- tibble(
  Name = rep(paste("Firm",c("A","B","C")),6),
  Year = rep(2015:2020, rep(3,6)),
  ROA = runif(18)
)
# Turn it into a tsibble object
df_ts <- df %>%
  as_tsibble(index=Year, key=Name)
# Forecast each firm
fc <- df_ts %>%
  model(ARIMA(ROA)) %>%
  forecast(h=1)
fc
#> # A fable: 3 x 5 [1Y]
#> # Key:     Name, .model [3]
#>   Name   .model      Year           ROA .mean
#>   <chr>  <chr>      <dbl>        <dist> <dbl>
#> 1 Firm A ARIMA(ROA)  2021 N(0.52, 0.14) 0.517
#> 2 Firm B ARIMA(ROA)  2021 N(0.59, 0.07) 0.587
#> 3 Firm C ARIMA(ROA)  2021 N(0.52, 0.11) 0.522

Created on 2021-10-26 by the reprex package (v2.0.1)

Here I have used an ARIMA model, but many other models could be used instead. See my textbook at https://OTexts.com/fpp3 for many examples using fable with ARIMA and other models.

Actually there are tons of possibilities how to do this. The following solution of mine might be slightly overkill and not the ideal way to predict your problem, but is a mere representation of a scalable model workflow for time series prediction.

Check out the code below, if it gives you some interesting results and let me know. Once you got used to the tidymodels stack and the modeltime framework, this kind of data will become easy to process.

suppressPackageStartupMessages(library(tidyverse))
suppressPackageStartupMessages(library(tidymodels))
suppressPackageStartupMessages(library(modeltime))
suppressPackageStartupMessages(library(modeltime.ensemble))


#### DATA

h = 1

data <- data.frame(
  id = rep(paste("Firm",c("A","B","C")),6),
  date = rep(2015:2020, rep(3,6)),
  value = runif(18)
)

data <- data %>% 
  pivot_wider(names_from = id, values_from = value)

data <- reshape2::melt(data, id.var='date')
dates <- ymd("2015-01-01")+ years(0:5)
dates <- rep(dates,3)
data$date <- dates
names(data)[2] = "id"

data <- data %>%
  group_by(id) %>%
  future_frame(
    .length_out = h,
    .bind_data  = TRUE) %>%
  ungroup() %>% 
  as_tibble() 

# training- and test set
data_splits <- time_series_split(data, assess = "1 year", cumulative = TRUE)



#### PREDICT

model_fit_glmnet <- linear_reg(penalty = 1) %>%
  set_engine("glmnet") %>%
  fit(value ~ ., data = training(data_splits))

model_fit_xgboost <- boost_tree("regression",  learn_rate = 0.35) %>%
  set_engine("xgboost") %>%
  fit(value ~ ., data = training(data_splits))

ensemble <- modeltime_table(
  model_fit_glmnet,
  model_fit_xgboost
) %>%
  ensemble_weighted(loadings = c(4, 6)) 

model_tbl <- modeltime_table(ensemble)

forecast <-
  model_tbl %>%
  modeltime_forecast(
    new_data    = testing(data_splits),
    actual_data = data,
    keep_data = T
  ) %>%
  group_by(id)  


# change layout
forecast <- forecast %>% filter(str_detect(.key,  "prediction")
)
forecast <- forecast[,c(4,5,6)]

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM