简体   繁体   English

将带有产品标签的数据框转换为时间序列,以便在 R 中进行预测

[英]Convert data frame with product labels to time series for forecasting in R

I am working on a project regarding demand forecasting for pharmaceuticals.我正在做一个关于药品需求预测的项目。

So first of all:所以首先:

The raw import data which used includes information of;使用的原始导入数据包括以下信息; import dates, labels of active pharmaceutical ingredient (Product Code), and quantities in an excel table.进口日期、活性药物成分标签(产品代码)和 excel 表格中的数量。 For example:例如:

Import Date进口日期 Product Code产品代码 Quantity数量
14/09/2018 14/09/2018 1 1 300 300
18/06/2019 18/06/2019 1 1 9400 9400
18/06/2019 18/06/2019 1 1 5430 5430
05/06/2019 05/06/2019 2 2 7000 7000
17/09/2018 17/09/2018 3 3 2300 2300

First of all i need to merge the same dated and same labelled entries, for example, there is only one importation on 18/06/2019 for product labelled as "1".首先,我需要合并相同日期和相同标签的条目,例如,2019 年 6 月 18 日只有一个进口商品被标记为“1”。 Also i need to convert the data frame to time series, sorted by dates and with 'Product Code' as a character and 'Quantity' as numeric.此外,我需要将数据框转换为时间序列,按日期排序,“产品代码”作为字符,“数量”作为数字。 When i try with the following code, the result is the following:当我尝试使用以下代码时,结果如下:

# Pacman, version 0.4.1 is used to organize the packages used in R.

library(pacman)

p_load(readxl, readr, ggplot2, forecast, fpp2, tidyverse, TTR, dplyr, zoo, xts)

dat <- read_xlsx("C://Users/oozgen/Desktop/Albania Statistical Analysis/labelleddata.xlsx", col_names=T)

# We need to format the 'Import Date' column as Date format (Year/Month), and sort by Date

orderdates <- as.Date(dat$`Import Date`)

# Since labeling is done numerically, we need to convert the 'Product Code' column from numeric format to character format in R.

dat <- transform(dat, 'Product Code' = as.character(dat$`Product Code`), 'Import Date' = orderdates)

dat_ts <- as.xts(x = dat[, -1], order.by = dat$Import.Date)

head(dat_ts)
Product.Code产品代码 Quantity数量
2018-01-04 2018-01-04 "784" “784” " 29976" “29976”
2018-01-04 2018-01-04 "1149" “1149” " 200" “200”
2018-01-05 2018-01-05 "306" “306” " 1000" “1000”
2018-01-05 2018-01-05 "713" “713” " 50" “50”
2018-01-05 2018-01-05 "744" “744” " 5040" “5040”
2018-01-05 2018-01-05 "744" “744” " 5040" “5040”

Both 'Product Code' and 'Quantity' are in character format. “产品代码”和“数量”都是字符格式。 Also you might have noticed, it will be difficult to forecast for separate products on a single table.此外,您可能已经注意到,很难预测单个表上的不同产品。 Do i have to separate all products to different tables and forecast them separately?我是否必须将所有产品分开到不同的表格并分别进行预测?

You can access to a sample data set from this link*.您可以从此链接访问样本数据集*。

First Edit, i have worked on the solution which @g-grothendieck shared:第一次编辑,我研究了@g-grothendieck 分享的解决方案:

# Pacman, version 0.4.1 is used to organize the packages used in R.

library(pacman)

p_load(readxl, readr, ggplot2, forecast, fpp3, tidyverse, TTR, dplyr, zoo, xts)

dat <- read_xlsx("C://Users/oozgen/Desktop/Albania Statistical Analysis/labelleddata.xlsx", col_names=T)

# We need to format the 'Import Date' column as Date format (Year/Month), and sort by Date. And also since labeling is done numerically, we need to convert the 'Product Code' column from numeric format to character format in R.

dat <- transform(dat, 'Product Code' = as.character(dat$`Product Code`), 'Import Date' = as.yearmon(dat$`Import Date`, "%Y/%m"))

class(dat)

[1] "data.frame"

My data is tidied up as this, as data.frame:我的数据整理成这样,作为data.frame:

Import.Date导入日期 Product.Code产品代码 Quantity数量
1 1 Sep 2018 2018 年 9 月 1 1 3000 3000
2 2 Mar 2019 2019 年 3 月 1 1 600 600
3 3 Mar 2019 2019 年 3 月 1 1 930 930
4 4 Jun 2019 2019 年 6 月 1 1 2202 2202
5 5 Jun 2019 2019 年 6 月 1 1 5900 5900
6 6 Jun 2019 2019 年 6 月 1 1 5630 5630

and then i continue but get this error:然后我继续但收到此错误:

> x <- dat |>
+   type.convert(as.is = TRUE) |>
+   read.zoo(format = "%Y/%m", split = "Product Code", aggregate = sum) |>
+   as.xts()
Error in `[.data.frame`(rval, , split) : undefined columns selected

An addition to my First Edit (edit v1.1.);对我的第一次编辑的补充(编辑 v1.1。); @g-grothendieck showed, i've made a mistake with a column name, and corrected it. @g-grothendieck 表明,我在列名上犯了一个错误,并更正了它。 And the code follows as this:代码如下:

x <- dat |>
  type.convert(as.is = TRUE) |>
  read.zoo(format = "%Y/%m", split = "Product.Code", aggregate = sum) |>
  as.xts()

and i received the following error:我收到以下错误:

...
+   as.xts()
Error in read.zoo(type.convert(dat, as.is = TRUE), format = "%Y/%m", split = "Product.Code",  : 
  index has 33038 bad entries at data rows: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 ...

An addition to my First Edit's Edit (edit v1.2.);对我的 First Edit 的 Edit (edit v1.2.) 的补充; with the help of @g-grothendieck, the code has finally compiled with 'Product Code's separately, using the zoo package and the as.yearmon command:在@g-grothendieck 的帮助下,代码最终与“产品代码”分开编译,使用 zoo 包和as.yearmon命令:

x <- dat |>
  type.convert(as.is = TRUE) |>
  read.zoo(format = "%b %Y", FUN = as.yearmon,  split = "Product.Code", aggregate = sum) |>
  as.xts()

x[is.na(x)] <- 0

tt <- merge(x, zoo(, seq(start(x), end(x), 1/12))) |>
  na.spline() |>
  as.ts()

and the output data is time series, with a tidier fashion.并且输出数据是时间序列,更整洁。 Also i can manipulate different product group with different frequencies using as.yearqtr (also addition from @g-grothendieck)我也可以使用as.yearqtr操作不同频率的不同产品组(也来自@g-grothendieck)

For the 2nd Edit, i have converted date to month/year and followed the codes which @rob-hyndman presented:对于第二次编辑,我已将日期转换为月/年并遵循@rob-hyndman 提供的代码:

# Pacman, version 0.4.1 is used to organize the packages used in R.

library(pacman)

p_load(readxl, readr, ggplot2, forecast, fpp3, tidyverse, TTR, dplyr, zoo, janitor, xts)

dat <- read_xlsx("C://Users/oozgen/Desktop/Albania Statistical Analysis/labelleddata.xlsx", col_names=T)

# We need to format the 'Import Date' column as Date format (Year/Month), and sort by Date. And also since labeling is done numerically, we need to convert the 'Product Code' column from numeric format to character format in R.

dat <- transform(dat, 'Product Code' = as.character(dat$`Product Code`), 'Import Date' = as.yearmon(dat$`Import Date`, "%Y/%m"))

class(dat)

dat[is.na(dat)]=0

write.csv(dat,"C:/Users/oozgen/Downloads/data2.csv", row.names = FALSE)

x <- readr::read_csv(
  "C:/Users/oozgen/Downloads/data2.csv",
  name_repair = janitor::make_clean_names
) %>%
  mutate(
    # Interpret dates
    import_date = dmy(import_date),
    # Format product codes as character strings with leading zeros
    product_code = sprintf("%03d",product_code)
  ) %>%
  # Make date/code combinations unique
  group_by(import_date, product_code) %>%
  summarise(quantity = sum(quantity), .groups="drop") %>%
  # Create tsibble
  as_tsibble(index=import_date, key=product_code) %>%
  # Fill missing dates with zeros
  fill_gaps(quantity = 0)

and i received this error:我收到了这个错误:

...
+   fill_gaps(quantity = 0)
Rows: 33038 Columns: 3                                                                                                                  
-- Column specification -----------------------------------------------------------------------------
Delimiter: ","
chr (1): import_date
dbl (2): product_code, quantity

i Use `spec()` to retrieve the full column specification for this data.
i Specify the column types or set `show_col_types = FALSE` to quiet this message.
Error: Column `import_date` (index) must not contain `NA`.
Run `rlang::last_error()` to see where the error occurred.
In addition: Warning message:
Problem with `mutate()` column `import_date`.
i `import_date = dmy(import_date)`.
i All formats failed to parse. No formats found. 

that i can't continue on forecasting...我无法继续预测...

For my 2nd Edit's Edit (edit v2.1.);对于我的第二次编辑(编辑 v2.1.); @rob-hyndman explained that the error happens because i've edited the date before using the code, so the function will no longer be able to interpret the dates. @rob-hyndman 解释说,发生错误是因为我在使用代码之前编辑了日期,因此该函数将无法再解释日期。 I've fixed it my code as following:我已经修复了我的代码如下:

dat <- readr::read_csv(
  "C://Users/oozgen/Desktop/Albania Statistical Analysis/labelleddata.csv",
  name_repair = janitor::make_clean_names
) %>%
  mutate(
    # Interpret dates
    import_date = quarter(dmy(import_date), with_year = T),
    # Format product codes as character strings with leading zeros
    product_code = sprintf("%03d", product_code)
  ) %>%
  # Make date/code combinations unique
  group_by(import_date, product_code) %>%
  summarise(quantity = sum(quantity), .groups="drop") %>%
  # Create tsibble
  as_tsibble(index=import_date, key=product_code) %>%
  # Fill missing dates with zeros
  fill_gaps(quantity = 0)

It has compiled without errors, you might have noticed.它编译没有错误,你可能已经注意到了。 I've re-edited this line;我重新编辑了这一行; import_date = quarter(dmy(import_date), with_year = T), because import aren't frequent, most of the pharmaceuticals imported three or four times in year, so i've changed the dates to/as quarterly. import_date = quarter(dmy(import_date), with_year = T),因为进口不频繁,大部分药品一年进口三四次,所以我把日期改为/作为季度。 and then i proceed to forecast as @rob-hyndman suggested:然后我按照@rob-hyndman 的建议继续预测:

fc <- dat %>%
  model(croston = CROSTON(quantity)) %>%
  forecast(h = "12 quarters")
Error: Problem with `mutate()` column `croston`.
i `croston = (function (object, ...) ...`.
x invalid class “Period” object: periods must have integer values
Run `rlang::last_error()` to see where the error occurred.
In addition: Warning message:
195 errors (1 unique) encountered for croston
[195] At least two non-zero values are required to use Croston's method.

And i've received that error... Any methods that i can forecast with quarterly data.我收到了那个错误......我可以用季度数据预测的任何方法。

3rd and the FINAL EDIT:第三次和最终编辑:

With the help of brute force, i've successfully forecasted all of the data at once:在蛮力的帮助下,我一次成功地预测了所有数据:

# Pacman, version 0.4.1 is used to organize the packages used in R.

library(pacman)

p_load(readxl, readr, ggplot2, forecast, fpp3, tidyverse, TTR, tibble, tsibble, tsibbledata, feasts, fable, dplyr, zoo, lubridate, janitor, xts)

dat <- read_xlsx("C://Users/oozgen/Desktop/Albania Statistical Analysis/dat.xlsx", col_names=T)

# We need to format the 'Import Date' column as Date format (Year/Month), and sort by Date. And also since labeling is done numerically, we need to convert the 'Product Code' column from numeric format to character format in R.

dat <- transform(dat, 'Product Code' = as.character(dat$`Product Code`))

class(dat)

z <- dat %>%
  type.convert(as.is = TRUE) %>%
  read.zoo(format = "%Y-%m-%d", FUN = as.yearqtr, index.column = 1,
           split = "Product.Code", aggregate = sum)

tt <- merge(z, zoo(, seq(start(z), end(z), 1/4))) |>
  as.ts()

tt[is.na(tt)]=0

result_matrix <- matrix(, nrow = 20, ncol = 1149)

for(i in 1:1149){
  
  res_i <- croston(tt[, i], h=8)
    res_i <- append(res_i$x, res_i$mean)
      result_matrix[, i] <- res_i
  
  }

write.table(result_matrix, "resultmatrix.csv", sep = ";")

Since croston method for forecasting is valid for intermittent demand.由于croston 预测方法适用于间歇性需求。 I've to organize the data according to infrequently imported pharmaceuticals and frequently imported pharmaceuticals.我必须根据不常进口的药品和经常进口的药品来组织数据。 And for the frequent ones, i should find a suitable method for forecasting.对于频繁出现的,我应该找到合适的预测方法。

This will be the last entry for this post.这将是这篇文章的最后一个条目。 I'll probably post the data set and the related codes and algorithm in github.我大概会把数据集和相关的代码和算法贴在github上。 Many thanks to @sevgib, @g-grothendieck and @rob-hyndman for their help.非常感谢@sevgib、@g-grothendieck 和@rob-hyndman 的帮助。

PS It would be great if there could be a way to use deep learning algorithms for demand forecasting of pharmaceutical products. PS 如果有一种方法可以使用深度学习算法进行医药产品的需求预测,那就太好了。 There was a study which included Facebook AI packages for pandemic forecasting, i think this one*有一项研究包括用于大流行预测的 Facebook AI 包,我认为这个*

Using DF shown in the Note at the end, convert the columns that should be numeric to numeric, use read.zoo using yearmon class for the index, splitting on Product Code (check that the format and column name are specified exactly for your data if different from that assumed in the Note at the end) and aggregating using sum and convert to zoo.使用末尾注释中显示的 DF,将应该是数字的列转换为数字,使用read.zoo使用 yearmon 类作为索引,拆分产品代码(检查是否为您的数据准确指定了格式和列名,如果与最后的注释中假设的不同)并使用 sum 聚合并转换为 zoo。 At the end interpolate by merging z with a zero width series having every month and then filling in the missing values using na.spline (or na.approx or na.locf ) and then convert to ts class since a regularly spaced ts series is needed by many forecasting routines.最后通过将 z 与每个月的零宽度系列合并,然后使用na.spline (或na.approxna.locf )填充缺失值进行na.spline ,然后转换为 ts 类,因为需要定期间隔的ts系列通过许多预测程序。 If you do want an xts series then use as.xts(z) or replace the as.ts() with as.xts() if you still want the interpolation.如果您确实想要 xts 系列,则使用 as.xts(z) 或将as.ts()替换为as.xts()如果您仍然想要插值。 Read ?strptime for the percent codes, ?read.zoo and vignette("zoo-read") for read.zoo , ?yearmon for info on that class and ?type.convert for that function.?strptime的百分比码?read.zoovignette("zoo-read")read.zoo?yearmon对这个类和信息?type.convert该功能。

library(xts) # also pulls in zoo

z <- DF |>
  type.convert(as.is = TRUE) |>
  read.zoo(format = "%d/%m/%Y", FUN = as.yearmon, 
    split = "Product Code", aggregate = sum) 

tt <- merge(z, zoo(, seq(start(z), end(z), 1/12))) |>
  na.spline() |>
  as.ts()

giving:给予:

> tt
                 1    2    3
Sep 2018   300.000 7000 2300
Oct 2018  1914.444 7000 2300
Nov 2018  3528.889 7000 2300
Dec 2018  5143.333 7000 2300
Jan 2019  6757.778 7000 2300
Feb 2019  8372.222 7000 2300
Mar 2019  9986.667 7000 2300
Apr 2019 11601.111 7000 2300
May 2019 13215.556 7000 2300
Jun 2019 14830.000 7000 2300

Note笔记

DF <- structure(list(`Import Date` = c("14/09/2018", "18/06/2019", 
"18/06/2019", "05/06/2019", "17/09/2018"), `Product Code` = c("1", 
"1", "1", "2", "3"), Quantity = c("300", "9400", "5430", "7000", 
"2300")), row.names = c(NA, -5L), class = "data.frame")

The simplest way to do this would be with the fable package, which is designed to handle multiple time series at once, and integrates with the tidyverse collection of packages.最简单的方法是使用 fable 包,它旨在一次处理多个时间序列,并与 tidyverse 包集合集成。 Here is an example using your sample data.这是使用您的示例数据的示例。

library(fpp3)
#> ── Attaching packages ─────────────────────────────────────── fpp3 0.4.0.9000 ──
#> ✓ tibble      3.1.4          ✓ tsibble     1.0.1     
#> ✓ dplyr       1.0.7          ✓ tsibbledata 0.3.0.9000
#> ✓ tidyr       1.1.3          ✓ feasts      0.2.2.9000
#> ✓ lubridate   1.7.10         ✓ fable       0.3.1.9000
#> ✓ ggplot2     3.3.5
#> ── Conflicts ───────────────────────────────────────────────── fpp3_conflicts ──
#> x lubridate::date()    masks base::date()
#> x dplyr::filter()      masks stats::filter()
#> x tsibble::intersect() masks base::intersect()
#> x tsibble::interval()  masks lubridate::interval()
#> x dplyr::lag()         masks stats::lag()
#> x tsibble::setdiff()   masks base::setdiff()
#> x tsibble::union()     masks base::union()
# Read data and clean up
dat <- readr::read_csv(
    "~/Downloads/albaniaingredient_result.xlsx - Sheet1.csv",
    name_repair = janitor::make_clean_names
  ) %>%
  mutate(
    # Interpret dates
    import_date = dmy(import_date),
    # Format product codes as character strings with leading zeros
    product_code = sprintf("%03d",product_code)
  ) %>%
  # Make date/code combinations unique
  group_by(import_date, product_code) %>%
  summarise(quantity = sum(quantity), .groups="drop") %>%
  # Create tsibble
  as_tsibble(index=import_date, key=product_code) %>%
  # Fill missing dates with zeros
  fill_gaps(quantity = 0)
#> Rows: 3086 Columns: 3
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr (1): import_date
#> dbl (2): product_code, quantity
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Produce daily forecasts
fc <- dat %>%
  model(croston = CROSTON(quantity)) %>%
  forecast(h = "14 days")
#> Warning: 16 errors (1 unique) encountered for croston
#> [16] At least two non-zero values are required to use Croston's method.
# forecasts for product 014
fc %>%
  filter(product_code == "014")
#> # A fable: 14 x 5 [1D]
#> # Key:     product_code, .model [1]
#>    product_code .model  import_date quantity .mean
#>    <chr>        <chr>   <date>        <dist> <dbl>
#>  1 014          croston 2020-12-04  4773.584 4774.
#>  2 014          croston 2020-12-05  4773.584 4774.
#>  3 014          croston 2020-12-06  4773.584 4774.
#>  4 014          croston 2020-12-07  4773.584 4774.
#>  5 014          croston 2020-12-08  4773.584 4774.
#>  6 014          croston 2020-12-09  4773.584 4774.
#>  7 014          croston 2020-12-10  4773.584 4774.
#>  8 014          croston 2020-12-11  4773.584 4774.
#>  9 014          croston 2020-12-12  4773.584 4774.
#> 10 014          croston 2020-12-13  4773.584 4774.
#> 11 014          croston 2020-12-14  4773.584 4774.
#> 12 014          croston 2020-12-15  4773.584 4774.
#> 13 014          croston 2020-12-16  4773.584 4774.
#> 14 014          croston 2020-12-17  4773.584 4774.
fc %>%
  filter(product_code == "014") %>%
  autoplot(filter(dat, year(import_date) >= 2020))

Created on 2021-09-23 by the reprex package (v2.0.1)reprex 包(v2.0.1) 于 2021 年 9 月 23 日创建

I've used Croston's method because your data has far too many zeros for most methods.我使用了 Croston 的方法,因为对于大多数方法,您的数据有太多的零。 If you aggregated the data to weekly or monthly, you would then be able to use more reliable methods.如果您将数据聚合到每周或每月,那么您将能够使用更可靠的方法。

A textbook discussing the fable package is freely available online at OTexts.com/fpp3 OTexts.com/fpp3在线免费提供讨论寓言包的教科书

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM