How to sum on different intervals to find multi year peaks
I am trying to find the historical peak in sales of items over consecutive multi-year periods. My problem is that some items were sold in the past and have since been discontinued, but they still need to be part of the analysis.
I've worked through some for loops in R, but I am unsure how to sum up multiple consecutive years and compare each sum against the other local maxima within the same dataset. For example:
Year  Item         Sales
2001  Trash Can      100
2002  Trash Can      125
2003  Trash Can       90
2004  Trash Can       97
2002  Red Balloon     23
2003  Red Balloon    309
2004  Red Balloon     67
2005  Red Balloon      8
1998  Blue Bottle    600
1999  Blue Bottle    565
Based on the above data, if I wanted to calculate the 2 year peak of sales, I would want to output Blue Bottle 1165 (sum of 1998 and 1999), Red Balloon 376 (sum of 2003 and 2004) and Trash Can 225 (sum of 2001 and 2002).
If I wanted to calculate the 3 year peak of sales, I would want to output Red Balloon 399 (sum of 2002 to 2004) and Trash Can 315 (sum of 2001 to 2003). Blue Bottle would be ineligible for a 3 year peak because it only has 2 years of data.
In SQL, you can use window functions. The queries below treat consecutive rows as consecutive years, so they assume each item has one row per year with no gaps. For eligible 2 year sales:
select item, two_year_sales, year
from (select t.*,
             sum(sales) over (partition by item order by year rows between 1 preceding and current row) as two_year_sales,
             row_number() over (partition by item order by year) as seqnum
      from t
     ) t
where seqnum >= 2;  -- keep only rows that have a full 2 year window
And to get the peak:
select t.*
from (select item, two_year_sales, year,
             max(two_year_sales) over (partition by item) as max_two_year_sales
      from (select t.*,
                   sum(sales) over (partition by item order by year rows between 1 preceding and current row) as two_year_sales,
                   row_number() over (partition by item order by year) as seqnum
            from t
           ) t
      where seqnum >= 2
     ) t
where two_year_sales = max_two_year_sales;
A solution in R using the tidyverse and RcppRoll:
#Loading the packages and your data as a `tibble`
library("RcppRoll")
library("dplyr")
tbl <- tribble(
~Year, ~Item, ~Sales,
2001, "Trash Can", 100,
2002, "Trash Can", 125,
2003, "Trash Can", 90,
2004, "Trash Can", 97,
2002, "Red Balloon", 23,
2003, "Red Balloon", 309,
2004, "Red Balloon", 67,
2005, "Red Balloon", 8,
1998, "Blue Bottle", 600,
1999, "Blue Bottle", 565
)
# Set the number of consecutive years
n <- 2
# Compute the rolling sums (assumes data to be sorted) and take max
res <- tbl %>%
group_by(Item) %>%
mutate(rollingsum = roll_sumr(Sales, n)) %>%
summarize(best_sum = max(rollingsum, na.rm = TRUE))
print(res)
## A tibble: 3 x 2
# Item best_sum
# <chr> <dbl>
#1 Blue Bottle 1165
#2 Red Balloon 376
#3 Trash Can 225
Setting n <- 3 yields a different res. Blue Bottle has no complete 3 year window, so max(..., na.rm = TRUE) returns -Inf for it:
print(res)
## A tibble: 3 x 2
# Item best_sum
# <chr> <dbl>
#1 Blue Bottle -Inf
#2 Red Balloon 399
#3 Trash Can 315
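The -Inf row could be avoided by dropping items that have fewer than n years of data. A minimal base R sketch of that idea follows; the names best_window and peak_base are made up for illustration, and like the code above it assumes each item has one row per year with no gaps.

```r
# Data from the question as a base R data frame
dat <- data.frame(
  Year  = c(2001, 2002, 2003, 2004, 2002, 2003, 2004, 2005, 1998, 1999),
  Item  = rep(c("Trash Can", "Red Balloon", "Blue Bottle"), times = c(4, 4, 2)),
  Sales = c(100, 125, 90, 97, 23, 309, 67, 8, 600, 565)
)

# Hypothetical helper: largest sum of k consecutive entries of x,
# or NA when x has fewer than k entries.
# Assumption: x is ordered by year and has no missing years.
best_window <- function(x, k) {
  if (length(x) < k) return(NA_real_)
  cs <- cumsum(c(0, x))                       # prefix sums
  max(cs[(k + 1):length(cs)] - cs[1:(length(cs) - k)])
}

# Hypothetical wrapper: one row per eligible item
peak_base <- function(data, k) {
  data <- data[order(data$Item, data$Year), ]  # sort within each item
  best <- sapply(split(data$Sales, data$Item), best_window, k = k)
  best <- best[!is.na(best)]                   # drop ineligible items
  data.frame(Item = names(best), Sum = as.numeric(best), row.names = NULL)
}

peak_base(dat, 2)
peak_base(dat, 3)
```

With k = 3, Blue Bottle is simply absent from the result instead of appearing with -Inf.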
I can only help you with the SQL part. Use GROUP BY with HAVING. HAVING filters out every item that does not have the specified minimum number of years of historical data.
Check whether this query meets your requirements.
SELECT
item
, count(*) as num_years
, sum(Sales) as local_max
from [your_table]
where year between [year_ini] and [year_end]
group by item
having count(*) >= [number_of_years]
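Note that the query totals one fixed window, [year_ini] to [year_end]. To get a peak, it would presumably have to be run once per candidate start year, keeping the largest total per item. The following base R sketch stands in for that loop; window_totals and peak_by_windows are hypothetical names, and it assumes k is no larger than the span of years in the data.

```r
# Data from the question as a base R data frame
dat <- data.frame(
  Year  = c(2001, 2002, 2003, 2004, 2002, 2003, 2004, 2005, 1998, 1999),
  Item  = rep(c("Trash Can", "Red Balloon", "Blue Bottle"), times = c(4, 4, 2)),
  Sales = c(100, 125, 90, 97, 23, 309, 67, 8, 600, 565)
)

# Hypothetical helper mirroring the query for one window [y0, y0 + k - 1]
window_totals <- function(data, y0, k) {
  items  <- factor(data$Item)                           # fixed set of groups
  in_win <- data$Year >= y0 & data$Year <= y0 + k - 1   # WHERE year BETWEEN ...
  n      <- as.vector(table(items[in_win]))             # count(*)
  total  <- as.vector(tapply(data$Sales[in_win], items[in_win], sum))  # sum(Sales)
  keep   <- n >= k                                      # HAVING count(*) >= k
  data.frame(Item = levels(items)[keep],
             Start = rep(y0, sum(keep)),
             Total = total[keep])
}

# Hypothetical wrapper: try every start year, keep the best total per item
peak_by_windows <- function(data, k) {
  starts <- min(data$Year):(max(data$Year) - k + 1)
  all_w  <- do.call(rbind, lapply(starts, function(y0) window_totals(data, y0, k)))
  aggregate(Total ~ Item, data = all_w, FUN = max)
}

peak_by_windows(dat, 3)
```

Because each window covers exactly k calendar years, requiring k rows inside it also rules out windows that contain a missing year.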
Read the data dat (shown reproducibly in the Note at the end) into a zoo series with one column per Item and then convert it to a ts series tt (which will fill in the missing years with NA). Then use rollsumr to take the sums of every k consecutive years for each Item, find the maximum value for each Item, stack the result into a data frame and omit any NA rows. The function Max is like max(x, na.rm = TRUE) except that if x is all NAs it returns NA instead of -Inf and does not issue a warning. stack outputs the item column second, so reverse the columns using 2:1 and add nicer names.
library(zoo)
Max <- function(x) if (all(is.na(x))) NA else max(x, na.rm = TRUE)
peak <- function(data, k) {
tt <- as.ts(read.zoo(data, split = "Item"))
s <- na.omit(stack(apply(rollsumr(tt, k), 2, Max)))
setNames(s[2:1], c("Item", "Sum"))
}
peak(dat, 2)
## Item Sum
## 1 Blue Bottle 1165
## 2 Red Balloon 376
## 3 Trash Can 225
peak(dat, 3)
## Item Sum
## 2 Red Balloon 399
## 3 Trash Can 315
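Filling in missing years with NA matters once an item skips a year, because adjacent rows are then no longer consecutive years. A small base R sketch of the effect, using a made-up series with a gap in 2003 (these numbers are not from the question's data):

```r
# Hypothetical item with a missing year (2003); illustrative values only
gap <- data.frame(Year = c(2001, 2002, 2004), Sales = c(10, 20, 30))

# Fill in the missing years with NA, similar to what the ts conversion does
full <- merge(data.frame(Year = min(gap$Year):max(gap$Year)), gap, all.x = TRUE)

# Sum every k consecutive calendar years; a window containing NA sums to NA
k <- 2
n <- nrow(full)
sums <- sapply(seq_len(n - k + 1), function(i) sum(full$Sales[i:(i + k - 1)]))

sums                      # only the 2001-2002 window is complete
max(sums, na.rm = TRUE)   # 30, not the 50 a row-based window over 2002 and 2004 would give
```

A rolling sum over rows alone would pair 2002 with 2004 and report 50, which is not a sum over two consecutive years.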
Note: the input in reproducible form is assumed to be:
dat <-
structure(list(Year = c(2001L, 2002L, 2003L, 2004L, 2002L, 2003L,
2004L, 2005L, 1998L, 1999L), Item = c("Trash Can", "Trash Can",
"Trash Can", "Trash Can", "Red Balloon", "Red Balloon", "Red Balloon",
"Red Balloon", "Blue Bottle", "Blue Bottle"), Sales = c(100L,
125L, 90L, 97L, 23L, 309L, 67L, 8L, 600L, 565L)), row.names = c(NA,
-10L), class = "data.frame")