Calculate a 3-month moving average from non-aggregated data

I have a bunch of orders. Each order is either a type A or a type B order. I want a 3-month moving average of the time it takes to ship orders of each type. How can I aggregate this order data into what I want using Redshift or Postgres SQL?

Start with this:

order_id  order_type  ship_date   time_to_ship
1         a           2021-12-25  100
2         b           2021-12-31  110
3         a           2022-01-01  200
4         a           2022-01-01  50
5         b           2022-01-15  110
6         a           2022-02-02  100
7         a           2022-02-28  300
8         b           2022-04-05  75
9         b           2022-04-06  210
10        a           2022-04-15  150

Note: Some months have no shipments. The solution should allow for this.

I want this:

order_type  ship_month  mma3_time_to_ship
a           2022-02-01  150
a           2022-04-01  160
b           2022-04-01  126.25

Here, a 3-month moving average is only calculated for months with at least 2 preceding months that have shipments. Each record is an order type-month. The ship_month column denotes the month of shipment (Redshift represents months as the date of the first of the month).

Here's how the mma3_time_to_ship column is calculated, expressed as Excel-like formulas:

150 = AVERAGE(100, 200, 50, 100, 300) <- The average for all A orders in Dec, Jan, and Feb.

160 = AVERAGE(200, 50, 100, 300, 150) <- The average for all A orders in Jan, Feb, Apr (no orders in March)

126.25 = AVERAGE(110, 110, 75, 210) <- The average for all B orders in Dec, Jan, Apr (no B orders in Feb, no orders at all in Mar)

My attempt doesn't aggregate it into monthly data and 3-month averages (this query runs without error in Redshift):

SELECT
  order_type,
  DATE_TRUNC('month', ship_date) AS ship_month,
  AVG(time_to_ship) OVER (
    PARTITION BY
      order_type,
      ship_month
    ORDER BY ship_date
    ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
  ) AS avg_time_to_ship
FROM tbl

Is what I want possible?

This is honestly a complete stab in the dark, so it won't surprise me if it's not correct... but it seems to me you can accomplish this with a self join using a range of dates within the join.

select
  t1.order_type, t1.ship_date, avg (t2.time_to_ship) as mma3_time_to_ship
from
  tbl t1
  join tbl t2 on
    t1.order_type = t2.order_type and
    t2.ship_date between t1.ship_date - interval '3 months' and t1.ship_date
group by
  t1.order_type, t1.ship_date

The results don't match your example, but then I'm not entirely sure where they came from anyway.

Perhaps this will be the catalyst towards an eventual solution or at least an idea to start.

This is Pg12, by the way. Not sure if it will work on Redshift.

-- EDIT --

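Per your updates, I was able to match your three results identically. I used dense_rank to number the months that actually have shipments for each order type, so the current month and the two closest preceding months with orders are the ones averaged, and empty months are skipped: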

with foo as (
  select
    order_type, date_trunc ('month', ship_date)::date as ship_month,
    time_to_ship, dense_rank() over (partition by order_type order by date_trunc ('month', ship_date)) as dr
  from tbl
)
select
  f1.order_type, f1.ship_month,
  avg (f2.time_to_ship),
  array_agg (f2.time_to_ship)
from
  foo f1
  join foo f2 on
    f1.order_type = f2.order_type and
    f2.dr between f1.dr - 2 and f1.dr
group by
  f1.order_type, f1.ship_month

Results:

b   2022-01-01  110.0000000000000000    {110,110}
a   2022-01-01  116.6666666666666667    {100,50,200,100,50,200}
b   2022-04-01  126.2500000000000000    {110,110,75,210,110,110,75,210}
b   2021-12-01  110.0000000000000000    {110}
a   2021-12-01  100.0000000000000000    {100}
a   2022-02-01  150.0000000000000000    {100,50,200,100,300,100,50,200,100,300}
a   2022-04-01  160.0000000000000000    {50,200,100,300,150}

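There are some dupes in the array elements because f1 has one row per order rather than one per month, so every f2 row is repeated once for each order in f1's month. Each value in a group is repeated the same number of times, so it doesn't impact the averages. I'm sure that part could be fixed.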
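One way to fix it, as a rough sketch rather than a definitive answer: reduce the driving side of the join to one row per order type and month, and keep only months with at least two earlier months of shipments (dr >= 3). The months CTE and the dr >= 3 filter are my own additions, and the inline tbl with its VALUES list is just the sample data from the question so the query runs on its own in Postgres. I have not checked this on Redshift; there you would drop the inline tbl and point the query at the real table.

```sql
-- Sample data from the question, inlined only so the query is self-contained.
with tbl (order_id, order_type, ship_date, time_to_ship) as (
  values
    (1,  'a', date '2021-12-25', 100),
    (2,  'b', date '2021-12-31', 110),
    (3,  'a', date '2022-01-01', 200),
    (4,  'a', date '2022-01-01', 50),
    (5,  'b', date '2022-01-15', 110),
    (6,  'a', date '2022-02-02', 100),
    (7,  'a', date '2022-02-28', 300),
    (8,  'b', date '2022-04-05', 75),
    (9,  'b', date '2022-04-06', 210),
    (10, 'a', date '2022-04-15', 150)
),
foo as (
  -- One row per order, tagged with its month and that month's rank
  -- among the months that have shipments for the same order type.
  select
    order_type,
    date_trunc('month', ship_date)::date as ship_month,
    time_to_ship,
    dense_rank() over (
      partition by order_type
      order by date_trunc('month', ship_date)
    ) as dr
  from tbl
),
months as (
  -- One row per order type and month, so the join below
  -- cannot repeat an order more than once per result row.
  select distinct order_type, ship_month, dr
  from foo
)
select
  m.order_type,
  m.ship_month,
  avg(f.time_to_ship) as mma3_time_to_ship
from months m
join foo f
  on f.order_type = m.order_type
 and f.dr between m.dr - 2 and m.dr   -- current month plus the two before it
where m.dr >= 3                        -- needs at least 2 preceding months
group by m.order_type, m.ship_month
order by m.order_type, m.ship_month;
```

On the sample data this should leave just the three rows from the question: a / 2022-02-01 averaging 150, a / 2022-04-01 averaging 160, and b / 2022-04-01 averaging 126.25.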