简体   繁体   中英

Getting counts for overlapping time periods

I have a table data in PostgreSQL with this structure:

created_at.       customer_email               status
2020-12-31        xxx@gmail.com                opened
...
2020-12-24        yyy@gmail.com                delivered
2020-12-24        xxx@gmail.com                opened
...
2020-12-17        zzz@gmail.com                opened
2020-12-10        xxx@gmail.com                opened
2020-12-03        hhh@gmail.com                enqueued
2020-11-27        xxx@gmail.com                opened
...
2020-11-20        rrr@gmail.com                opened
2020-11-13        ttt@gmail.com                opened

There are many rows for each day.

Basically I need 2021-W01 for this week with the count of unique emails with status "opened" within the last 90 days. Likewise for every week before that.

Desired output:

period    active
2021-W01  1539
2020-W53  1480
2020-W52  1630
2020-W51  1820
2020-W50  1910
2020-W49  1890
2020-W48  2000

How can I do that?

Window functions would come to mind. Alas, those don't allow DISTINCT aggregations.

Instead, get distinct counts from a LATERAL subquery:

WITH weekly_dist AS (
   SELECT DISTINCT date_trunc('week', created_at) AS wk, customer_email
   FROM   tbl
   WHERE  status = 'opened'
   )
SELECT to_char(t.wk, 'YYYY"-W"IW') AS period, ct.active
FROM  (
   SELECT generate_series(date_trunc('week', min(created_at) + interval '1 week')
                        , date_trunc('week', now()::timestamp)
                        , interval '1 week') AS wk   
   FROM   tbl
   ) t
LEFT   JOIN LATERAL (
   SELECT count(DISTINCT customer_email) AS active
   FROM   weekly_dist d
   WHERE  d.wk >= t.wk - interval '91 days'
   AND    d.wk <  t.wk
   ) ct ON true;

db<>fiddle here

I operate with timestamp , not timestamptz , might make a corner case difference.

The CTE weekly_dist reduces the set to distinct "opened" emails. This step is strictly optional, but increases performance significantly if there can be more than a few duplicates per week.

The derived table t generates a timestamp for the begin of each week since the earliest entry in the table up to "now". This way I make sure no week is skipped,even if there are no rows for it. See:

But I do skip the first week since I count active emails before each start of the week.

Then LEFT JOIN LATERAL to a subquery computing the distinct count for the 90-day time-range. To be precise, I deduct 91 days, and exclude the start of the current week. This happens to fall in line with the weekly pre-aggregated data from the CTE. Be wary of that if you shift bounds.

Finally, to_char(t.wk, 'YYYY"-W"IW') is a compact expression to get your desired format for week numbers. Details in the manual here .

You can combine the date_part() function with a group by like this:

SELECT
  DATE_PART('year', created_at)::varchar || '-W' || DATE_PART('week', created_at)::varchar,
  SUM(CASE WHEN status = 'opened' THEN 1 ELSE 0 END)
FROM
  your_table
GROUP BY 1
ORDER BY created_at DESC

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM