
Redshift GROUP BY time interval

Currently, I have the following raw data in Redshift:

timestamp                   ,lead
==================================
"2008-04-09 10:02:01.000000",true
"2008-04-09 10:03:05.000000",true
"2008-04-09 10:31:07.000000",true
"2008-04-09 11:00:05.000000",false
...

So, I would like to generate aggregated data at an interval of 30 minutes. My desired outcome is:

timestamp                   ,count
==================================
"2008-04-09 10:00:00.000000",2
"2008-04-09 10:30:00.000000",1
"2008-04-09 11:00:00.000000",0
...

I referred to https://stackoverflow.com/a/12046382/3238864 , which is valid for PostgreSQL.

I tried to mimic the posted code by using:

with thirty_min_intervals as (
    select
      (select min(timestamp)::date from events) + ( n    || ' minutes')::interval start_time,
      (select min(timestamp)::date from events) + ((n+30) || ' minutes')::interval end_time
    from generate_series(0, (24*60), 30) n
)
select count(CASE WHEN lead THEN 1 END) from events e
right join thirty_min_intervals f
on e.timestamp >= f.start_time and e.timestamp < f.end_time
group by f.start_time, f.end_time
order by f.start_time;

However, I'm getting the error:

[0A000] ERROR: Specified types or functions (one per INFO message) not supported on Redshift tables.

May I know what a good way is to perform aggregated data calculation at N-minute intervals in Redshift?
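
Note: generate_series is a leader-node-only function in Redshift, so it cannot be joined against user tables such as events; that is what the 0A000 error above is reporting. If rows for empty intervals are not required, the bucketing can also be done directly with date arithmetic. A minimal sketch, assuming the events table and column names from the question:

select date_trunc('hour', "timestamp")
         + ((datepart(minutes, "timestamp") / 30) * interval '30 minutes') as interval_start
      ,count(case when lead then 1 end) as lead_count
from events
group by 1
order by 1;

This truncates each event to its half-hour bucket, but intervals with zero events (such as 11:00:00 in the desired output) will simply be absent, which is why the answers below build a complete time dimension to LEFT JOIN against.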

Joe's answer is a great, neat solution. I feel one should always consider how the data is distributed and sorted when working in Redshift; it can have a dramatic impact on performance.

Building on Joe's great answer, I will materialise the sample events. In practice the events will already be in a table.

drop table if exists public.temporary_events;
create table public.temporary_events AS 
select ts::timestamp as ts 
    ,lead 
from 
(   SELECT '2017-02-16 10:02:01'::timestamp as ts, true::boolean  as lead
    UNION ALL SELECT '2017-02-16 10:03:05'::timestamp as ts, true::boolean  as lead
    UNION ALL SELECT '2017-02-16 10:31:07'::timestamp as ts, true::boolean  as lead
    UNION ALL SELECT '2017-02-16 11:00:05'::timestamp as ts, false::boolean as lead) s
;

Now run explain:

explain 
WITH time_dimension
AS (SELECT  dtm
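           /* Subtract the seconds elapsed since the most recent half-hour boundary: ((minutes*60) % 1800) + seconds */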
           ,dtm - ((DATEPART(SECONDS,dtm) + (DATEPART(MINUTES,dtm)*60) % 1800) * INTERVAL '1 second') AS dtm_half_hour
    FROM /* Create a series of timestamp. 1 per second working backwards from NOW(). */
         /*  NB: `sysdate` could be substituted for an arbitrary ending timestamp */
         (SELECT DATE_TRUNC('SECONDS',sysdate) - (n * INTERVAL '1 second') AS dtm
          FROM /* Generate a number sequence of 100,000 values from a large internal table */
               (SELECT  ROW_NUMBER() OVER () AS n FROM stl_scan LIMIT 100000) rn) rn)

SELECT dtm_half_hour
      ,COUNT(CASE WHEN lead THEN 1 END)
FROM      time_dimension td
LEFT JOIN public.temporary_events e
       ON td.dtm = e.ts
WHERE td.dtm_half_hour BETWEEN '2017-02-16 09:30:00' AND '2017-02-16 11:00:00'
GROUP BY 1
-- ORDER BY 1 Just to simplify the job a little

The output is:

XN HashAggregate  (cost=999999999999999967336168804116691273849533185806555472917961779471295845921727862608739868455469056.00..999999999999999967336168804116691273849533185806555472917961779471295845921727862608739868455469056.00 rows=1 width=9)
  ->  XN Hash Left Join DS_DIST_BOTH  (cost=0.05..999999999999999967336168804116691273849533185806555472917961779471295845921727862608739868455469056.00 rows=1 width=9)
        Outer Dist Key: ('2018-11-27 17:00:35'::timestamp without time zone - ((rn.n)::double precision * '00:00:01'::interval))
        Inner Dist Key: e."ts"
        Hash Cond: ("outer"."?column2?" = "inner"."ts")
        ->  XN Subquery Scan rn  (cost=0.00..14.95 rows=1 width=8)
              Filter: (((('2018-11-27 17:00:35'::timestamp without time zone - ((n)::double precision * '00:00:01'::interval)) - ((((("date_part"('minutes'::text, ('2018-11-27 17:00:35'::timestamp without time zone - ((n)::double precision * '00:00:01'::interval))) * 60) % 1800) + "date_part"('seconds'::text, ('2018-11-27 17:00:35'::timestamp without time zone - ((n)::double precision * '00:00:01'::interval)))))::double precision * '00:00:01'::interval)) <= '2017-02-16 11:00:00'::timestamp without time zone) AND ((('2018-11-27 17:00:35'::timestamp without time zone - ((n)::double precision * '00:00:01'::interval)) - ((((("date_part"('minutes'::text, ('2018-11-27 17:00:35'::timestamp without time zone - ((n)::double precision * '00:00:01'::interval))) * 60) % 1800) + "date_part"('seconds'::text, ('2018-11-27 17:00:35'::timestamp without time zone - ((n)::double precision * '00:00:01'::interval)))))::double precision * '00:00:01'::interval)) >= '2017-02-16 09:30:00'::timestamp without time zone))
              ->  XN Limit  (cost=0.00..1.95 rows=130 width=0)
                    ->  XN Window  (cost=0.00..1.95 rows=130 width=0)
                          ->  XN Network  (cost=0.00..1.30 rows=130 width=0)
                                Send to slice 0
                                ->  XN Seq Scan on stl_scan  (cost=0.00..1.30 rows=130 width=0)
        ->  XN Hash  (cost=0.04..0.04 rows=4 width=9)
              ->  XN Seq Scan on temporary_events e  (cost=0.00..0.04 rows=4 width=9)

Kablamo!

As Joe says, you may well use this pattern merrily without issue. However, once your data gets sufficiently large or your SQL logic gets complex, you may want to optimise. If for no other reason, you might like to understand the explain plan when you add more SQL logic into your code.

Three areas we can look at:

  1. The join. Make the join between both sets of data work on materialised columns of the same datatype. Here we join a stored timestamp to an expression computed from a timestamp and an interval.
  2. Data distribution. Materialise and distribute both tables by timestamp.
  3. Data sorting. If events is sorted by this timestamp, and the time dimension is sorted by both timestamps, then you may be able to complete the entire query using a merge join, with no data movement and no sending of data to the leader node for aggregation.

Observe:

drop table if exists public.temporary_time_dimension;
create table public.temporary_time_dimension
distkey(dtm) sortkey(dtm, dtm_half_hour)
AS (SELECT  dtm::timestamp as dtm
           ,dtm - ((DATEPART(SECONDS,dtm) + (DATEPART(MINUTES,dtm)*60) % 1800) * INTERVAL '1 second') AS dtm_half_hour
    FROM /* Create a series of timestamp. 1 per second working backwards from NOW(). */
         /*  NB: `sysdate` could be substituted for an arbitrary ending timestamp */
         (SELECT DATE_TRUNC('SECONDS',sysdate) - (n * INTERVAL '1 second') AS dtm         
          FROM /* Generate a number sequence of 100,000 values from a large internal table */
               (SELECT  ROW_NUMBER() OVER () AS n FROM stl_scan LIMIT 100000) rn) rn)
;               

drop table if exists public.temporary_events;
create table public.temporary_events 
distkey(ts) sortkey(ts)
AS 
select ts::timestamp as ts 
    ,lead 
from 
(   SELECT '2017-02-16 10:02:01'::timestamp as ts, true::boolean  as lead
    UNION ALL SELECT '2017-02-16 10:03:05'::timestamp as ts, true::boolean  as lead
    UNION ALL SELECT '2017-02-16 10:31:07'::timestamp as ts, true::boolean  as lead
    UNION ALL SELECT '2017-02-16 11:00:05'::timestamp as ts, false::boolean as lead) s
;

explain 
SELECT 
     dtm_half_hour
    ,COUNT(CASE WHEN lead THEN 1 END)
FROM public.temporary_time_dimension td
LEFT JOIN public.temporary_events e
       ON td.dtm = e.ts
WHERE td.dtm_half_hour BETWEEN '2017-02-16 09:30:00' AND '2017-02-16 11:00:00'
GROUP BY 1
--order by dtm_half_hour

This then gives:

XN HashAggregate  (cost=1512.67..1512.68 rows=1 width=9)
  ->  XN Merge Left Join DS_DIST_NONE  (cost=0.00..1504.26 rows=1682 width=9)
        Merge Cond: ("outer".dtm = "inner"."ts")
        ->  XN Seq Scan on temporary_time_dimension td  (cost=0.00..1500.00 rows=1682 width=16)
              Filter: ((dtm_half_hour <= '2017-02-16 11:00:00'::timestamp without time zone) AND (dtm_half_hour >= '2017-02-16 09:30:00'::timestamp without time zone))
        ->  XN Seq Scan on temporary_events e  (cost=0.00..0.04 rows=4 width=9)

Important caveats:

  • I've taken the ORDER BY out. Putting it back in will result in the data being sent to the leader node for sorting. If you can do away with the sort, then do away with it!
  • I'm certain that choosing timestamp as the events table sort key will NOT be ideal in many situations. I just thought I'd show what is possible.
  • I think you will likely want to have a time dimension created with diststyle all and sorted. This will ensure that your joins do not generate network traffic; a sketch follows after this list.
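
For that last point, here is a sketch of what it could look like, reusing the time dimension query from above (the table name public.time_dimension is an assumption):

drop table if exists public.time_dimension;
create table public.time_dimension
diststyle all
sortkey(dtm, dtm_half_hour)
AS (SELECT  dtm::timestamp as dtm
           ,dtm - ((DATEPART(SECONDS,dtm) + (DATEPART(MINUTES,dtm)*60) % 1800) * INTERVAL '1 second') AS dtm_half_hour
    FROM /* Series of timestamps, 1 per second, working backwards from NOW() */
         (SELECT DATE_TRUNC('SECONDS',sysdate) - (n * INTERVAL '1 second') AS dtm
          FROM /* Number sequence of 100,000 values from a large internal table */
               (SELECT ROW_NUMBER() OVER () AS n FROM stl_scan LIMIT 100000) rn) rn)
;

DISTSTYLE ALL keeps a full copy of the table on every node, so the join to the events table no longer needs to redistribute rows across the network, whatever distribution key events uses.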

You can use ROW_NUMBER() to generate a series. I use internal tables that I know to be large. FWIW, I would typically persist the time_dimension to a real table to avoid doing this repeatedly.

Here you go:

WITH events
AS (          SELECT '2017-02-16 10:02:01'::timestamp as ts, true::boolean  as lead
    UNION ALL SELECT '2017-02-16 10:03:05'::timestamp as ts, true::boolean  as lead
    UNION ALL SELECT '2017-02-16 10:31:07'::timestamp as ts, true::boolean  as lead
    UNION ALL SELECT '2017-02-16 11:00:05'::timestamp as ts, false::boolean as lead)

,time_dimension
AS (SELECT  dtm
           ,dtm - ((DATEPART(SECONDS,dtm) + (DATEPART(MINUTES,dtm)*60) % 1800) * INTERVAL '1 second') AS dtm_half_hour
    FROM /* Create a series of timestamp. 1 per second working backwards from NOW(). */
         /*  NB: `sysdate` could be substituted for an arbitrary ending timestamp */
         (SELECT DATE_TRUNC('SECONDS',sysdate) - (n * INTERVAL '1 second') AS dtm
          FROM /* Generate a number sequence of 100,000 values from a large internal table */
               (SELECT  ROW_NUMBER() OVER () AS n FROM stl_scan LIMIT 100000) rn) rn)

SELECT dtm_half_hour
      ,COUNT(CASE WHEN lead THEN 1 END)
FROM      time_dimension td
LEFT JOIN events e
       ON td.dtm = e.ts
WHERE td.dtm_half_hour BETWEEN '2017-02-16 09:30:00' AND '2017-02-16 11:00:00'
GROUP BY 1
ORDER BY 1
;
