Currently, I have the following raw data in redshift .
timestamp ,lead
==================================
"2008-04-09 10:02:01.000000",true
"2008-04-09 10:03:05.000000",true
"2008-04-09 10:31:07.000000",true
"2008-04-09 11:00:05.000000",false
...
So, I would like to generate an aggregated data, with interval of 30 mins. My wished outcome is
timestamp ,count
==================================
"2008-04-09 10:00:00.000000",2
"2008-04-09 10:30:00.000000",1
"2008-04-09 11:00:00.000000",0
...
I had referred to https://stackoverflow.com/a/12046382/3238864 , which is valid for PostgreSQL.
I try to mimic the code posted, by using
with thirty_min_intervals as (
select
(select min(timestamp)::date from events) + ( n || ' minutes')::interval start_time,
(select min(timestamp)::date from events) + ((n+30) || ' minutes')::interval end_time
from generate_series(0, (24*60), 30) n
)
select count(CASE WHEN lead THEN 1 END) from events e
right join thirty_min_intervals f
on e.timestamp >= f.start_time and e.timestamp < f.end_time
group by f.start_time, f.end_time
order by f.start_time;
However, I'm getting error
[0A000] ERROR: Specified types or functions (one per INFO message) not supported on Redshift tables.
May I know, what is the good way to perform aggregation data calculation, of N interval, in redshift .
Joe's answer is a great neat solution. I feel one should always consider how the data is distributed and sorted when you are working in Redshift. It can have a dramatic impact on performance.
Building on Joe's great answer: I will materialise the sample events. In practise the events will be in a table.
drop table if exists public.temporary_events;
create table public.temporary_events AS
select ts::timestamp as ts
,lead
from
( SELECT '2017-02-16 10:02:01'::timestamp as ts, true::boolean as lead
UNION ALL SELECT '2017-02-16 10:03:05'::timestamp as ts, true::boolean as lead
UNION ALL SELECT '2017-02-16 10:31:07'::timestamp as ts, true::boolean as lead
UNION ALL SELECT '2017-02-16 11:00:05'::timestamp as ts, false::boolean as lead)
;
Now run explain:
explain
WITH time_dimension
AS (SELECT dtm
,dtm - ((DATEPART(SECONDS,dtm) + (DATEPART(MINUTES,dtm)*60) % 1800) * INTERVAL '1 second') AS dtm_half_hour
FROM /* Create a series of timestamp. 1 per second working backwards from NOW(). */
/* NB: `sysdate` could be substituted for an arbitrary ending timestamp */
(SELECT DATE_TRUNC('SECONDS',sysdate) - (n * INTERVAL '1 second') AS dtm
FROM /* Generate a number sequence of 100,000 values from a large internal table */
(SELECT ROW_NUMBER() OVER () AS n FROM stl_scan LIMIT 100000) rn) rn)
SELECT dtm_half_hour
,COUNT(CASE WHEN lead THEN 1 END)
FROM time_dimension td
LEFT JOIN public.temporary_events e
ON td.dtm = e.ts
WHERE td.dtm_half_hour BETWEEN '2017-02-16 09:30:00' AND '2017-02-16 11:00:00'
GROUP BY 1
-- ORDER BY 1 Just to simply the job a little
The output is:
XN HashAggregate (cost=999999999999999967336168804116691273849533185806555472917961779471295845921727862608739868455469056.00..999999999999999967336168804116691273849533185806555472917961779471295845921727862608739868455469056.00 rows=1 width=9)
-> XN Hash Left Join DS_DIST_BOTH (cost=0.05..999999999999999967336168804116691273849533185806555472917961779471295845921727862608739868455469056.00 rows=1 width=9)
Outer Dist Key: ('2018-11-27 17:00:35'::timestamp without time zone - ((rn.n)::double precision * '00:00:01'::interval))
Inner Dist Key: e."ts"
Hash Cond: ("outer"."?column2?" = "inner"."ts")
-> XN Subquery Scan rn (cost=0.00..14.95 rows=1 width=8)
Filter: (((('2018-11-27 17:00:35'::timestamp without time zone - ((n)::double precision * '00:00:01'::interval)) - ((((("date_part"('minutes'::text, ('2018-11-27 17:00:35'::timestamp without time zone - ((n)::double precision * '00:00:01'::interval))) * 60) % 1800) + "date_part"('seconds'::text, ('2018-11-27 17:00:35'::timestamp without time zone - ((n)::double precision * '00:00:01'::interval)))))::double precision * '00:00:01'::interval)) <= '2017-02-16 11:00:00'::timestamp without time zone) AND ((('2018-11-27 17:00:35'::timestamp without time zone - ((n)::double precision * '00:00:01'::interval)) - ((((("date_part"('minutes'::text, ('2018-11-27 17:00:35'::timestamp without time zone - ((n)::double precision * '00:00:01'::interval))) * 60) % 1800) + "date_part"('seconds'::text, ('2018-11-27 17:00:35'::timestamp without time zone - ((n)::double precision * '00:00:01'::interval)))))::double precision * '00:00:01'::interval)) >= '2017-02-16 09:30:00'::timestamp without time zone))
-> XN Limit (cost=0.00..1.95 rows=130 width=0)
-> XN Window (cost=0.00..1.95 rows=130 width=0)
-> XN Network (cost=0.00..1.30 rows=130 width=0)
Send to slice 0
-> XN Seq Scan on stl_scan (cost=0.00..1.30 rows=130 width=0)
-> XN Hash (cost=0.04..0.04 rows=4 width=9)
-> XN Seq Scan on temporary_events e (cost=0.00..0.04 rows=4 width=9)
Kablamo!
As Joe says you may well use this pattern merrily without issue. However once your data gets sufficiently large OR your SQL logic complex you may want to optimise. If for no other reason you might like to understand the explain plan when you add more sql logic into your code.
Three areas we can look at:
Observe:
drop table if exists public.temporary_time_dimension;
create table public.temporary_time_dimension
distkey(dtm) sortkey(dtm, dtm_half_hour)
AS (SELECT dtm::timestamp as dtm
,dtm - ((DATEPART(SECONDS,dtm) + (DATEPART(MINUTES,dtm)*60) % 1800) * INTERVAL '1 second') AS dtm_half_hour
FROM /* Create a series of timestamp. 1 per second working backwards from NOW(). */
/* NB: `sysdate` could be substituted for an arbitrary ending timestamp */
(SELECT DATE_TRUNC('SECONDS',sysdate) - (n * INTERVAL '1 second') AS dtm
FROM /* Generate a number sequence of 100,000 values from a large internal table */
(SELECT ROW_NUMBER() OVER () AS n FROM stl_scan LIMIT 100000) rn) rn)
;
drop table if exists public.temporary_events;
create table public.temporary_events
distkey(ts) sortkey(ts)
AS
select ts::timestamp as ts
,lead
from
( SELECT '2017-02-16 10:02:01'::timestamp as ts, true::boolean as lead
UNION ALL SELECT '2017-02-16 10:03:05'::timestamp as ts, true::boolean as lead
UNION ALL SELECT '2017-02-16 10:31:07'::timestamp as ts, true::boolean as lead
UNION ALL SELECT '2017-02-16 11:00:05'::timestamp as ts, false::boolean as lead)
;
explain
SELECT
dtm_half_hour
,COUNT(CASE WHEN lead THEN 1 END)
FROM public.temporary_time_dimension td
LEFT JOIN public.temporary_events e
ON td.dtm = e.ts
WHERE td.dtm_half_hour BETWEEN '2017-02-16 09:30:00' AND '2017-02-16 11:00:00'
GROUP BY 1
--order by dtm_half_hour
This then gives:
XN HashAggregate (cost=1512.67..1512.68 rows=1 width=9)
-> XN Merge Left Join DS_DIST_NONE (cost=0.00..1504.26 rows=1682 width=9)
Merge Cond: ("outer".dtm = "inner"."ts")
-> XN Seq Scan on temporary_time_dimension td (cost=0.00..1500.00 rows=1682 width=16)
Filter: ((dtm_half_hour <= '2017-02-16 11:00:00'::timestamp without time zone) AND (dtm_half_hour >= '2017-02-16 09:30:00'::timestamp without time zone))
-> XN Seq Scan on temporary_events e (cost=0.00..0.04 rows=4 width=9)
Important caveats:
You can use ROW_NUMBER()
to generate a series. I use internal tables that I know to be large. FWIW, I would typically persist the time_dimension
to a real table to avoid doing this repeatedly.
Here you go:
WITH events
AS ( SELECT '2017-02-16 10:02:01'::timestamp as ts, true::boolean as lead
UNION ALL SELECT '2017-02-16 10:03:05'::timestamp as ts, true::boolean as lead
UNION ALL SELECT '2017-02-16 10:31:07'::timestamp as ts, true::boolean as lead
UNION ALL SELECT '2017-02-16 11:00:05'::timestamp as ts, false::boolean as lead)
,time_dimension
AS (SELECT dtm
,dtm - ((DATEPART(SECONDS,dtm) + (DATEPART(MINUTES,dtm)*60) % 1800) * INTERVAL '1 second') AS dtm_half_hour
FROM /* Create a series of timestamp. 1 per second working backwards from NOW(). */
/* NB: `sysdate` could be substituted for an arbitrary ending timestamp */
(SELECT DATE_TRUNC('SECONDS',sysdate) - (n * INTERVAL '1 second') AS dtm
FROM /* Generate a number sequence of 100,000 values from a large internal table */
(SELECT ROW_NUMBER() OVER () AS n FROM stl_scan LIMIT 100000) rn) rn)
SELECT dtm_half_hour
,COUNT(CASE WHEN lead THEN 1 END)
FROM time_dimension td
LEFT JOIN events e
ON td.dtm = e.ts
WHERE td.dtm_half_hour BETWEEN '2017-02-16 09:30:00' AND '2017-02-16 11:00:00'
GROUP BY 1
ORDER BY 1
;
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.