Calculating time difference in redshift

Question

I have a table table_a :

event_id        event_start                 process_id          process_start                name            country
A1              2020-07-01 21:19:01         B1                  2020-07-01 21:20:05          google          US
A1              2020-07-01 21:19:01         B2                  2020-07-01 21:21:01          google          US
A1              2020-07-01 21:19:01         B3                  2020-07-01 21:23:04          google          US
A4              2020-07-01 14:59:12         C1                  2020-07-01 15:01:14          bing            UK
A5              2020-07-01 12:39:14         D1                  2020-07-01 12:49:13          bing            CA
A6              2020-07-01 11:49:46         E1                  2020-07-01 11:52:59          facebook        US

In this table I have event_id which can be the same, if event_id is the same, then event_start will be the same too. process_id is unique, process_start can have duplicate. I am trying to calculate the minute difference between event_start and process_start for each event_id , the problem is that event_start has the same time but the process_start can have different timestamps. I would like to take 2 times from the process_start if they have more than one. First would be the earliest (min) process_start and the latest (max) process_start so that my desired output would look like so:

event_id        event_start                 process_id          process_start                name            country        earliest_diff_minute                                latest_diff_minute
A1              2020-07-01 21:19:01         B1                  2020-07-01 21:20:05          google          US             1 (2020-07-01 21:20:05 - 2020-07-01 21:19:01)       3 (2020-07-01 21:23:05 - 2020-07-01 21:19:04)
A1              2020-07-01 21:19:01         B3                  2020-07-01 21:23:04          google          US             1 (2020-07-01 21:20:05 - 2020-07-01 21:19:01)       3 (2020-07-01 21:23:05 - 2020-07-01 21:19:04)
A4              2020-07-01 14:59:12         C1                  2020-07-01 15:01:14          bing            UK             2 ( 2020-07-01 15:01:14 - 2020-07-01 14:59:12)      2 ( 2020-07-01 15:01:14 - 2020-07-01 14:59:12) 
A5              2020-07-01 12:39:14         D1                  2020-07-01 12:49:13          bing            CA             10                                                  10
A6              2020-07-01 11:49:46         E1                  2020-07-01 11:52:59          facebook        US             3                                                   3

So if the process_id is unique, the min and max time difference will be the same. If more than 1, both max and min values are recorded while everything in between is discarded.

Answer 1

I assume that duplicates is by name and country . You can just use window functions, particularly min() and max() to get the earliest and latest process dates for each grouping:

select a.*,
       datediff('m', event_start, max(process_start) over (partition by event_id, name, country)),
       datediff('m', event_start, min(process_start) over (partition by event_id, name, country))
from table_a a

Answer 2

I think there are various ways to achieve your goal, this is the first that I thought

SELECT event_id,
       event_start,
       process_id,
       process_start,
       name,
       country,
       datediff('m', event_start, first_process_start) as earliest_diff_minute,
       datediff('m', last_process_start, event_start) as latest_diff_minute
FROM (
SELECT event_id,
       event_start,
       process_id,
       process_start,
       name,
       country
       first_value(process_start) 
         over (partition by event_id 
               order by process_start 
               rows between unbounded preceding and unbounded following) as first_process_start,
       last_value(process_start) 
         over (partition by event_id 
               order by process_start 
               rows between unbounded preceding and unbounded following) as last_process_start
FROM my_schema.my_table) as a
WHERE process_start = first_process_start
OR process_start = last_process_start

Basically in the sub query for each row you retrieve the first and last process_start. Then you retrieve only the rows that have the process_start equal to one of each and after you can calculate the date difference.

Calculating time difference in redshift

Question

2 answers

solution1
2 2020-07-02 12:09:59

solution2
1 2020-07-02 09:35:35

Calculating time difference in redshift

Question

2 answers

solution1 2 2020-07-02 12:09:59

solution2 1 2020-07-02 09:35:35

solution1
2 2020-07-02 12:09:59

solution2
1 2020-07-02 09:35:35