Calculating time difference in Redshift

I have a table table_a :

event_id        event_start                 process_id          process_start                name            country
A1              2020-07-01 21:19:01         B1                  2020-07-01 21:20:05          google          US
A1              2020-07-01 21:19:01         B2                  2020-07-01 21:21:01          google          US
A1              2020-07-01 21:19:01         B3                  2020-07-01 21:23:04          google          US
A4              2020-07-01 14:59:12         C1                  2020-07-01 15:01:14          bing            UK
A5              2020-07-01 12:39:14         D1                  2020-07-01 12:49:13          bing            CA
A6              2020-07-01 11:49:46         E1                  2020-07-01 11:52:59          facebook        US

In this table event_id can repeat, and rows with the same event_id always have the same event_start. process_id is unique, but process_start can contain duplicates. I am trying to calculate the difference in minutes between event_start and process_start for each event_id. The problem is that event_start is the same for every row of an event while process_start can differ. When an event has more than one process_start, I would like to keep only two of them: the earliest (min) and the latest (max), so that my desired output looks like this:

event_id        event_start                 process_id          process_start                name            country        earliest_diff_minute                                latest_diff_minute
A1              2020-07-01 21:19:01         B1                  2020-07-01 21:20:05          google          US             1 (2020-07-01 21:20:05 - 2020-07-01 21:19:01)       4 (2020-07-01 21:23:04 - 2020-07-01 21:19:01)
A1              2020-07-01 21:19:01         B3                  2020-07-01 21:23:04          google          US             1 (2020-07-01 21:20:05 - 2020-07-01 21:19:01)       4 (2020-07-01 21:23:04 - 2020-07-01 21:19:01)
A4              2020-07-01 14:59:12         C1                  2020-07-01 15:01:14          bing            UK             2 (2020-07-01 15:01:14 - 2020-07-01 14:59:12)       2 (2020-07-01 15:01:14 - 2020-07-01 14:59:12)
A5              2020-07-01 12:39:14         D1                  2020-07-01 12:49:13          bing            CA             10                                                  10
A6              2020-07-01 11:49:46         E1                  2020-07-01 11:52:59          facebook        US             3                                                   3

So if an event_id has only one process_id, the min and max time differences will be the same. If it has more than one, only the rows with the min and max process_start are kept and everything in between is discarded.

I assume that duplicates are defined by name and country. You can use window functions, specifically min() and max(), to get the earliest and latest process_start for each group. Note that this query keeps every row; it does not discard the ones in between:

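select a.*,
       -- minutes from event_start to the earliest process_start in the group
       datediff('m', event_start, min(process_start) over (partition by event_id, name, country)) as earliest_diff_minute,
       -- minutes from event_start to the latest process_start in the group
       datediff('m', event_start, max(process_start) over (partition by event_id, name, country)) as latest_diff_minute
from table_a a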

I think there are various ways to achieve your goal; this is the first one I thought of:

SELECT event_id,
       event_start,
       process_id,
       process_start,
       name,
       country,
       datediff('m', event_start, first_process_start) as earliest_diff_minute,
       datediff('m', event_start, last_process_start) as latest_diff_minute
FROM (
SELECT event_id,
       event_start,
       process_id,
       process_start,
       name,
       country,
       -- earliest process_start of the event, repeated on every row
       first_value(process_start) 
         over (partition by event_id 
               order by process_start 
               rows between unbounded preceding and unbounded following) as first_process_start,
       -- latest process_start of the event, repeated on every row
       last_value(process_start) 
         over (partition by event_id 
               order by process_start 
               rows between unbounded preceding and unbounded following) as last_process_start
FROM table_a) as a
-- keep only the rows that hold the earliest or the latest process_start
WHERE process_start = first_process_start
OR process_start = last_process_start

Basically, the subquery attaches the first and last process_start of the event to every row. The outer query then keeps only the rows whose process_start equals one of those two values and calculates the date differences against event_start.
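If you want to sanity-check the expected numbers outside the database, here is a minimal Python sketch of the same logic on the sample rows from the question. It rests on one assumption: that datediff with the minute part counts the minute boundaries crossed (seconds are ignored) rather than rounding elapsed time, which is how I understand Redshift to behave. The names minute_diff and earliest_and_latest are made up for this illustration.

```python
from collections import defaultdict
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"


def minute_diff(start: str, end: str) -> int:
    """Minute boundaries crossed from start to end (assumed datediff behavior)."""
    # Drop the seconds so that only the minute boundaries are counted.
    s = datetime.strptime(start, FMT).replace(second=0)
    e = datetime.strptime(end, FMT).replace(second=0)
    return int((e - s).total_seconds() // 60)


# (event_id, event_start, process_id, process_start) from the question's table_a
rows = [
    ("A1", "2020-07-01 21:19:01", "B1", "2020-07-01 21:20:05"),
    ("A1", "2020-07-01 21:19:01", "B2", "2020-07-01 21:21:01"),
    ("A1", "2020-07-01 21:19:01", "B3", "2020-07-01 21:23:04"),
    ("A4", "2020-07-01 14:59:12", "C1", "2020-07-01 15:01:14"),
    ("A5", "2020-07-01 12:39:14", "D1", "2020-07-01 12:49:13"),
    ("A6", "2020-07-01 11:49:46", "E1", "2020-07-01 11:52:59"),
]


def earliest_and_latest(rows):
    """Keep only the rows holding the min or max process_start of each event."""
    # Plays the role of "partition by event_id": collect process_start per event.
    by_event = defaultdict(list)
    for event_id, _, _, process_start in rows:
        by_event[event_id].append(process_start)

    result = []
    for event_id, event_start, process_id, process_start in rows:
        # ISO-formatted timestamps sort correctly as plain strings.
        first = min(by_event[event_id])
        last = max(by_event[event_id])
        # Same filter as the WHERE clause of the outer query.
        if process_start in (first, last):
            result.append((event_id, process_id,
                           minute_diff(event_start, first),
                           minute_diff(event_start, last)))
    return result


for row in earliest_and_latest(rows):
    print(row)
```

Under that boundary-counting assumption, row B2 is dropped, A5 comes out as 10 even though only 9 minutes 59 seconds elapsed, and the latest difference for A1 (21:19:01 to 21:23:04) comes out as 4.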
