redshift/postgresql - How to find duplicates that occur within 1 second during insert?

We have a bit of an issue where one of our nodes was logging duplicate events. We use the following query to insert only unique events; however, some of the events were fired within 1 second of the previous event, so the query doesn't catch them because the date field is different.

Can someone help me update this query so that it only picks up unique events even when there is a 1-second difference?

INSERT INTO project_events
    SELECT * From (
         SELECT 
                session,
                date, 
                team,
                project,
                event_type,
                event_group,
                event_label,
                event_value,
                event_count,

                ROW_NUMBER() OVER ( PARTITION BY 
                    session,
                    date, 
                    team,
                    project,
                    event_type,
                    event_group,
                    event_label,
                    event_value,
                    event_count
                    ORDER BY date, project ASC 
                ) rownum  
         FROM tmp_table_name where record_type='update'
    ) WHERE rownum = 1;

First of all, in your example, putting the same attributes in both PARTITION BY and ORDER BY makes little sense: the values within each partition are identical, so the ordering has no effect and your query is equivalent to simply doing SELECT DISTINCT on your PARTITION BY attributes.
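
For illustration, the SELECT DISTINCT form could look roughly like this. It is only a sketch, and it assumes project_events has exactly these nine columns in this order (the table definition is not shown in the question):

```sql
-- Sketch: same de-duplication as the ROW_NUMBER() version, without the window function.
-- Assumes project_events has exactly these nine columns, in this order.
INSERT INTO project_events
SELECT DISTINCT
    session,
    date,
    team,
    project,
    event_type,
    event_group,
    event_label,
    event_value,
    event_count
FROM tmp_table_name
WHERE record_type = 'update';
```

Like the original query, this still treats two rows whose date differs by a fraction of a second as distinct events.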

Now, to the real question. How do you know an event is unique? Is it based on the combination of all these attributes: session, team, project, event_type, event_group, event_label, event_value, event_count?

If so, try this:

SELECT * FROM 
(
    SELECT 
            session,
            date, 
            team,
            project,
            event_type,
            event_group,
            event_label,
            event_value,
            event_count,
            LAG(date) OVER ( 
              PARTITION BY 
                session,
                team,
                project,
                event_type,
                event_group,
                event_label,
                event_value,
                event_count
              ORDER BY
                date
            ) prev_date 
     FROM tmp_table_name where record_type='update'
 ) sub
 WHERE prev_date IS NULL -- first event
    OR DATEDIFF(second, prev_date, date) > 1  -- events more than 1 second apart  
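
To use this filter in the INSERT itself, the outer SELECT probably needs to list the columns explicitly, since SELECT * would also return the helper column prev_date. A minimal sketch, again assuming project_events has exactly these nine columns in this order and that date is a timestamp column:

```sql
-- Sketch: insert only events that are not within 1 second of an identical previous event.
-- Assumes project_events has exactly these nine columns, in this order.
INSERT INTO project_events
SELECT
    session, date, team, project,
    event_type, event_group, event_label, event_value, event_count
FROM (
    SELECT
        session, date, team, project,
        event_type, event_group, event_label, event_value, event_count,
        -- timestamp of the previous event with identical attributes
        LAG(date) OVER (
            PARTITION BY session, team, project,
                         event_type, event_group, event_label,
                         event_value, event_count
            ORDER BY date
        ) AS prev_date
    FROM tmp_table_name
    WHERE record_type = 'update'
) sub
WHERE prev_date IS NULL                       -- first event in its group
   OR DATEDIFF(second, prev_date, date) > 1;  -- more than 1 second after the previous one
```

Two things to keep in mind. DATEDIFF is a Redshift function; plain PostgreSQL does not have it, so there the last condition could be written as date - prev_date > INTERVAL '1 second' instead, which compares elapsed time rather than counting second boundaries and may therefore behave slightly differently around the threshold. Also, each event is only compared with the one immediately before it, so a burst of identical events spaced under a second apart collapses to the first event of the burst.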
