
SQL: find count of duplicates, new values added, and values removed in the same table (dynamically)

I'm hoping to complete the goals below using SQL:

1) Find the number of duplicated records: for each "snapshot date", count the values that also appear on the previous date
2) Find the number of records added
3) Find the number of records removed

See the sample tables below:

Current Table

snapshot_date | unique ID
 2018-08-15        1
 2018-08-15        2
 2018-08-15        3
 2018-08-15        4
 2018-08-15        5

 2018-08-16        1
 2018-08-16        3
 2018-08-16        4
 2018-08-16        6
 2018-08-16        7
 2018-08-16        8
 2018-08-16        9

 2018-08-17        3
 2018-08-17        8
 2018-08-17        10
 2018-08-17        11
 2018-08-17        12
 2018-08-17        13

Desired Table

snapshot date | count | # of dupe from previous date | sum of ID added | sum of ID removed
 2018-08-15       5                 N/A                     N/A                  N/A 
 2018-08-16       7                  3                       4                    2
 2018-08-17       6                  2                       4                    5

If anyone knows the script to get to the desired table, I'd be so appreciative! Thank y'all in advance!

If you are using MySQL, which in earlier versions (before 8.0) does not support the analytic functions LEAD and LAG, then one approach would be to do a series of self joins followed by an aggregation to get the results you want:

SELECT
    t1.snapshot_date,
    t1.count,
    t1.previous_dupe,
    t1.num_added,
    t2.num_subtracted
FROM
(
    SELECT
        t1.snapshot_date,
        COUNT(*) AS count,
        COUNT(t2.snapshot_date) AS previous_dupe,
        COUNT(CASE WHEN t2.snapshot_date IS NULL THEN 1 END) AS num_added
    FROM yourTable t1
    LEFT JOIN yourTable t2
        ON t1.snapshot_date = DATE_ADD(t2.snapshot_date, INTERVAL 1 DAY) AND
           t1.uniqueID = t2.uniqueID
    GROUP BY t1.snapshot_date
) t1
LEFT JOIN
(
    SELECT
        DATE_ADD(t1.snapshot_date, INTERVAL 1 DAY) AS snapshot_date,
        COUNT(CASE WHEN t2.snapshot_date IS NULL THEN 1 END) AS num_subtracted
    FROM yourTable t1
    LEFT JOIN yourTable t2
        ON t1.snapshot_date = DATE_SUB(t2.snapshot_date, INTERVAL 1 DAY) AND
           t1.uniqueID = t2.uniqueID
    GROUP BY t1.snapshot_date
) t2
    ON t1.snapshot_date = t2.snapshot_date;


Demo

Notes: There is a slight discrepancy between my results and what you expect, partly due to a math error in your expected output, and partly due to the way the logic in the query works. I report 5 new IDs being added on the earliest date because, conceptually, there was no earlier record, so all 5 values are technically new.
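To see which numbers the sample data implies, independent of any SQL dialect, the three counts can be recomputed with plain set arithmetic. This is only a sanity-check sketch in Python, not part of either query: the `snapshots` dictionary and the `daily_changes` helper are names made up for illustration, and the sketch compares each date with the previous snapshot in sorted order (for the sample data that is the same as the previous calendar day, which is what the SQL uses).

```python
# Sample data from the question: snapshot date -> set of IDs present that day.
snapshots = {
    "2018-08-15": {1, 2, 3, 4, 5},
    "2018-08-16": {1, 3, 4, 6, 7, 8, 9},
    "2018-08-17": {3, 8, 10, 11, 12, 13},
}


def daily_changes(snapshots):
    """Return (date, count, kept, added, removed) for each snapshot date.

    Each date is compared with the snapshot just before it in sorted order.
    The first date has nothing to compare against, so it reports None.
    """
    rows = []
    previous = None
    for date in sorted(snapshots):  # ISO dates sort chronologically as strings
        current = snapshots[date]
        if previous is None:
            rows.append((date, len(current), None, None, None))
        else:
            rows.append((
                date,
                len(current),
                len(current & previous),   # IDs present on both days
                len(current - previous),   # IDs that are new today
                len(previous - current),   # IDs that disappeared today
            ))
        previous = current
    return rows


for row in daily_changes(snapshots):
    print(row)
# ('2018-08-15', 5, None, None, None)
# ('2018-08-16', 7, 3, 4, 2)
# ('2018-08-17', 6, 2, 4, 5)
```

Whether the first date should report N/A, or 5 added as the query above does, is a matter of definition; the sketch uses None to mirror the desired table.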

This problem was particularly ugly because we needed to self join twice, in two separate subqueries, in different directions: one looks back a day to find duplicates and additions, the other looks forward a day to find removals.
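For anyone who wants to try the two-direction self join without a MySQL server, here is a rough port that runs on SQLite through Python's built-in `sqlite3` module. It is a sketch under assumptions, not the original query: SQLite has no `DATE_ADD`, so `date(x, '+1 day')` stands in for it, the dates are stored as ISO text, and the aliases `cur`, `gone`, `a`, `b` and `cnt` are renamed for readability.

```python
import sqlite3

# Sample data from the question.
sample = {
    "2018-08-15": [1, 2, 3, 4, 5],
    "2018-08-16": [1, 3, 4, 6, 7, 8, 9],
    "2018-08-17": [3, 8, 10, 11, 12, 13],
}

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE yourTable (snapshot_date TEXT, uniqueID INTEGER)")
conn.executemany(
    "INSERT INTO yourTable VALUES (?, ?)",
    [(day, uid) for day, ids in sample.items() for uid in ids],
)

QUERY = """
SELECT cur.snapshot_date, cur.cnt, cur.previous_dupe, cur.num_added,
       gone.num_subtracted
FROM (
    -- Direction 1: every row looks BACK one day for the same ID.
    SELECT a.snapshot_date,
           COUNT(*) AS cnt,
           COUNT(b.snapshot_date) AS previous_dupe,
           COUNT(CASE WHEN b.snapshot_date IS NULL THEN 1 END) AS num_added
    FROM yourTable a
    LEFT JOIN yourTable b
        ON a.snapshot_date = date(b.snapshot_date, '+1 day')
       AND a.uniqueID = b.uniqueID
    GROUP BY a.snapshot_date
) cur
LEFT JOIN (
    -- Direction 2: every row looks FORWARD one day; a miss means the ID
    -- was removed, and the count is reported against that next day.
    SELECT date(a.snapshot_date, '+1 day') AS snapshot_date,
           COUNT(CASE WHEN b.snapshot_date IS NULL THEN 1 END) AS num_subtracted
    FROM yourTable a
    LEFT JOIN yourTable b
        ON b.snapshot_date = date(a.snapshot_date, '+1 day')
       AND a.uniqueID = b.uniqueID
    GROUP BY a.snapshot_date
) gone
    ON cur.snapshot_date = gone.snapshot_date
ORDER BY cur.snapshot_date
"""

rows = conn.execute(QUERY).fetchall()
for row in rows:
    print(row)
# ('2018-08-15', 5, 0, 5, None)
# ('2018-08-16', 7, 3, 4, 2)
# ('2018-08-17', 6, 2, 4, 5)
```

The removals cannot be counted in the first subquery because a removed ID has no row on the current date to start the join from; that is why the second subquery starts from the earlier day and shifts its date forward before joining.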

This is my take, based on SQL Server:

SELECT  snapshot_date       = COALESCE(c.snapshot_date, DATEADD(day, 1, p.snapshot_date)),
        [count]             = COUNT(c.snapshot_date),
        dup_from_prev_day   = SUM(CASE WHEN c.snapshot_date is not null 
                                       AND  p.snapshot_date is not null 
                                       THEN 1 END),
        sum_of_id_added     = SUM(CASE WHEN c.snapshot_date is not null 
                                       AND  p.snapshot_date is null 
                                       THEN 1 END),
        sum_of_id_removed   = SUM(CASE WHEN c.snapshot_date is null 
                                       AND  p.snapshot_date is not null 
                                       THEN 1 END)
FROM    yourTable c         -- current
        FULL OUTER JOIN yourTable p -- previous
        ON  c.snapshot_date     = DATEADD(DAY, 1, p.snapshot_date)
        AND c.uniqueID          = p.uniqueID
GROUP BY COALESCE(c.snapshot_date, DATEADD(DAY, 1, p.snapshot_date))
HAVING COUNT(c.snapshot_date) > 0

/* RESULT : 
snapshot_date  count  dup_from_prev_day  sum_of_id_added  sum_of_id_removed
2018-08-15     5      NULL               5                NULL
2018-08-16     7      3                  4                2
2018-08-17     6      2                  4                5
*/
