SQL: find count of duplicates, new values added, and values removed in the same table (dynamically)
I'm hoping to complete the goals below using SQL:
1) Find # of duplicated records
Extract the number of repeated values based on a column, which is a "snapshot date", comparing each date against the previous date
2) Find # of records added
3) Find # of records removed
Current Table
snapshot_date | unique ID
2018-08-15 1
2018-08-15 2
2018-08-15 3
2018-08-15 4
2018-08-15 5
2018-08-16 1
2018-08-16 3
2018-08-16 4
2018-08-16 6
2018-08-16 7
2018-08-16 8
2018-08-16 9
2018-08-17 3
2018-08-17 8
2018-08-17 10
2018-08-17 11
2018-08-17 12
2018-08-17 13
Desired Table
snapshot date | count | # of dupe from previous date | sum of ID added | sum of ID removed
2018-08-15 | 5 | N/A | N/A | N/A
2018-08-16 | 7 | 3 | 4 | 2
2018-08-17 | 6 | 2 | 4 | 5
If anyone knows the script to get to the desired table, I'd be so appreciative! Thank y'all in advance!
If you are using MySQL, which before version 8.0 does not support the analytic functions LEAD and LAG, then one approach would be to do a series of self joins followed by an aggregation to get the results you want:
SELECT
    t1.snapshot_date,
    t1.count,
    t1.previous_dupe,
    t1.num_added,
    t2.num_subtracted
FROM
(
    -- current day joined to the previous day: duplicates and additions
    SELECT
        t1.snapshot_date,
        COUNT(*) AS count,
        COUNT(t2.snapshot_date) AS previous_dupe,
        COUNT(CASE WHEN t2.snapshot_date IS NULL THEN 1 END) AS num_added
    FROM yourTable t1
    LEFT JOIN yourTable t2
        ON t1.snapshot_date = DATE_ADD(t2.snapshot_date, INTERVAL 1 DAY) AND
           t1.uniqueID = t2.uniqueID
    GROUP BY t1.snapshot_date
) t1
LEFT JOIN
(
    -- each day joined to the next day: IDs missing from the next day count as removed
    SELECT
        DATE_ADD(t1.snapshot_date, INTERVAL 1 DAY) AS snapshot_date,
        COUNT(CASE WHEN t2.snapshot_date IS NULL THEN 1 END) AS num_subtracted
    FROM yourTable t1
    LEFT JOIN yourTable t2
        ON t1.snapshot_date = DATE_SUB(t2.snapshot_date, INTERVAL 1 DAY) AND
           t1.uniqueID = t2.uniqueID
    GROUP BY t1.snapshot_date
) t2
    ON t1.snapshot_date = t2.snapshot_date;
Notes: There is a slight discrepancy between my results and what you expect, partly due to your own math error, and partly due to the way the logic in the query works. I report 5 new IDs being added in the earliest record, because conceptually there was no earlier record, and all 5 values are technically new.
This problem was particularly ugly because we needed to self join twice, in two separate subqueries, in different directions.
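To try the two-subquery self join without a MySQL server, here is a minimal sketch that ports it to SQLite through Python's standard sqlite3 module and runs it on the sample rows from the question. It is a port under assumptions, not the MySQL query itself: SQLite's date(x, '+1 day') stands in for DATE_ADD(x, INTERVAL 1 DAY), and the names cur, rem, cnt, and run_self_join_report are made up for this illustration.

```python
import sqlite3

# Sample rows copied from the question's "Current Table".
ROWS = [
    ("2018-08-15", 1), ("2018-08-15", 2), ("2018-08-15", 3),
    ("2018-08-15", 4), ("2018-08-15", 5),
    ("2018-08-16", 1), ("2018-08-16", 3), ("2018-08-16", 4),
    ("2018-08-16", 6), ("2018-08-16", 7), ("2018-08-16", 8),
    ("2018-08-16", 9),
    ("2018-08-17", 3), ("2018-08-17", 8), ("2018-08-17", 10),
    ("2018-08-17", 11), ("2018-08-17", 12), ("2018-08-17", 13),
]

# SQLite port of the self-join query (assumption: dates stored as ISO text).
QUERY = """
SELECT cur.snapshot_date, cur.cnt, cur.previous_dupe,
       cur.num_added, rem.num_subtracted
FROM
(
    -- current day vs. previous day: duplicates and additions
    SELECT t1.snapshot_date,
           COUNT(*) AS cnt,
           COUNT(t2.snapshot_date) AS previous_dupe,
           COUNT(CASE WHEN t2.snapshot_date IS NULL THEN 1 END) AS num_added
    FROM yourTable t1
    LEFT JOIN yourTable t2
        ON t1.snapshot_date = date(t2.snapshot_date, '+1 day')
       AND t1.uniqueID = t2.uniqueID
    GROUP BY t1.snapshot_date
) cur
LEFT JOIN
(
    -- each day vs. next day: IDs absent from the next day were removed
    SELECT date(t1.snapshot_date, '+1 day') AS snapshot_date,
           COUNT(CASE WHEN t2.snapshot_date IS NULL THEN 1 END) AS num_subtracted
    FROM yourTable t1
    LEFT JOIN yourTable t2
        ON t1.snapshot_date = date(t2.snapshot_date, '-1 day')
       AND t1.uniqueID = t2.uniqueID
    GROUP BY t1.snapshot_date
) rem
    ON cur.snapshot_date = rem.snapshot_date
ORDER BY cur.snapshot_date
"""


def run_self_join_report(rows):
    """Load rows into an in-memory SQLite table and run the ported query."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE yourTable (snapshot_date TEXT, uniqueID INTEGER)"
        )
        conn.executemany("INSERT INTO yourTable VALUES (?, ?)", rows)
        return conn.execute(QUERY).fetchall()
    finally:
        conn.close()


report = run_self_join_report(ROWS)
for row in report:
    print(row)
```

On the sample data the earliest date comes back with 0 duplicates, 5 added, and NULL (None in Python) removed, which matches the behaviour described in the notes; the other two dates match the desired table.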
This is my take, based on SQL Server:
SELECT snapshot_date     = COALESCE(c.snapshot_date, DATEADD(DAY, 1, p.snapshot_date)),
       [count]           = COUNT(c.snapshot_date),
       dup_from_prev_day = SUM(CASE WHEN c.snapshot_date IS NOT NULL
                                     AND p.snapshot_date IS NOT NULL
                                    THEN 1 END),
       sum_of_id_added   = SUM(CASE WHEN c.snapshot_date IS NOT NULL
                                     AND p.snapshot_date IS NULL
                                    THEN 1 END),
       sum_of_id_removed = SUM(CASE WHEN c.snapshot_date IS NULL
                                     AND p.snapshot_date IS NOT NULL
                                    THEN 1 END)
FROM yourTable c                 -- current
FULL OUTER JOIN yourTable p      -- previous
    ON c.snapshot_date = DATEADD(DAY, 1, p.snapshot_date)
   AND c.uniqueID = p.uniqueID
GROUP BY COALESCE(c.snapshot_date, DATEADD(DAY, 1, p.snapshot_date))
HAVING COUNT(c.snapshot_date) > 0

/* RESULT :
snapshot_date  count  dup_from_prev_day  sum_of_id_added  sum_of_id_removed
2018-08-15     5      NULL               5                NULL
2018-08-16     7      3                  4                2
2018-08-17     6      2                  4                5
*/
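The per-date counts can also be cross-checked outside the database. The sketch below is a plain-Python illustration, not part of either SQL answer: it groups the sample IDs by date and uses set intersection and difference against the previous calendar day. The function name snapshot_diff is hypothetical, and reporting None for a date with no previous-day snapshot is an assumption taken from the N/A row in the desired table.

```python
from collections import defaultdict
from datetime import date, timedelta

# Sample rows copied from the question's "Current Table".
ROWS = [
    ("2018-08-15", 1), ("2018-08-15", 2), ("2018-08-15", 3),
    ("2018-08-15", 4), ("2018-08-15", 5),
    ("2018-08-16", 1), ("2018-08-16", 3), ("2018-08-16", 4),
    ("2018-08-16", 6), ("2018-08-16", 7), ("2018-08-16", 8),
    ("2018-08-16", 9),
    ("2018-08-17", 3), ("2018-08-17", 8), ("2018-08-17", 10),
    ("2018-08-17", 11), ("2018-08-17", 12), ("2018-08-17", 13),
]


def snapshot_diff(rows):
    """Return (date, count, dupes, added, removed) per snapshot date."""
    ids_by_day = defaultdict(set)
    for day, uid in rows:
        ids_by_day[date.fromisoformat(day)].add(uid)

    report = []
    for day in sorted(ids_by_day):
        current = ids_by_day[day]
        # .get() avoids creating an empty entry for a missing previous day
        previous = ids_by_day.get(day - timedelta(days=1))
        if previous is None:
            # No snapshot for the previous day: nothing to compare against
            report.append((day.isoformat(), len(current), None, None, None))
        else:
            report.append((
                day.isoformat(),
                len(current),
                len(current & previous),   # IDs present on both days
                len(current - previous),   # IDs new on this day
                len(previous - current),   # IDs gone since the previous day
            ))
    return report


for row in snapshot_diff(ROWS):
    print(row)
```

For 2018-08-16 and 2018-08-17 this gives (7, 3, 4, 2) and (6, 2, 4, 5), the same figures as the query result above; only the treatment of the first date differs, since it is a matter of convention.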