I'm hoping to accomplish the following using SQL:
1) Find the number of duplicated records, i.e. for each "snapshot date", the number of IDs that also appear on the previous date
2) Find the number of records added since the previous date
3) Find the number of records removed since the previous date
Current Table
snapshot_date | unique ID
2018-08-15 | 1
2018-08-15 | 2
2018-08-15 | 3
2018-08-15 | 4
2018-08-15 | 5
2018-08-16 | 1
2018-08-16 | 3
2018-08-16 | 4
2018-08-16 | 6
2018-08-16 | 7
2018-08-16 | 8
2018-08-16 | 9
2018-08-17 | 3
2018-08-17 | 8
2018-08-17 | 10
2018-08-17 | 11
2018-08-17 | 12
2018-08-17 | 13
Desired Table
snapshot_date | count | # of dupes from previous date | # of IDs added | # of IDs removed
2018-08-15 | 5 | N/A | N/A | N/A
2018-08-16 | 7 | 3 | 4 | 2
2018-08-17 | 6 | 2 | 4 | 5
If anyone knows a script that produces the desired table, I'd be so appreciative! Thank y'all in advance!
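To make the three definitions concrete, here is a minimal Python sketch (a cross-check only, not the SQL being asked for) that reproduces the desired table with set arithmetic. It assumes "previous date" means the previous snapshot present in the data; the names `rows`, `ids_by_date` and `summary` are made up for illustration.

```python
from collections import defaultdict

# Sample data copied from the "Current Table" above: (snapshot_date, unique ID).
rows = [
    ("2018-08-15", 1), ("2018-08-15", 2), ("2018-08-15", 3),
    ("2018-08-15", 4), ("2018-08-15", 5),
    ("2018-08-16", 1), ("2018-08-16", 3), ("2018-08-16", 4),
    ("2018-08-16", 6), ("2018-08-16", 7), ("2018-08-16", 8),
    ("2018-08-16", 9),
    ("2018-08-17", 3), ("2018-08-17", 8), ("2018-08-17", 10),
    ("2018-08-17", 11), ("2018-08-17", 12), ("2018-08-17", 13),
]

# Collect the set of IDs seen on each snapshot date.
ids_by_date = defaultdict(set)
for snapshot_date, unique_id in rows:
    ids_by_date[snapshot_date].add(unique_id)

summary = []
prev = None  # IDs of the previous snapshot (assumption: previous date in the data)
for snapshot_date in sorted(ids_by_date):
    cur = ids_by_date[snapshot_date]
    if prev is None:
        # Nothing to compare the first snapshot against, hence N/A (None).
        summary.append((snapshot_date, len(cur), None, None, None))
    else:
        summary.append((
            snapshot_date,
            len(cur),
            len(cur & prev),  # dupes: present on both dates
            len(cur - prev),  # added: present now, absent before
            len(prev - cur),  # removed: present before, absent now
        ))
    prev = cur

for line in summary:
    print(line)
```

Dupes, added and removed are just the intersection and the two set differences of consecutive snapshots, which is what any SQL solution has to express through joins.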
If you are using MySQL, which before version 8.0 does not support the analytic functions LEAD and LAG, then one approach is a series of self joins followed by an aggregation to get the results you want:
SELECT
    t1.snapshot_date,
    t1.count,
    t1.previous_dupe,
    t1.num_added,
    t2.num_subtracted
FROM
(
    SELECT
        t1.snapshot_date,
        COUNT(*) AS count,
        COUNT(t2.snapshot_date) AS previous_dupe,
        COUNT(CASE WHEN t2.snapshot_date IS NULL THEN 1 END) AS num_added
    FROM yourTable t1
    LEFT JOIN yourTable t2
        ON t1.snapshot_date = DATE_ADD(t2.snapshot_date, INTERVAL 1 DAY) AND
           t1.uniqueID = t2.uniqueID
    GROUP BY t1.snapshot_date
) t1
LEFT JOIN
(
    SELECT
        DATE_ADD(t1.snapshot_date, INTERVAL 1 DAY) AS snapshot_date,
        COUNT(CASE WHEN t2.snapshot_date IS NULL THEN 1 END) AS num_subtracted
    FROM yourTable t1
    LEFT JOIN yourTable t2
        ON t1.snapshot_date = DATE_SUB(t2.snapshot_date, INTERVAL 1 DAY) AND
           t1.uniqueID = t2.uniqueID
    GROUP BY t1.snapshot_date
) t2
    ON t1.snapshot_date = t2.snapshot_date;
Notes: There is a slight discrepancy between my results and what you expect, partly due to your own math error, and partly due to the way the logic in the query works. For the earliest date I report 5 new IDs being added, because conceptually there was no earlier record, so all 5 values are technically new.
This problem was particularly ugly because we needed to self join twice, in two separate subqueries, in different directions.
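The two-subquery approach above can be tried without a MySQL server. The following is only a sketch under stated assumptions: it runs an adapted version of the query in SQLite through Python's `sqlite3` module, so `DATE_ADD(..., INTERVAL 1 DAY)` becomes SQLite's `date(..., '+1 day')`, and the aliases `cur`, `prev`, `a`, `r` and the column name `cnt` are my own renamings rather than part of the original answer. Like the original, it assumes snapshots are taken on consecutive calendar days.

```python
import sqlite3

# Sample data from the question; table and column names follow the answer.
rows = [
    ("2018-08-15", 1), ("2018-08-15", 2), ("2018-08-15", 3),
    ("2018-08-15", 4), ("2018-08-15", 5),
    ("2018-08-16", 1), ("2018-08-16", 3), ("2018-08-16", 4),
    ("2018-08-16", 6), ("2018-08-16", 7), ("2018-08-16", 8),
    ("2018-08-16", 9),
    ("2018-08-17", 3), ("2018-08-17", 8), ("2018-08-17", 10),
    ("2018-08-17", 11), ("2018-08-17", 12), ("2018-08-17", 13),
]

QUERY = """
SELECT a.snapshot_date, a.cnt, a.previous_dupe, a.num_added, r.num_subtracted
FROM (
    -- Join each row to the same ID on the previous day:
    -- a match is a dupe, no match is an added ID.
    SELECT cur.snapshot_date,
           COUNT(*) AS cnt,
           COUNT(prev.snapshot_date) AS previous_dupe,
           COUNT(CASE WHEN prev.snapshot_date IS NULL THEN 1 END) AS num_added
    FROM yourTable cur
    LEFT JOIN yourTable prev
           ON cur.snapshot_date = date(prev.snapshot_date, '+1 day')
          AND cur.uniqueID = prev.uniqueID
    GROUP BY cur.snapshot_date
) a
LEFT JOIN (
    -- Join each row to the same ID on the next day:
    -- no match means the ID was removed as of that next day.
    SELECT date(prev.snapshot_date, '+1 day') AS snapshot_date,
           COUNT(CASE WHEN cur.snapshot_date IS NULL THEN 1 END) AS num_subtracted
    FROM yourTable prev
    LEFT JOIN yourTable cur
           ON cur.snapshot_date = date(prev.snapshot_date, '+1 day')
          AND cur.uniqueID = prev.uniqueID
    GROUP BY prev.snapshot_date
) r ON a.snapshot_date = r.snapshot_date
ORDER BY a.snapshot_date
"""

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE yourTable (snapshot_date TEXT, uniqueID INTEGER)")
conn.executemany("INSERT INTO yourTable VALUES (?, ?)", rows)
result = conn.execute(QUERY).fetchall()
conn.close()

for row in result:
    print(row)
```

On the sample data, only the earliest date differs from the desired table: it shows 0 dupes and 5 added instead of N/A, and its removed count comes back as NULL (`None` in Python).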
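This is my take, based on SQL Server. It pairs each day's rows with the previous day's rows in a single FULL OUTER JOIN, so added, removed and duplicated IDs can all be counted in one pass: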
SELECT snapshot_date     = COALESCE(c.snapshot_date, DATEADD(DAY, 1, p.snapshot_date)),
       [count]           = COUNT(c.snapshot_date),
       dup_from_prev_day = SUM(CASE WHEN c.snapshot_date IS NOT NULL
                                     AND p.snapshot_date IS NOT NULL
                                    THEN 1 END),
       sum_of_id_added   = SUM(CASE WHEN c.snapshot_date IS NOT NULL
                                     AND p.snapshot_date IS NULL
                                    THEN 1 END),
       sum_of_id_removed = SUM(CASE WHEN c.snapshot_date IS NULL
                                     AND p.snapshot_date IS NOT NULL
                                    THEN 1 END)
FROM yourTable c                 -- current
FULL OUTER JOIN yourTable p      -- previous
    ON  c.snapshot_date = DATEADD(DAY, 1, p.snapshot_date)
    AND c.uniqueID      = p.uniqueID
GROUP BY COALESCE(c.snapshot_date, DATEADD(DAY, 1, p.snapshot_date))
HAVING COUNT(c.snapshot_date) > 0
/* RESULT :
snapshot_date  count  dup_from_prev_day  sum_of_id_added  sum_of_id_removed
2018-08-15     5      NULL               5                NULL
2018-08-16     7      3                  4                2
2018-08-17     6      2                  4                5
*/
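For engines without FULL OUTER JOIN (older SQLite or MySQL, for example), the same pairing can be emulated with a LEFT JOIN plus an anti-join combined by UNION ALL. The sketch below is an adaptation, not the query above verbatim: it runs in SQLite through Python's `sqlite3` module, replaces `DATEADD` with SQLite's `date(..., '+1 day')`, and the CTE name `paired` and the columns `cur_date`, `prev_date` and `cnt` are hypothetical names introduced for illustration.

```python
import sqlite3

# Sample data from the question.
rows = [
    ("2018-08-15", 1), ("2018-08-15", 2), ("2018-08-15", 3),
    ("2018-08-15", 4), ("2018-08-15", 5),
    ("2018-08-16", 1), ("2018-08-16", 3), ("2018-08-16", 4),
    ("2018-08-16", 6), ("2018-08-16", 7), ("2018-08-16", 8),
    ("2018-08-16", 9),
    ("2018-08-17", 3), ("2018-08-17", 8), ("2018-08-17", 10),
    ("2018-08-17", 11), ("2018-08-17", 12), ("2018-08-17", 13),
]

QUERY = """
WITH paired AS (
    -- Every current row, with its previous-day match if one exists.
    SELECT c.snapshot_date AS cur_date, p.snapshot_date AS prev_date
    FROM yourTable c
    LEFT JOIN yourTable p
           ON c.snapshot_date = date(p.snapshot_date, '+1 day')
          AND c.uniqueID = p.uniqueID
    UNION ALL
    -- Previous-day rows with no match on the next day (the removed IDs).
    SELECT NULL, p.snapshot_date
    FROM yourTable p
    WHERE NOT EXISTS (
        SELECT 1 FROM yourTable c
        WHERE c.snapshot_date = date(p.snapshot_date, '+1 day')
          AND c.uniqueID = p.uniqueID
    )
)
SELECT COALESCE(cur_date, date(prev_date, '+1 day')) AS snapshot_date,
       COUNT(cur_date) AS cnt,
       SUM(CASE WHEN cur_date IS NOT NULL AND prev_date IS NOT NULL THEN 1 END)
           AS dup_from_prev_day,
       SUM(CASE WHEN cur_date IS NOT NULL AND prev_date IS NULL THEN 1 END)
           AS sum_of_id_added,
       SUM(CASE WHEN cur_date IS NULL AND prev_date IS NOT NULL THEN 1 END)
           AS sum_of_id_removed
FROM paired
GROUP BY COALESCE(cur_date, date(prev_date, '+1 day'))
HAVING COUNT(cur_date) > 0   -- drop the day after the last snapshot
ORDER BY snapshot_date
"""

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE yourTable (snapshot_date TEXT, uniqueID INTEGER)")
conn.executemany("INSERT INTO yourTable VALUES (?, ?)", rows)
result = conn.execute(QUERY).fetchall()
conn.close()

for row in result:
    print(row)
```

The HAVING clause plays the same role as in the SQL Server query: the IDs "removed" after the last snapshot would otherwise produce a row for a date that has no snapshot of its own.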