I am trying to identify where there was some data loss by comparing two sets of data.
The first set of data contains a truncated, non-unique barcode and a timestamp with one-second precision, which I've found is also not unique. This is stored in a table called restoredData, which was built from text backups created every night.
The second set is really two tables, one called items and one called items_archive. They too have non-unique short barcodes and non-unique timestamps.
restoredData
has 2,437,910 records, one per item
. items
has 405,009 and items_archive
has 1,589,768, for a total of 1,994,777 rows. So there are at least 443,133 more records in restoredData than there are in the union of items and items_archive.
However, whenever I try to LEFT JOIN restoredData to the union of items and items_archive, I get 2,437,910 matches. When I search for rows where the right side of the LEFT JOIN is NULL, i.e. where there is no matching record in items + items_archive, I get a count of 0. I've tried joining on barcode, on timestamp, and on both at the same time, with the same results.
This is definitely due to the non-uniqueness of all my available keys. But if I could allow each row from my (SELECT t_stamp, barcode FROM items UNION ALL SELECT t_stamp, barcode FROM items_archive) AS allItems to be used only ONCE in the join, i.e. so that it cannot match multiple rows in restoredData, I think it would give me the information I am actually looking for: records that were recorded via text but got lost from the items and items_archive tables.
Is there a way to do that in SQL? Or am I going to have to do this programmatically with, say, Python: go row by row through restoredData, find a match, and if there is a match delete it so it can't be used again?
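If it does come down to doing this outside SQL, the find-a-match-then-delete loop can probably be replaced by a multiset difference. Below is a minimal sketch, assuming both datasets have already been loaded as lists of (barcode, t_stamp) tuples. The function name missing_rows and the sample rows are made up for illustration and are not part of the real schema.

```python
from collections import Counter

def missing_rows(restored, items):
    """Multiset difference: what is left of `restored` after every row of
    `items` has cancelled at most one identical row."""
    # Counter subtraction drops keys whose count falls to zero or below,
    # so item rows with no restored counterpart (e.g. NO_READ) are ignored.
    return Counter(restored) - Counter(items)

# Hypothetical sample data: four identical restored rows, one matching item row.
restored = [("barcode1", "timestamp1")] * 4 + [("barcode2", "timestamp2")]
items = [
    ("barcode1", "timestamp1"),
    ("barcode2", "timestamp2"),
    ("NO_READ", "timestamp3"),
]

for (barcode, t_stamp), n in missing_rows(restored, items).items():
    print(barcode, t_stamp, n)  # prints: barcode1 timestamp1 3
```

Each item row can cancel only one restored row, which is the "used ONCE" behaviour described above. The result gives a count of unmatched rows per (barcode, t_stamp) pair rather than the individual rows.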
Another thing, I know this can't be correctly matching because in my items and items_archive tables, I have a special barcode "NO_READ" which happened during errors reading the barcode, but no such value is found in the entirety of restoredData
.
I am using MySQL 5.6.
For reference
restoredData table, 2,437,910 records
barCode (Varchar(13), non-unique), t_stamp (Datetime, non-unique)
items and items_archive table 1,994,777 records total
barCode (Varchar(13), non-unique), t_stamp (Datetime, non-unique)
To give an example, I could have barcode1, timestamp1 appear 4 times in restoredData and only once in items + items_archive, and the result as it stands is this:
restoredData              items + items_archive
barcodeCol  t_stampCol    barcode2Col  t_stamp2Col
barcode1    timestamp1    barcode1     timestamp1
barcode1    timestamp1    barcode1     timestamp1
barcode1    timestamp1    barcode1     timestamp1
barcode1    timestamp1    barcode1     timestamp1
What I want is this:
restoredData              items + items_archive
barcodeCol  t_stampCol    barcode2Col  t_stamp2Col
barcode1    timestamp1    barcode1     timestamp1
barcode1    timestamp1    NULL         NULL
barcode1    timestamp1    NULL         NULL
barcode1    timestamp1    NULL         NULL
The only way I can think of is to create some temporary tables with an auto-increment column and then use that column to rank the duplicates, which gives you a column that is unique between the two datasets. Note that DENSE_RANK() OVER (...) is a window function and needs MySQL 8.0 or later, so this will not run as-is on 5.6:-
-- An AUTO_INCREMENT column must be part of a key in MySQL
CREATE TEMPORARY TABLE items_full (t_stamp datetime, barcode varchar(13), idx int NOT NULL AUTO_INCREMENT PRIMARY KEY);
CREATE TEMPORARY TABLE restored_data (t_stamp datetime, barcode varchar(13), idx int NOT NULL AUTO_INCREMENT PRIMARY KEY);
-- Name the target columns so idx is filled in automatically
INSERT INTO items_full (t_stamp, barcode)
SELECT t_stamp, barcode FROM items
UNION ALL
SELECT t_stamp, barcode FROM items_archive;
INSERT INTO restored_data (t_stamp, barcode)
SELECT t_stamp, barcode FROM restoredData;
-- Restored rows that have no partner with the same rank in items_full
SELECT aa.t_stamp, aa.barcode, aa.myrank
FROM (SELECT t_stamp, barcode, DENSE_RANK() OVER (PARTITION BY barcode, t_stamp ORDER BY idx) AS myrank FROM restored_data) aa
LEFT JOIN
(SELECT t_stamp, barcode, DENSE_RANK() OVER (PARTITION BY barcode, t_stamp ORDER BY idx) AS myrank FROM items_full) bb
ON bb.t_stamp = aa.t_stamp AND bb.barcode = aa.barcode AND bb.myrank = aa.myrank
WHERE bb.t_stamp IS NULL;
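To check the ranking idea on a small sample without a database, here is a minimal Python sketch of the same pairing. It assumes rows are (barcode, t_stamp) tuples. The function names number_duplicates and left_join_once and the sample rows are hypothetical, chosen only for this illustration.

```python
from collections import defaultdict

def number_duplicates(rows):
    """Append a running occurrence number to each (barcode, t_stamp) row,
    playing the role of the per-partition rank."""
    seen = defaultdict(int)
    numbered = []
    for row in rows:
        seen[row] += 1
        numbered.append(row + (seen[row],))
    return numbered

def left_join_once(restored, items):
    """Pair every restored row with at most one item row.
    The right-hand side is None where no partner is left."""
    available = set(number_duplicates(items))
    return [
        (key[:2], key[:2] if key in available else None)
        for key in number_duplicates(restored)
    ]

# Hypothetical sample: the 4-versus-1 case from the question.
restored = [("barcode1", "timestamp1")] * 4
items = [("barcode1", "timestamp1")]

for left, right in left_join_once(restored, items):
    print(left, right)
```

With this sample, the first restored row is paired with the single item row and the remaining three come back with None, matching the desired result table in the question.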
I'd start with counting. Where the count per barcode and timestamp does not match, you'll have to inspect the related records.
select
r.barcode,
r.t_stamp,
r.cnt as recover_count,
i.cnt as itemtables_count
from
(
select barcode, t_stamp, count(*) as cnt
from restoreddata
group by barcode, t_stamp
) r
left join
(
select barcode, t_stamp, count(*) as cnt
from
(
select barcode, t_stamp from items
union all
select barcode, t_stamp from items_archive
) all_items
group by barcode, t_stamp
) i on i.barcode = r.barcode
and i.t_stamp = r.t_stamp
where i.cnt is null or i.cnt <> r.cnt;