
Is there a way to limit rows in the right table of a left join to only be used once?

I am trying to identify where there was some data loss by comparing two sets of data.

The first set of data contains a truncated, non-unique barcode and a timestamp precise to the second, which I've found is also not unique. It is stored in a table called restoredData, which was built from text backups created every night.

The second set is really two tables, items and items_archive. They too have non-unique short barcodes and non-unique timestamps.

restoredData has 2,437,910 records, one per item. items has 405,009 and items_archive has 1,589,768, for a total of 1,994,777 rows. So there are at least 443,133 more records in restoredData than there are in the union of items and items_archive.

However, whenever I LEFT JOIN restoredData to the union of items and items_archive, I get 2,437,910 matches. When I search for rows where the right side of the LEFT JOIN is NULL, i.e. where there is no matching record in items + items_archive, I get a count of 0. I've tried joining on barcode, on timestamp, and on both at the same time, with the same results.

This is definitely due to the non-uniqueness of all my available keys. But if I could allow each row from my (SELECT t_stamp, barcode FROM items UNION ALL SELECT t_stamp, barcode FROM items_archive) as allItems to be used only ONCE in the join, i.e. so that it cannot match multiple rows in restoredData, I think it would give me the information I am actually looking for: records that were recorded via text but got lost from the items and items_archive tables.

Is there a way to do that in SQL? Or am I going to have to do this programmatically with, say, Python: go row by row through restoredData, find a match, and if there is a match, delete it so it can't be used again?

Another thing: I know the join can't be matching correctly, because my items and items_archive tables contain a special barcode "NO_READ", written whenever there was an error reading the barcode, and no such value appears anywhere in restoredData.

I am using MySQL 5.6.

For reference

restoredData table, 2,437,910 records
barCode (Varchar(13), non-unique), t_stamp (Datetime, non-unique)

items and items_archive table 1,994,777 records total
barCode (Varchar(13), non-unique), t_stamp (Datetime, non-unique)

To give an example, I could have barcode1, timestamp1 appear 4 times in my restoredData and only once in my items + items_archive table, and the result as it stands is this

 restoredData                 items+items_archive
 barcodeCol  t_stampCol       barcode2Col  t_stamp2Col
 barcode1    timestamp1       barcode1     timestamp1             
 barcode1    timestamp1       barcode1     timestamp1             
 barcode1    timestamp1       barcode1     timestamp1             
 barcode1    timestamp1       barcode1     timestamp1             

What I want is this

 restoredData                 items+items_archive
 barcodeCol  t_stampCol       barcode2Col  t_stamp2Col
 barcode1    timestamp1       barcode1     timestamp1             
 barcode1    timestamp1       NULL         NULL             
 barcode1    timestamp1       NULL         NULL             
 barcode1    timestamp1       NULL         NULL

The only way I can think of is to create some temporary tables with an auto-increment index, then use that index to rank the rows within each (barcode, t_stamp) group, so that the rank gives you a unique join column between the two datasets:

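-- An AUTO_INCREMENT column must be part of a key, hence PRIMARY KEY
CREATE TEMPORARY TABLE items_full (t_stamp datetime, barcode varchar(13), idx int NOT NULL AUTO_INCREMENT PRIMARY KEY);

CREATE TEMPORARY TABLE restored_data (t_stamp datetime, barcode varchar(13), idx int NOT NULL AUTO_INCREMENT PRIMARY KEY);

-- Name the target columns so that idx is filled in automatically
INSERT INTO items_full (t_stamp, barcode)
SELECT t_stamp, barcode FROM items
UNION ALL
SELECT t_stamp, barcode FROM items_archive;

INSERT INTO restored_data (t_stamp, barcode)
SELECT t_stamp, barcode FROM restoredData;

-- Note: window functions such as DENSE_RANK() need MySQL 8.0 or later; they are not available in 5.6.
-- Rank the rows within each (barcode, t_stamp) group on both sides and join on the rank as well,
-- so that each items row can pair with at most one restored row.
-- restored_data is the left side, so restored rows without a partner survive the join.
SELECT aa.t_stamp, aa.barcode, aa.myrank
FROM (SELECT t_stamp, barcode, DENSE_RANK() OVER (PARTITION BY barcode, t_stamp ORDER BY idx) AS myrank FROM restored_data) aa

LEFT JOIN

(SELECT t_stamp, barcode, DENSE_RANK() OVER (PARTITION BY barcode, t_stamp ORDER BY idx) AS myrank FROM items_full) bb

ON bb.t_stamp = aa.t_stamp AND bb.barcode = aa.barcode AND bb.myrank = aa.myrank

-- Keep only the restored rows that found no partner
WHERE bb.myrank IS NULL;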
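
Since the question mentions Python as a fallback, the ranking idea above can also be sketched outside the database. This is only an illustration under assumptions: the function names number_occurrences and left_join_once are made up for this example, and the two tables are assumed to have been loaded into in-memory lists of (barcode, t_stamp) tuples.

```python
from collections import defaultdict


def number_occurrences(rows):
    """Pair each (barcode, t_stamp) row with its 1-based occurrence number
    within its group, mirroring a per-group rank ordered by insertion."""
    seen = defaultdict(int)
    numbered = []
    for key in rows:
        seen[key] += 1
        numbered.append((key, seen[key]))
    return numbered


def left_join_once(restored, items):
    """Left join restored to items so that each items row is used at most once.

    Returns (restored_row, matched_items_row_or_None) pairs.
    """
    available = set(number_occurrences(items))
    return [
        (key, key if (key, rank) in available else None)
        for key, rank in number_occurrences(restored)
    ]


# Hypothetical sample data matching the example in the question:
# four identical rows in restoredData, one in items + items_archive.
restored = [("barcode1", "timestamp1")] * 4
items = [("barcode1", "timestamp1")]

result = left_join_once(restored, items)
for left, right in result:
    print(left, right)
```

The first restored row pairs with the single items row and the remaining three come back with None, which corresponds to the NULL rows in the desired output.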

I'd start with counting. Where the count per barcode and timestamp does not match, you'll have to inspect the related records.

select
  r.barcode,
  r.t_stamp,
  r.cnt as recover_count,
  coalesce(i.cnt, 0) as itemtables_count
from
(
  select barcode, t_stamp, count(*) as cnt
  from restoredData
  group by barcode, t_stamp
) r
left join
(
  select barcode, t_stamp, count(*) as cnt
  from
  (
    select barcode, t_stamp from items
    union all
    select barcode, t_stamp from items_archive
  ) all_items  -- "both" is a reserved word in MySQL, so it cannot be an unquoted alias
  group by barcode, t_stamp
) i on  i.barcode = r.barcode
    and i.t_stamp = r.t_stamp
-- compare the counts in WHERE rather than ON, so that groups with equal counts are dropped
where i.cnt is null or i.cnt <> r.cnt;
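
The counting comparison above can also be tried on a small sample before running it against the full tables. A minimal Python sketch, assuming the rows are available as lists of (barcode, t_stamp) tuples; the function name surplus_counts and the sample values are hypothetical.

```python
from collections import Counter


def surplus_counts(restored, items):
    """Return {(barcode, t_stamp): n} for every key that occurs n more times
    in restored than in items. Counter subtraction keeps positive counts only."""
    return dict(Counter(restored) - Counter(items))


# Hypothetical sample rows
a = ("barcode1", "timestamp1")
b = ("barcode2", "timestamp2")
c = ("NO_READ", "timestamp3")

restored = [a, a, a, a, b]
items = [a, b, c]

surplus = surplus_counts(restored, items)
print(surplus)
```

Keys whose counts agree drop out, as do keys that exist only on the items side, so what remains are the (barcode, t_stamp) groups with more rows in the restored data than in the item tables.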
