
Find duplicates, then update a table with the id from the main table and then delete records in a table

My issue is as follows. I have two tables: the personaldata table consists of employee records, and the whoisonboard table records when an employee has been onboard. We have duplicates in the personaldata table, and the different ids of these duplicates are also stored in the whoisonboard table, written there when people were checked in. Finding the duplicates is no problem.

First, delete all rows in the personaldata table that do not exist in the whoisonboard table:

  DELETE FROM personaldata WHERE id NOT IN (SELECT personid FROM whoisonboard)

This deletes any person who has never been on a ship, since there is no record for them in the whoisonboard table.

Next, delete any records in whoisonboard that do not have a corresponding record in personaldata, to make sure there are no orphaned whoisonboard records:

  DELETE FROM whoisonboard WHERE personid NOT IN (SELECT id FROM personaldata)

We can find all the duplicates in the personaldata table together with their whoisonboard records. To identify duplicates, the query looks for rows where the fields names, date_of_birth and nationality are the same:

 select a.id as personid, b.id as whoisid, b.personid whoispersonid, a.names, a.date_of_birth, a.nationality 
 from personaldata a
 join whoisonboard b on a.id = b.personid 
   where  (a.names, a.date_of_birth, a.nationality) in (
     select a.names, a.date_of_birth, a.nationality
      from personaldata a
      group  by a.names, a.date_of_birth, a.nationality
      having count(distinct a.id) > 1
    )
  order by date_of_birth desc

We can then issue this SQL statement to update the records, and later delete the orphaned records of the duplicates. If we have a lot of duplicates, doing this one by one is time-consuming.

UPDATE whoisonboard SET personid = '74777a8e-343c-11e9-a2bb-000c2912dae9' 
WHERE `id` LIKE '5bd2c268-ec4d-11e8-ab89-000c29045ceb'

Then at the end, I would just delete the orphan records with:

DELETE FROM personaldata WHERE id NOT IN (SELECT personid FROM whoisonboard)

I have been trying to build a SQL statement that does the update in one go, but it fails:

 update whoisonboard set personid = final_id 
 from whoisonboard 
 join personaldata on personaldata.id = whoisonboard.personid 
 join ( select names, date_of_birth, nationality, min(id) as final_id from 
 personaldata group by names, date_of_birth, nationality ) min_ids on 
 min_ids.names = personaldata.names

I get an error when executing it. I wonder whether what I am trying to do is possible in one SQL statement. We try to avoid duplicates, but they do happen, and it would be good to have a simple way to clean up the database.

I just did this to correct a similar problem in my data warehouse.

I'm including much pseudocode because this is lengthy and I don't want to bother testing it for your case. Also, mine was for SQL Server, so the code probably wouldn't work for you. So here is the concept...

Create a temp table to store all natural key code combinations and the ids (many ids per natural key).

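-- column types are assumptions; adjust them to your schema
create table #p (id int identity(1,1), personkey varchar(200), personid int)
insert into #p (personkey, personid)
select lastname + ',' + firstname, personid
from personaldata
order by 1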

Create a temp table to store the minimum id for each natural key value (one id per natural key).

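-- column types are assumptions; adjust them to your schema
create table #pmin (id int identity(1,1), personkey varchar(200), personid int)
insert into #pmin (personkey, personid)
select personkey, min(personid) as personid
from #p
group by personkey
order by 1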
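The mapping that the two temp tables encode (every duplicate id paired with the lowest id for the same natural key) can also be computed in a single query. The following is only a minimal sketch, written with Python's built-in sqlite3 module so that it runs standalone. It assumes the question's table and column names (personaldata, names, date_of_birth, nationality); the sample rows and the helper name build_id_map are made up for illustration.

```python
import sqlite3


def build_id_map(conn):
    """Return {duplicate_id: surviving_id} for rows that share a natural key.

    The surviving id is the smallest id within each
    (names, date_of_birth, nationality) group.
    """
    rows = conn.execute(
        """
        SELECT p.id, m.final_id
        FROM personaldata AS p
        JOIN (
            SELECT names, date_of_birth, nationality, MIN(id) AS final_id
            FROM personaldata
            GROUP BY names, date_of_birth, nationality
        ) AS m
          ON m.names = p.names
         AND m.date_of_birth = p.date_of_birth
         AND m.nationality = p.nationality
        WHERE p.id <> m.final_id
        """
    ).fetchall()
    return dict(rows)


# Hypothetical sample data: 'a1' and 'a2' describe the same person.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE personaldata (
        id TEXT PRIMARY KEY,
        names TEXT NOT NULL,
        date_of_birth TEXT NOT NULL,
        nationality TEXT NOT NULL
    );
    INSERT INTO personaldata VALUES
        ('a1', 'Ann Lee', '1990-01-01', 'NO'),
        ('a2', 'Ann Lee', '1990-01-01', 'NO'),
        ('b1', 'Bob Roy', '1985-05-05', 'DK');
    """
)
id_map = build_id_map(conn)
print(id_map)
```

Note that the join matches on all three key columns. Joining on names alone, as in the failing statement in the question, would pair people who merely share a name.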

Loop through the records of #pmin, update whoisonboard, and tidy personaldata.

-- declare and initialize variables (types are assumptions; match them to your columns)
declare @i int = 0, @max int, @thisVal varchar(200), @idMin int
declare @a table (personid int)
select @max = max(id) from #pmin

-- loop through #pmin from id = 1 to @max
while @i < @max
begin
    -- increment counter
    set @i = @i + 1
    -- store the values of personkey and personid for this iteration
    select @thisVal = personkey, @idMin = personid from #pmin where id = @i
    -- store all values of personid for this personkey from #p (I used a table variable @a)
    delete from @a
    insert @a select personid from #p where personkey = @thisVal
    -- update whoisonboard: set personid to the min personid for all values of personid
    update whoisonboard set personid = @idMin where personid in (select personid from @a)
    -- delete all but the first personaldata record for this iteration
    delete personaldata where personid in (select personid from @a where personid <> @idMin)
end
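The loop can, in principle, be replaced by two set-based statements: one that repoints every whoisonboard row at the lowest id for its person, and one that deletes the personaldata rows that are no longer the lowest id. What follows is a hedged sketch and not the code from the answer above. It uses Python's sqlite3 module so it can be run as-is, borrows the table and column names from the question, and assumes the three key columns are never NULL. The function name merge_duplicates and the sample rows are hypothetical.

```python
import sqlite3


def merge_duplicates(conn):
    """Repoint whoisonboard at the surviving id, then drop the duplicates."""
    with conn:  # one transaction: commit on success, roll back on error
        # Step 1: replace each personid with the smallest id that shares
        # the same (names, date_of_birth, nationality).
        conn.execute(
            """
            UPDATE whoisonboard
            SET personid = (
                SELECT MIN(p2.id)
                FROM personaldata AS p1
                JOIN personaldata AS p2
                  ON p2.names = p1.names
                 AND p2.date_of_birth = p1.date_of_birth
                 AND p2.nationality = p1.nationality
                WHERE p1.id = whoisonboard.personid
            )
            WHERE personid IN (SELECT id FROM personaldata)
            """
        )
        # Step 2: keep only the smallest id per natural key.
        conn.execute(
            """
            DELETE FROM personaldata
            WHERE id NOT IN (
                SELECT MIN(id)
                FROM personaldata
                GROUP BY names, date_of_birth, nationality
            )
            """
        )


# Hypothetical sample data: 'a1' and 'a2' are the same person.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE personaldata (
        id TEXT PRIMARY KEY,
        names TEXT NOT NULL,
        date_of_birth TEXT NOT NULL,
        nationality TEXT NOT NULL
    );
    CREATE TABLE whoisonboard (id TEXT PRIMARY KEY, personid TEXT);
    INSERT INTO personaldata VALUES
        ('a1', 'Ann Lee', '1990-01-01', 'NO'),
        ('a2', 'Ann Lee', '1990-01-01', 'NO'),
        ('b1', 'Bob Roy', '1985-05-05', 'DK');
    INSERT INTO whoisonboard VALUES ('w1', 'a1'), ('w2', 'a2'), ('w3', 'b1');
    """
)
merge_duplicates(conn)
onboard = conn.execute("SELECT id, personid FROM whoisonboard ORDER BY id").fetchall()
persons = conn.execute("SELECT id FROM personaldata ORDER BY id").fetchall()
print(onboard)
print(persons)
```

The SQL here is SQLite's dialect, so treat it as an outline. MySQL in particular rejects a DELETE whose subquery reads the table being deleted from, so the second statement would need a derived table or a join there. Unlike the orphan cleanup in the question, this delete keeps people who have never been onboard.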

My code also included some other steps that I needed to perform for my case, as well as a lot of testing/data comparison code to verify I did the right thing at each step.

  • Report dates, pay, or whatever for each person before and after. They should match exactly.
  • Verify you got the first and last record.
  • other checks as you see fit
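Checks like these can be scripted so that the same queries run before and after the merge. The sketch below is an illustration only, again using Python's sqlite3 module and the question's table names. The three helper functions and the sample rows are hypothetical, and the merge itself is simulated with the manual UPDATE and DELETE from the question.

```python
import sqlite3


def onboard_counts(conn):
    """whoisonboard rows per person, keyed by natural key rather than id."""
    return conn.execute(
        """
        SELECT p.names, p.date_of_birth, p.nationality, COUNT(*)
        FROM whoisonboard AS w
        JOIN personaldata AS p ON p.id = w.personid
        GROUP BY p.names, p.date_of_birth, p.nationality
        ORDER BY 1, 2, 3
        """
    ).fetchall()


def remaining_duplicates(conn):
    """Number of natural keys that still have more than one personaldata row."""
    return conn.execute(
        """
        SELECT COUNT(*) FROM (
            SELECT 1 FROM personaldata
            GROUP BY names, date_of_birth, nationality
            HAVING COUNT(*) > 1
        )
        """
    ).fetchone()[0]


def orphan_rows(conn):
    """Number of whoisonboard rows whose personid no longer exists."""
    return conn.execute(
        """
        SELECT COUNT(*) FROM whoisonboard
        WHERE personid NOT IN (SELECT id FROM personaldata)
        """
    ).fetchone()[0]


# Hypothetical sample data: 'a1' and 'a2' are the same person.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE personaldata (
        id TEXT PRIMARY KEY,
        names TEXT NOT NULL,
        date_of_birth TEXT NOT NULL,
        nationality TEXT NOT NULL
    );
    CREATE TABLE whoisonboard (id TEXT PRIMARY KEY, personid TEXT NOT NULL);
    INSERT INTO personaldata VALUES
        ('a1', 'Ann Lee', '1990-01-01', 'NO'),
        ('a2', 'Ann Lee', '1990-01-01', 'NO'),
        ('b1', 'Bob Roy', '1985-05-05', 'DK');
    INSERT INTO whoisonboard VALUES ('w1', 'a1'), ('w2', 'a2'), ('w3', 'b1');
    """
)

before = onboard_counts(conn)

# Simulated merge of 'a2' into 'a1'.
conn.execute("UPDATE whoisonboard SET personid = 'a1' WHERE personid = 'a2'")
conn.execute("DELETE FROM personaldata WHERE id = 'a2'")

after = onboard_counts(conn)
print(before == after, remaining_duplicates(conn), orphan_rows(conn))
```

Comparing by natural key rather than by id is what makes the before and after snapshots comparable, since the ids are exactly what the merge changes.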

Altogether, my code was about 600 lines. (That's why I didn't want to go to that extent here.) But what I have provided here should be a sufficient outline to accomplish your task.
