简体   繁体   中英

How to mark duplicates in an SQL query

I have an SQL query which looks at date-of-birth, last name and a soundex of first name to identify duplicates. The following query finds some 8,000 rows (which I assume means there are around 8,000 duplicate records).

select dob,last_name,soundex(first_name),count(*)
from clients
group by dob,last_name,soundex(first_name)
having count(*) >1

Almost all of the results have a count of 2, a few have a count of 3 where obviously the record existed twice in one of the two databases which were merged.

The next step I need to take is to mark one of the rows, doesn't really matter, with a duplicate flag and to mark each row with the opposite rows key. Is there a way of doing this using SQL?

Here is a query that will give you not only the duplicates, but also the first id inserted (assuming Id is the sequential primary-key column) and the newest id.

OTTOMH

select dob, last_name, soundex(first_name) firstnamesoundex, min (Id) OldestId, max (Id) NewestId, Count (*) NumRows
from clients
group by dob,last_name,soundex(first_name)
having count(*) >1

You can use this in a JOIN to do your update

UPDATE Clients
SET OppositeRowId = DuplicateRows.NewestId
FROM
(
    select dob, last_name, soundex(first_name) firstnamesoundex, min (Id) OldestId, max (Id) NewestId, Count (*) NumRows
    from clients
    group by dob,last_name,soundex(first_name)
    having count(*) >1
) DuplicateRows
WHERE
DuplicateRows.OldestId = Clients.Id

All of this assumes that you have one duplicate . If you have more than one, you are going to have to try something different.

好吧,您可以使用SELECT DISTINCT,然后将单行标记为“ notplicate”-然后搜索“ notplicate”的行以找到重复的行。

This should do what you are after, the UPDATE in one go.

UPDATE FROM clients c
INNER JOIN
(
  select dob,last_name,soundex(first_name),MIN(id) as keep
  from clients
  group by dob,last_name,soundex(first_name)
  having count(*) >1
) k
ON c.dob=k.dob AND c.last_name=k.last_name AND soundex(c.first_name)=soundex(k.first_name)
SET duplicateid = NULLIF(k.keep, c.id),
    hasduplicate = (k.keep = c.id)

It assumes you have 3 columns not stated in the question

  • id: primary key
  • duplicateid: points to the dup being kept
  • hasduplicate: boolean, marks the one to keep

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM