How to mark duplicates in an SQL query

Question

I have an SQL query which looks at date-of-birth, last name and a soundex of first name to identify duplicates. The following query finds some 8,000 rows (which I assume means there are around 8,000 duplicate records).

select dob,last_name,soundex(first_name),count(*)
from clients
group by dob,last_name,soundex(first_name)
having count(*) >1

Almost all of the results have a count of 2, a few have a count of 3 where obviously the record existed twice in one of the two databases which were merged.

The next step I need to take is to mark one of the rows, doesn't really matter, with a duplicate flag and to mark each row with the opposite rows key. Is there a way of doing this using SQL?

Answer 1

Here is a query that will give you not only the duplicates, but also the first id inserted (assuming Id is the sequential primary-key column) and the newest id.

OTTOMH

select dob, last_name, soundex(first_name) firstnamesoundex, min (Id) OldestId, max (Id) NewestId, Count (*) NumRows
from clients
group by dob,last_name,soundex(first_name)
having count(*) >1

You can use this in a JOIN to do your update

UPDATE Clients
SET OppositeRowId = DuplicateRows.NewestId
FROM
(
    select dob, last_name, soundex(first_name) firstnamesoundex, min (Id) OldestId, max (Id) NewestId, Count (*) NumRows
    from clients
    group by dob,last_name,soundex(first_name)
    having count(*) >1
) DuplicateRows
WHERE
DuplicateRows.OldestId = Clients.Id

All of this assumes that you have one duplicate . If you have more than one, you are going to have to try something different.

Answer 2

好吧，您可以使用SELECT DISTINCT，然后将单行标记为“ notplicate”-然后搜索“ notplicate”的行以找到重复的行。

Answer 3

This should do what you are after, the UPDATE in one go.

UPDATE FROM clients c
INNER JOIN
(
  select dob,last_name,soundex(first_name),MIN(id) as keep
  from clients
  group by dob,last_name,soundex(first_name)
  having count(*) >1
) k
ON c.dob=k.dob AND c.last_name=k.last_name AND soundex(c.first_name)=soundex(k.first_name)
SET duplicateid = NULLIF(k.keep, c.id),
    hasduplicate = (k.keep = c.id)

It assumes you have 3 columns not stated in the question

id: primary key
duplicateid: points to the dup being kept
hasduplicate: boolean, marks the one to keep

How to mark duplicates in an SQL query

Question

3 answers

solution1
2 2011-01-23 21:37:33

solution2
2 2011-01-23 21:38:53

solution3
2 ACCPTED 2011-01-23 21:52:14

How to mark duplicates in an SQL query

Question

3 answers

solution1 2 2011-01-23 21:37:33

solution2 2 2011-01-23 21:38:53

solution3 2 ACCPTED 2011-01-23 21:52:14

solution1
2 2011-01-23 21:37:33

solution2
2 2011-01-23 21:38:53

solution3
2 ACCPTED 2011-01-23 21:52:14