
SQL select entire record found in group

I'm taking advantage of the metaphone function in PostgreSQL to find duplicate records that may have been misspelled.

SELECT metaphone(first_name, 4), metaphone(last_name, 4)
FROM people
GROUP BY metaphone(last_name, 4), metaphone(first_name, 4)
HAVING COUNT(*) > 1;

This is great for showing me there are at least 100 potential duplicates in our database, but I'm not able to do much with that because I can't get any uniquely identifying information from the query results. I've tried this:

SELECT person_id, first_name, last_name
FROM people
WHERE metaphone(first_name, 16) IN (
    SELECT metaphone(first_name, 16)
    FROM people GROUP BY metaphone(last_name, 16),
    metaphone(first_name, 16) HAVING COUNT(*) > 1
)
AND metaphone(last_name, 16) IN (
    SELECT metaphone(last_name, 16)
    FROM people GROUP BY metaphone(last_name, 16),
    metaphone(first_name, 16) HAVING COUNT(*) > 1
)
ORDER BY last_name, first_name;

This kind of works, but the results still include some records that don't match on both fields. For example, I could have two 'John Smith', two 'Jane Smith', and two 'John Doe' records. Even with only one 'Jane Doe', she would still appear in the results of the second query, because 'Jane' matches one duplicated group and 'Doe' matches another.
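The false positive can be reproduced with a small scratch table. This is only a sketch: the table name people_demo and its rows are made up for illustration, and it assumes the fuzzystrmatch extension (which provides metaphone) can be installed in the current database.

```sql
-- Hypothetical scratch data; people_demo is not part of the real schema.
CREATE EXTENSION IF NOT EXISTS fuzzystrmatch;  -- provides metaphone()

CREATE TEMP TABLE people_demo (
    person_id  serial PRIMARY KEY,
    first_name text,
    last_name  text
);

INSERT INTO people_demo (first_name, last_name) VALUES
    ('John', 'Smith'), ('John', 'Smith'),
    ('Jane', 'Smith'), ('Jane', 'Smith'),
    ('John', 'Doe'),   ('John', 'Doe'),
    ('Jane', 'Doe');   -- the only row without a duplicate

-- Same shape as the second query: each IN list is checked on its own,
-- so a row passes if its first name appears in ANY duplicated group
-- and its last name appears in ANY duplicated group.
SELECT person_id, first_name, last_name
FROM people_demo
WHERE metaphone(first_name, 16) IN (
    SELECT metaphone(first_name, 16)
    FROM people_demo
    GROUP BY metaphone(last_name, 16), metaphone(first_name, 16)
    HAVING COUNT(*) > 1
)
AND metaphone(last_name, 16) IN (
    SELECT metaphone(last_name, 16)
    FROM people_demo
    GROUP BY metaphone(last_name, 16), metaphone(first_name, 16)
    HAVING COUNT(*) > 1
)
ORDER BY last_name, first_name;
```

With this data the single 'Jane Doe' row is returned alongside the real duplicates, since 'Jane' comes from the 'Jane Smith' group and 'Doe' from the 'John Doe' group.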

Is there some way to more accurately get only the rows that are being used to compile the results of the first query?

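You need to do both comparisons at once, treating the two metaphone values as a single row value, so that a row matches only when its (first name, last name) pair belongs to a duplicated group: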

SELECT person_id, first_name, last_name
FROM people
WHERE (metaphone(first_name, 16), metaphone(last_name, 16)
      ) IN (SELECT metaphone(first_name, 16), metaphone(last_name, 16)
            FROM people
            GROUP BY metaphone(first_name, 16), metaphone(last_name, 16)
            HAVING COUNT(*) > 1
           )
ORDER BY last_name, first_name;
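As an alternative sketch (not part of the original answer), a window function can count the size of each metaphone group in a single pass over the table and keep only rows whose group has more than one member. The alias names counted and group_size are illustrative, and the query assumes the same people table and fuzzystrmatch extension as above.

```sql
-- Sketch: count rows per (first, last) metaphone pair with a window
-- function, then keep only rows that share their pair with another row.
SELECT person_id, first_name, last_name
FROM (
    SELECT person_id, first_name, last_name,
           COUNT(*) OVER (
               PARTITION BY metaphone(first_name, 16),
                            metaphone(last_name, 16)
           ) AS group_size          -- illustrative alias
    FROM people
) AS counted                        -- illustrative alias
WHERE group_size > 1
ORDER BY last_name, first_name;
```

Both forms should return the same rows; which one performs better depends on the table size and available indexes, so it is worth comparing them with EXPLAIN ANALYZE on real data.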
