
SQL for Fuzzy Match Deduplication

Table A contains duplicate records for the same entity, with subtle string variations. There is no key that uniquely identifies an entity: the "ID" field identifies a record inside the table, but not the entity itself.

    TABLE A
    --------------
    ID;SomeString
    1;something1
    2;something2
    3;something3

Using fuzzy match software, Table A is fuzzy matched against itself in order to detect duplicate records. That produces lookup Table B, which has two columns, ID1 and ID2, holding the IDs of matched records from Table A.

    TABLE B
    ---------
    ID1;ID2
    1;2
    1;3
    2;1
    2;3
    3;1
    3;2

The result of deduplication should be that records 2 and 3 are deleted from Table A, so that only the first record is retained.

    TABLE A
    --------------
    ID;SomeString
    1;something1

Is there a way to perform such fuzzy match deduplication of Table A in SQL, using Table B as the lookup table of identified duplicate records? To clarify, I'm not asking how to do the fuzzy match or identify duplicates; that is already done and the results are in Table B. I'm asking how to delete the duplicates, retaining one record per group of identified duplicates, based on the already identified duplicate pairs (there are multiple pairs per entity).

The main problem I see is that your table of fuzzy matches contains duplicate pairs with the order of the IDs reversed. This means you have rows saying both that 2 is a duplicate of 1 and that 1 is a duplicate of 2. If you deleted rows based on the ID2 column of Table B, you'd just end up deleting all the rows in Table A.

You can solve this problem with a select statement that normalizes each pair so that the smaller ID always comes first and the larger ID is treated as the duplicate. That way the previous example of "2 is a duplicate of 1, and 1 is a duplicate of 2" becomes just a repetition of "2 is a duplicate of 1". At that point, you can select the distinct duplicate IDs to get the list of IDs to delete from Table A.
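The normalization step can be tried out in isolation. Here is a minimal sketch, assuming SQLite through Python's built-in `sqlite3` module as a stand-in for whatever database you actually use. The table name `Table_B` is taken from the question's lookup table; the helper name `normalized_pairs` and the column aliases `SmallerID` and `LargerID` are made up for illustration.

```python
import sqlite3

def normalized_pairs(rows):
    """Return the distinct (smaller ID, larger ID) pairs for the given Table_B rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE Table_B (ID1 INTEGER, ID2 INTEGER)")
    conn.executemany("INSERT INTO Table_B VALUES (?, ?)", rows)
    result = conn.execute(
        """
        SELECT DISTINCT
            CASE WHEN ID1 < ID2 THEN ID1 ELSE ID2 END AS SmallerID,
            CASE WHEN ID1 < ID2 THEN ID2 ELSE ID1 END AS LargerID
        FROM Table_B
        WHERE ID1 <> ID2          -- ignore a record matched against itself
        ORDER BY SmallerID, LargerID
        """
    ).fetchall()
    conn.close()
    return result

# The six rows of Table B from the question.
sample_b = [(1, 2), (1, 3), (2, 1), (2, 3), (3, 1), (3, 2)]
print(normalized_pairs(sample_b))  # [(1, 2), (1, 3), (2, 3)]
```

The six rows collapse to three pairs, and the IDs in the second column (2 and 3) are the ones to delete.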

Based on your sample data, the following query deletes the correct rows:

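    -- Collect the larger ID of every matched pair; those rows are the duplicates.
    -- A pair with ID1 = ID2 yields NULL, which never matches in the IN list.
    WITH Duplicates (ID) AS
    (
        SELECT DISTINCT
            CASE
                WHEN ID1 > ID2 THEN ID1
                WHEN ID2 > ID1 THEN ID2
            END
        FROM Table_B
    )
    -- Delete every record that appears as the larger ID of some pair.
    DELETE
    FROM Table_A
    WHERE ID IN (SELECT ID FROM Duplicates);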
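To check the delete against the sample data from the question, the statement can be run on a throwaway in-memory database. This is only a sketch, assuming SQLite through Python's built-in `sqlite3` module rather than your actual database; the helper name `deduplicate` and the column types are assumptions made for illustration.

```python
import sqlite3

# The delete statement from the answer, run here against SQLite.
DELETE_DUPLICATES = """
WITH Duplicates (ID) AS
(
    SELECT DISTINCT
        CASE
            WHEN ID1 > ID2 THEN ID1
            WHEN ID2 > ID1 THEN ID2
        END
    FROM Table_B
)
DELETE
FROM Table_A
WHERE ID IN (SELECT ID FROM Duplicates)
"""

def deduplicate(table_a, table_b):
    """Load both tables into an in-memory database, run the delete, return the surviving rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE Table_A (ID INTEGER PRIMARY KEY, SomeString TEXT)")
    conn.execute("CREATE TABLE Table_B (ID1 INTEGER, ID2 INTEGER)")
    conn.executemany("INSERT INTO Table_A VALUES (?, ?)", table_a)
    conn.executemany("INSERT INTO Table_B VALUES (?, ?)", table_b)
    conn.execute(DELETE_DUPLICATES)
    remaining = conn.execute(
        "SELECT ID, SomeString FROM Table_A ORDER BY ID"
    ).fetchall()
    conn.close()
    return remaining

sample_a = [(1, "something1"), (2, "something2"), (3, "something3")]
sample_b = [(1, 2), (1, 3), (2, 1), (2, 3), (3, 1), (3, 2)]
print(deduplicate(sample_a, sample_b))  # [(1, 'something1')]
```

Only record 1 survives, as requested. Note that this relies on Table B listing every pair inside a group, as the sample does. If the matcher reported only 1;3 and 2;3, records 1 and 2 would both survive, so chained matches would need an extra grouping step.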


 