Derive groups of records that match over multiple columns, but where some column values might be NULL

I would like an efficient means of deriving groups of matching records across multiple fields. Let's say I have the following table:

CREATE TABLE cust
(
    id INT NOT NULL,
    class VARCHAR(1) NULL,
    cust_type VARCHAR(1) NULL,
    terms VARCHAR(1) NULL
);

INSERT INTO cust
VALUES
    (1,'A',NULL,'C'),
    (2,NULL,'B','C'),
    (3,'A','B',NULL),
    (4,NULL,NULL,'C'),
    (5,'D','E',NULL),
    (6,'D',NULL,NULL);

What I am looking to get is the set of IDs for which matching values unify a set of records over the three fields (class, cust_type and terms), so that I can apply a unique ID to the group.

In the example, records 1-4 constitute one match group over the three fields, while records 5-6 form a separate match.

The following does the job:

SELECT
    DISTINCT
    a.id,
    DENSE_RANK() OVER (ORDER BY max(b.class),max(b.cust_type),max(b.terms)) AS match_group
FROM cust AS a
INNER JOIN
    cust AS b
ON
    a.class = b.class
    OR a.cust_type = b.cust_type
    OR a.terms = b.terms
GROUP BY a.id
ORDER BY a.id

id match_group
-- -----------
 1 1
 2 1
 3 1
 4 1
 5 2
 6 2
**But, is there a better way?** Running this query on a table of over a million rows is painful...

As Graham pointed out in the comments, the above query doesn't satisfy the requirements if another record is added that would group all the records together.

The following values should be grouped together in one group:

INSERT INTO cust
VALUES
    (1,'A',NULL,'C'),
    (2,NULL,'B','C'),
    (3,'A','B',NULL),
    (4,NULL,NULL,'C'),
    (5,'D','E',NULL),
    (6,'D',NULL,NULL),
    (7,'D','B','C');

These should yield:

id match_group
-- -----------
 1 1
 2 1
 3 1
 4 1
 5 1
 6 1
 7 1

...because the class value D groups records 5, 6 and 7. Through record 7, the terms value C brings records 1, 2 and 4 into that group, and the cust_type value B (or the class value A) pulls in record 3.
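The grouping described above is transitive: two records belong together if a chain of shared non-NULL values connects them. As a rough illustration only (not part of the original question, and done outside the database), the sketch below treats this as a connected-components problem and solves it with a union-find structure in Python. The names `ROWS` and `match_groups` are made up for this example, and it assumes the rows fit in memory.

```python
# Sample data from the question, including the extra record 7.
ROWS = [
    (1, "A", None, "C"),
    (2, None, "B", "C"),
    (3, "A", "B", None),
    (4, None, None, "C"),
    (5, "D", "E", None),
    (6, "D", None, None),
    (7, "D", "B", "C"),
]


def match_groups(rows):
    """Map each record id to a dense group number (hypothetical helper)."""
    parent = {row[0]: row[0] for row in rows}

    def find(x):
        # Follow parent links to the root, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # Keep the smaller id as the root so results are deterministic.
            parent[max(root_a, root_b)] = min(root_a, root_b)

    # Remember the first record seen for each (column, value) pair and
    # merge every later record with the same pair into its group.
    first_seen = {}
    for rec_id, *values in rows:
        for col, val in enumerate(values):
            if val is None:
                continue  # NULL never matches anything
            key = (col, val)
            if key in first_seen:
                union(rec_id, first_seen[key])
            else:
                first_seen[key] = rec_id

    # Number the groups 1, 2, ... in order of their smallest id.
    numbering = {}
    result = {}
    for rec_id in sorted(parent):
        root = find(rec_id)
        numbering.setdefault(root, len(numbering) + 1)
        result[rec_id] = numbering[root]
    return result


if __name__ == "__main__":
    print(match_groups(ROWS[:6]))  # the original six records
    print(match_groups(ROWS))      # with record 7 added
```

With the original six records this gives two groups (1-4 and 5-6); with record 7 added, everything collapses into one group, which is the behaviour the self-join query above fails to produce.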

Hopefully that all makes sense.

I don't think you can do this with a (recursive) SELECT. I did something similar (trying to identify unique households) using a temporary table and repeated updates, with the following logic:

For each of class, cust_type and terms, get the minimum id per value and update the temp table with it:

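UPDATE temp
FROM
 (
  -- smallest id among all rows sharing the same class value
  SELECT
    class, -- similar for cust_type & terms
    MIN(id) AS min_id
  FROM temp
  GROUP BY class
 ) x
SET id = min_id
WHERE temp.class = x.class   -- rows with a NULL class never match
  AND temp.id <> x.min_id    -- only touch rows that actually change
;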

Repeat all three updates until none of them updates a row; every row then holds the smallest id in its match group.
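The repeat-until-stable loop can be sketched end to end as follows. This is only an illustration under several assumptions: it uses SQLite via Python's standard `sqlite3` module rather than the answer's database, it rewrites the update as a correlated subquery because the `UPDATE ... FROM ... SET` form above is not portable, and it keeps the group label in a separate `grp` column so the original `id` survives. The names `temp_cust`, `grp` and `label_groups` are hypothetical.

```python
import sqlite3

# Sample data from the question, including the extra record 7.
ROWS = [
    (1, "A", None, "C"),
    (2, None, "B", "C"),
    (3, "A", "B", None),
    (4, None, None, "C"),
    (5, "D", "E", None),
    (6, "D", None, None),
    (7, "D", "B", "C"),
]


def label_groups(rows):
    """Return {id: smallest id in its match group} (hypothetical helper)."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE temp_cust "
        "(id INT, grp INT, class TEXT, cust_type TEXT, terms TEXT)"
    )
    # Every record starts out as its own group: grp = id.
    con.executemany(
        "INSERT INTO temp_cust VALUES (?, ?, ?, ?, ?)",
        [(r[0], r[0], r[1], r[2], r[3]) for r in rows],
    )

    changed = True
    while changed:
        changed = False
        # One update per matching column; column names come from this
        # fixed tuple, so building the SQL text with them is safe.
        for col in ("class", "cust_type", "terms"):
            min_grp = (
                f"(SELECT MIN(x.grp) FROM temp_cust AS x "
                f"WHERE x.{col} = temp_cust.{col})"
            )
            cur = con.execute(
                f"UPDATE temp_cust SET grp = {min_grp} "
                f"WHERE {col} IS NOT NULL AND grp > {min_grp}"
            )
            if cur.rowcount > 0:
                changed = True  # something moved, so run another pass

    result = dict(con.execute("SELECT id, grp FROM temp_cust ORDER BY id"))
    con.close()
    return result


if __name__ == "__main__":
    print(label_groups(ROWS))
```

Labels only ever decrease and never drop below the smallest id in a group, so the loop terminates. On a large table, each pass would presumably want indexes on the three matching columns; that has not been measured here.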
