
How to count distinct combinations of values across multiple columns?

I have a table that looks a bit like this (let's call it table1):

column1 | column2 | column3
--------|---------|--------
111     | 222     | 333
111     | 222     | 333 
222     | 111     | 333
222     | 333     | 444

Edit for clarification: The data is in varchar format, as some but not all of it contains alphabetic characters. There are also 12 columns and over 100 possible values for those columns, rather than the 3 columns and 4 values in my example. I just trimmed it down for the example to try to make things clearer!

What I want to do is count the number of rows which have the same three values regardless of which columns they are in, i.e. these three rows:

column1 | column2 | column3
--------|---------|--------
111     | 222     | 333
111     | 222     | 333 
222     | 111     | 333

Because each row has 111, 222, and 333 in it, all three of these rows should be counted regardless of the fact that they have those values in different columns.

I can get a count of duplicate rows with this code, but for the example above it would return a count of 2 instead of the desired 3:

select  count(distinct combination)
from
        (
        select  column1||column2||column3 as combination
        from    table1
        )
; 

Does anyone know how to do this please?

This isn't a DB2-specific solution, but I tried it on MySQL and I suppose it will work on DB2 as well. I'm not sure whether it solves your problem 100%, but hopefully it helps.

In the where clause of the first select within "concat_union" I specified a concatenated value, which is then regarded as the 'original' string. It doesn't matter which combination ('111222333' vs. '222111333' vs. '333111222') you choose as the original, but you have to specify one so that you can look for duplicates using the union. The union consists of a concatenation of the columns in each possible order. The last step is then to count the values which match the original.

with concat_union as (
SELECT CONCAT(column1, column2, column3) as COMB, 'ORIG' as TYPE from table1 
    where CONCAT(column1, column2, column3) = '111222333'
UNION ALL
SELECT CONCAT(column1, column3, column2) as COMB, 'GEN' as TYPE from table1
UNION ALL 
SELECT CONCAT(column2, column1, column3) as COMB, 'GEN' as TYPE from table1 
UNION ALL
SELECT CONCAT(column2, column3, column1) as COMB, 'GEN' as TYPE from table1 
UNION ALL
SELECT CONCAT(column3, column1, column2) as COMB, 'GEN' as TYPE from table1 
UNION ALL
SELECT CONCAT(column3, column2, column1) as COMB, 'GEN' as TYPE from table1
)
select count(COMB) from concat_union where COMB = (select distinct COMB from 
concat_union where TYPE = 'ORIG');

You can try it out here https://sqlize.online/sql/mysql80/7df20f2753dec79e8bfb6a2e4f6445f7/

Try this as is:

WITH 
  ORIG (column1, column2, column3) AS
(
    -- Original table values
    VALUES
      (111, 222, 333)
    , (111, 222, 333)
    , (222, 111, 333)
    , (222, 333, 444)
    --, (222, 444, 333)
)
, ENUM AS
(
    -- Row enumeration
    SELECT ROW_NUMBER () OVER () AS RN_, O.*
    FROM ORIG O
)

SELECT SUM (CNT)
FROM
(
    -- Number of rows sharing the same ordered list of column values
    SELECT LST, COUNT (1) CNT
    FROM
    (
        -- Columns to rows 
        -- and get ordered list of column values for each original row
        SELECT E.RN_, LISTAGG (C.V, ',') WITHIN GROUP (ORDER BY C.V) LST
        FROM ENUM E
        CROSS JOIN TABLE (VALUES E.column1, E.column2, E.column3) C (V)
        GROUP BY E.RN_
    )
    GROUP BY LST
    HAVING COUNT (1) > 1
)

Order the values within each row so that column1 <= column2 <= column3; the rows then become directly comparable. Here is the query for the three-column case in PostgreSQL; I'm not sure about the DB2 syntax.

with tbl (column1, column2, column3) as (
  -- sample data
  select 111, 222, 333 union all 
  select 111, 333, 222 union all
  select 222, 111, 333 union all
  select 222, 333, 111 union all
  select 333, 222, 111 union all 
  select 333, 111, 222
)
select c1, c2, c3, count(*) n
from (
  select least(column1, column2, column3) c1 
       , greatest(least(column1, column2), least(column2, column3), least(column1, column3)) c2
       , greatest(column1, column2, column3) c3
  from tbl
) t 
group by c1, c2, c3;

returns

c1  c2  c3  n
111 222 333 6

db<>fiddle
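
The least/greatest expressions get unwieldy beyond three columns, and the question mentions twelve. As a rough illustration of the same idea outside SQL, the sketch below sorts the values of each row and counts the resulting combinations. It is written in Python only to show the logic; the names `rows`, `canonical` and `counts` are made up for this example, and the data is the question's sample table.

```python
from collections import Counter

# Sample rows copied from the question's table1; values are strings (varchar).
rows = [
    ("111", "222", "333"),
    ("111", "222", "333"),
    ("222", "111", "333"),
    ("222", "333", "444"),
]


def canonical(row):
    """Return the row's values in sorted order so column position no longer matters."""
    return tuple(sorted(row))


# Count how many rows share each sorted combination of values.
counts = Counter(canonical(row) for row in rows)

for combo, n in counts.items():
    print(combo, n)
```

With the sample data, the 111/222/333 combination is counted 3 times and the 222/333/444 combination once. Because the whole row is sorted, the same function works unchanged for 12 columns.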

I would first UNPIVOT the table (hopefully you have a PK on it), then compute an MD5 checksum over JSON_ARRAYAGG( val ORDER BY val /* RETURNING CLOB if the JSON text is bigger than the MAX size of the VARCHAR on your platform */ ). It is then trivial to find the PKs of rows whose checksums are identical. You could also use your platform's string aggregation function, provided it supports ORDER BY and, if needed, can return a CLOB.
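
To make the steps concrete, here is a minimal sketch of the same pipeline in Python rather than SQL: sort each row's values, serialise them as a JSON array, hash the result with MD5, and group primary keys by checksum. The primary keys 1 to 4 and the names `table1`, `row_checksum` and `duplicate_groups` are assumptions made for the illustration; the question does not say what the real key is.

```python
import hashlib
import json
from collections import defaultdict

# Hypothetical primary keys paired with the question's sample rows.
table1 = {
    1: ("111", "222", "333"),
    2: ("111", "222", "333"),
    3: ("222", "111", "333"),
    4: ("222", "333", "444"),
}


def row_checksum(values):
    """MD5 of the JSON array of the row's values, sorted (the ORDER BY val step)."""
    ordered_json = json.dumps(sorted(values))
    return hashlib.md5(ordered_json.encode("utf-8")).hexdigest()


# Group primary keys by the checksum of their sorted values.
pks_by_checksum = defaultdict(list)
for pk, values in table1.items():
    pks_by_checksum[row_checksum(values)].append(pk)

# Keep only the checksums shared by more than one row.
duplicate_groups = [pks for pks in pks_by_checksum.values() if len(pks) > 1]
print(duplicate_groups)
```

One possible reason to prefer a JSON array over plain concatenation is that each value stays quoted and delimited, so the varchar values '1','12' and '11','2' do not collapse into the same string '112'.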

