
PostgreSQL Select distinct values of comma separated values column excluding subsets

Assume a table foo with a column bar that holds comma-separated values ('a,b', 'a,b,c', 'a,b,c,d', 'd,e'). How can I select the largest combinations and exclude every value that is a subset of a larger combination?

For the data set above, the result should be ('a,b,c,d', 'd,e'). The first two values ('a,b', 'a,b,c') are excluded because they are subsets of 'a,b,c,d'.

Note that the values within each comma-separated string are sorted alphabetically.

I tried the query below, but the results are not what I need:

SELECT DISTINCT a.bar
FROM foo a
INNER JOIN foo b
        ON a.bar LIKE '%' || b.bar || '%'
       AND a.bar != b.bar

You can use string_to_array() to split the strings into an array. With the contains operator @> you can check whether an array contains another. (See "9.18. Array Functions and Operators" .)
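As a quick illustration of those two building blocks, the following minimal sketch can be run on its own. The literals are taken from the question's sample values, and the column aliases are made up for the example.

```sql
-- string_to_array() splits on the delimiter; @> tests "left contains right".
SELECT string_to_array('a,b,c', ',')                                   AS arr,          -- {a,b,c}
       string_to_array('a,b,c,d', ',') @> string_to_array('a,b', ',')  AS is_superset,  -- true
       string_to_array('d,e', ',')     @> string_to_array('a,b', ',')  AS unrelated;    -- false
```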

Use that in a NOT EXISTS clause. The condition fi.ctid <> fo.ctid makes sure that the two rows being compared have different physical addresses, that is, that a row is never compared with itself. Without it, every row would be excluded, because an array always contains itself.

SELECT fo.bar
       FROM foo fo
       WHERE NOT EXISTS (SELECT *
                                FROM foo fi
                                WHERE fi.ctid <> fo.ctid
                                      AND string_to_array(fi.bar, ',') @> string_to_array(fo.bar, ','));
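To try this out, the table can be set up as in the sketch below. The table definition and rows are assumptions based on the example in the question. One caveat with the ctid comparison: if the same bar value is stored in two rows, each row contains the other, so both are filtered out. The second statement is a possible variant, not part of the original answer, that deduplicates first and compares by value instead.

```sql
-- Sample data assumed from the question's example.
CREATE TABLE foo (bar text);
INSERT INTO foo (bar) VALUES ('a,b'), ('a,b,c'), ('a,b,c,d'), ('d,e');

-- Variant that tolerates duplicate rows: deduplicate, then compare by value.
-- "sets" and "elems" are names chosen for this sketch.
WITH sets AS (
    SELECT DISTINCT bar, string_to_array(bar, ',') AS elems
    FROM foo
)
SELECT s.bar
FROM sets s
WHERE NOT EXISTS (
    SELECT 1
    FROM sets o
    WHERE o.bar <> s.bar          -- skip the row itself
      AND o.elems @> s.elems      -- some other set contains this one
);
```

With the sample rows this should return 'a,b,c,d' and 'd,e', in no guaranteed order. It relies on the question's guarantee that the values in each string are sorted, so that equal sets are also equal strings.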

SQL Fiddle

But I cannot resist: Don't use comma separated strings in a relational database. You've got something way better. It's called "table".
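For illustration, a normalized layout could look like the sketch below. The table foo_item, its columns, and the set_id numbering are hypothetical and do not come from the question. Each element gets its own row, and the sets are only assembled into arrays at query time.

```sql
-- Hypothetical normalized layout: one row per (set, element) pair.
CREATE TABLE foo_item (
    set_id integer NOT NULL,
    item   text    NOT NULL,
    PRIMARY KEY (set_id, item)
);

INSERT INTO foo_item (set_id, item) VALUES
    (1, 'a'), (1, 'b'),
    (2, 'a'), (2, 'b'), (2, 'c'),
    (3, 'a'), (3, 'b'), (3, 'c'), (3, 'd'),
    (4, 'd'), (4, 'e');

-- Build one array per set, then keep the sets that no other set contains.
WITH sets AS (
    SELECT set_id, array_agg(item ORDER BY item) AS items
    FROM foo_item
    GROUP BY set_id
)
SELECT s.set_id, array_to_string(s.items, ',') AS bar
FROM sets s
WHERE NOT EXISTS (
    SELECT 1
    FROM sets o
    WHERE o.set_id <> s.set_id
      AND o.items @> s.items
);
```

With these rows the query should keep sets 3 and 4. As with the ctid approach, two different set_id values holding exactly the same items would exclude each other.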

First, split the string into sets of characters, then cross join the character sets with themselves, excluding rows where the character sets on both sides are the same.

Next, aggregate and use BOOL_OR in a HAVING clause to filter out any character set that is a subset of any other character set.

With a sample table declared in the CTE, the query becomes:

WITH foo(bar) AS (SELECT '("a,b" , "a,b,c" , "a,b,c,d" , "d,e")'::TEXT)
SELECT bar, string_to_array(elems[1], ',') not_subset
FROM foo
CROSS JOIN regexp_matches(bar, '[\w|,]+', 'g') elems 
CROSS JOIN regexp_matches(bar, '[\w|,]+', 'g') elems2
WHERE elems2[1] != elems[1] 
  -- the regex also matches the lone ',' separators between the sets, which must be ignored
  -- (alternatively, the regex could be refined so that it does not match them)
  AND elems2[1] != ','
  AND elems[1] != ','
GROUP BY 1, 2
HAVING NOT BOOL_OR(string_to_array(elems[1], ',') <@ string_to_array(elems2[1], ','))

This produces the output:

bar                                     not_subset
'("a,b" , "a,b,c" , "a,b,c,d" , "d,e")' {'d','e'}
'("a,b" , "a,b,c" , "a,b,c,d" , "d,e")' {'a','b','c','d'}
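The comment in the query mentions refining the regex as an alternative. One possible refinement, sketched here under the assumption that every set is wrapped in double quotes as in the sample string, is to capture only the text between the quotes. The separators then never match and the two extra filters are not needed. The CTE name sets and the aliases m, s and o are made up for this sketch.

```sql
WITH foo(bar) AS (
    SELECT '("a,b" , "a,b,c" , "a,b,c,d" , "d,e")'::text
),
sets AS (
    -- m[1] is the first capture group: the text between a pair of double quotes.
    SELECT f.bar, string_to_array(m[1], ',') AS elems
    FROM foo f
    CROSS JOIN LATERAL regexp_matches(f.bar, '"([^"]+)"', 'g') AS m
)
SELECT s.bar, s.elems AS not_subset
FROM sets s
WHERE NOT EXISTS (
    SELECT 1
    FROM sets o
    WHERE o.bar = s.bar           -- compare only sets from the same source row
      AND o.elems <> s.elems      -- skip the set itself
      AND o.elems @> s.elems      -- another set contains this one
);
```

This version uses NOT EXISTS instead of GROUP BY with BOOL_OR. For the sample string it should leave the same two sets, a,b,c,d and d,e.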

example in sql fiddle

