
PostgreSQL: select distinct values of a comma-separated column, excluding subsets

Assume a table foo whose column bar holds comma-separated values ('a,b', 'a,b,c', 'a,b,c,d', 'd,e'). How can I select only the largest combinations and exclude every value that is a subset of a larger combination?

For example, on the data set above the result should be:

('a,b,c,d', 'd,e'). The first two entries ('a,b', 'a,b,c') are excluded because they are subsets of ('a,b,c,d').

Note that the values in each comma-separated string are sorted alphabetically.

I tried the query below, but its results are quite far from what I need:

select distinct a.bar from foo a   inner join foo b
on a.bar like '%'|| b.bar||'%' 
and a.bar != b.bar

You can use string_to_array() to split each string into an array. With the contains operator @> you can check whether one array contains another. (See "9.18. Array Functions and Operators" in the PostgreSQL documentation.)
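As a quick illustration of the difference between substring matching and array containment, the sketch below compares the two on literal values (the column aliases are made up for this example). A subset such as 'a,c' is not a substring of 'a,b,c', so a LIKE test misses it, while @> on the split arrays does not.

```sql
-- Substring test vs. set containment on the same pair of values.
SELECT 'a,b,c' LIKE '%' || 'a,c' || '%'                                AS like_match,      -- false
       string_to_array('a,b,c', ',') @> string_to_array('a,c', ',')    AS array_contains,  -- true
       string_to_array('a,b,c,d', ',') @> string_to_array('d,e', ',')  AS contains_de;     -- false
```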

Use that in a NOT EXISTS clause. The condition fi.ctid <> fo.ctid is there to make sure a row is never compared with itself (ctid is the physical address of a row), since the array of a row would of course contain the array of that same row.

SELECT fo.bar
       FROM foo fo
       WHERE NOT EXISTS (SELECT *
                                FROM foo fi
                                WHERE fi.ctid <> fo.ctid
                                      AND string_to_array(fi.bar, ',') @> string_to_array(fo.bar, ','));
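To try the query without an existing table, a minimal setup could look like the following. This is only a sketch: it assumes a temporary table with a single text column, loaded with the sample values from the question. A temporary table is used rather than a CTE because ctid is a system column that exists only on real tables.

```sql
-- Sample data from the question, in a throwaway table.
CREATE TEMP TABLE foo (bar text);
INSERT INTO foo (bar) VALUES ('a,b'), ('a,b,c'), ('a,b,c,d'), ('d,e');

-- Keep a row only if no other row's array contains its array.
SELECT fo.bar
FROM foo fo
WHERE NOT EXISTS (SELECT *
                  FROM foo fi
                  WHERE fi.ctid <> fo.ctid
                    AND string_to_array(fi.bar, ',') @> string_to_array(fo.bar, ','))
ORDER BY fo.bar;
-- Returns two rows: 'a,b,c,d' and 'd,e'
```

One thing to keep in mind: if the same string is stored in two different rows, each row contains the other, so both are filtered out. If duplicates can occur, deduplicating first (for example with SELECT DISTINCT in a subquery) is one way around that.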


But I cannot resist: don't use comma-separated strings in a relational database. You've got something way better. It's called a "table".
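For what it's worth, a normalized layout might look like the sketch below. The table and column names (foo_item, set_id, item) are hypothetical and not part of the question; each former comma-separated string becomes one set_id with one row per element.

```sql
-- Hypothetical normalized schema: one row per (set, element).
CREATE TEMP TABLE foo_item (
    set_id integer NOT NULL,
    item   text    NOT NULL,
    PRIMARY KEY (set_id, item)
);

INSERT INTO foo_item (set_id, item) VALUES
    (1, 'a'), (1, 'b'),
    (2, 'a'), (2, 'b'), (2, 'c'),
    (3, 'a'), (3, 'b'), (3, 'c'), (3, 'd'),
    (4, 'd'), (4, 'e');

-- Rebuild each set as an array, then keep the sets that no other set contains.
WITH sets AS (
    SELECT set_id, array_agg(item ORDER BY item) AS items
    FROM foo_item
    GROUP BY set_id
)
SELECT s.set_id, array_to_string(s.items, ',') AS bar
FROM sets s
WHERE NOT EXISTS (SELECT 1
                  FROM sets o
                  WHERE o.set_id <> s.set_id
                    AND o.items @> s.items)
ORDER BY s.set_id;
-- Returns (3, 'a,b,c,d') and (4, 'd,e')
```

With an explicit set_id there is no need to rely on ctid to tell rows apart, and the primary key rules out duplicate elements within a set.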

First, process the string into sets of characters, then cross join the character sets with themselves, excluding rows where the character sets on both sides are the same.

Next, aggregate and use BOOL_OR in a HAVING clause to filter out any character set that is a subset of any other character set.

With a sample table declared in the CTE, the query becomes:

WITH foo(bar) AS (SELECT '("a,b" , "a,b,c" , "a,b,c,d" , "d,e")'::TEXT)
SELECT bar, string_to_array(elems[1], ',') not_subset
FROM foo
CROSS JOIN regexp_matches(bar, '[\w|,]+', 'g') elems 
CROSS JOIN regexp_matches(bar, '[\w|,]+', 'g') elems2
WHERE elems2[1] != elems[1] 
  -- my regex also matches the ',' between sets which need to be ignored
  -- alternatively, i have to refine the regex
  AND elems2[1] != ','
  AND elems[1] != ','
GROUP BY 1, 2
HAVING NOT BOOL_OR(string_to_array(elems[1], ',') <@ string_to_array(elems2[1], ','))

This produces the output:

bar                                     not_subset
'("a,b" , "a,b,c" , "a,b,c,d" , "d,e")' {'d','e'}
'("a,b" , "a,b,c" , "a,b,c,d" , "d,e")' {'a','b','c','d'}
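The comment in the query mentions refining the regex as an alternative to filtering out the ',' matches. One possible refinement, offered only as a sketch, is to capture the text between each pair of double quotes, so the separators between sets never match in the first place. The CTE name sets and the aliases m, s and o are made up for this example.

```sql
WITH foo(bar) AS (
    SELECT '("a,b" , "a,b,c" , "a,b,c,d" , "d,e")'::text
),
sets AS (
    -- One row per quoted set; m[1] is the text inside the quotes.
    SELECT m[1] AS elems
    FROM foo
    CROSS JOIN regexp_matches(bar, '"([^"]*)"', 'g') AS m
)
SELECT string_to_array(s.elems, ',') AS not_subset
FROM sets s
JOIN sets o ON o.elems <> s.elems
GROUP BY s.elems
-- Drop a set as soon as it is contained in any other set.
HAVING NOT bool_or(string_to_array(s.elems, ',') <@ string_to_array(o.elems, ','))
ORDER BY s.elems;
-- Returns {a,b,c,d} and {d,e}
```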

