PostgreSQL distinct rows joined with a count of distinct values in one column

Question

I'm using PostgreSQL 9.4, and I have a table with 13 million rows and with data roughly as follows:

  a  | b | u  | t 
-----+---+----+----
 foo | 1 |  1 | 10
 foo | 1 |  2 | 11
 foo | 1 |  2 | 11
 foo | 2 |  4 | 1
 foo | 3 |  5 | 2
 bar | 1 |  6 | 2
 bar | 2 |  7 | 2
 bar | 2 |  8 | 3
 bar | 3 |  9 | 4
 bar | 4 | 10 | 5
 bar | 5 | 11 | 6
 baz | 1 | 12 | 1
 baz | 1 | 13 | 2
 baz | 1 | 13 | 2
 baz | 1 | 13 | 3

There are indices on md5(a), on b, and on (md5(a), b). (In reality, a may contain values longer than 4k chars.) There is also a primary key column of type SERIAL, which I have omitted above.

I'm trying to build a query which will return the following results:

  a  | b | u  | t  | z 
-----+---+----+----+---
 foo | 1 |  1 | 10 | 3
 foo | 1 |  2 | 11 | 3
 foo | 2 |  4 | 1  | 3
 foo | 3 |  5 | 2  | 3
 bar | 1 |  6 | 2  | 5
 bar | 2 |  7 | 2  | 5
 bar | 2 |  8 | 3  | 5
 bar | 3 |  9 | 4  | 5
 bar | 4 | 10 | 5  | 5
 bar | 5 | 11 | 6  | 5

In these results, all rows are deduplicated as if GROUP BY a, b, u, t were applied, z is a count of distinct values of b for every partition over a, and only rows with a z value greater than 2 are included.

I can get just the z filter working as follows:

SELECT a, COUNT(b) AS z from (SELECT DISTINCT a, b FROM t) AS foo GROUP BY a
  HAVING COUNT(b) > 2;

However, I'm stumped on combining this with the rest of the data in the table.

What's the most efficient way to do this?

Answer 1

Your first step can already be simpler:

SELECT md5(a) AS md5_a, count(DISTINCT b) AS z
FROM   t
GROUP  BY 1
HAVING count(DISTINCT b) > 2;

I am working with md5(a) in place of a, since a can obviously be very long and you already have indexes on md5(a) and on (md5(a), b).

Since your table is big, you need an efficient query. This should be among the fastest possible solutions, given adequate index support. Your index on (md5(a), b) is instrumental, but, assuming b, u, and t are small columns, an index on (md5(a), b, u, t) would be even better for the second step of the query (the join back to t).
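A minimal sketch of that wider index, assuming the table is named t as in the question and that b, u, and t are small (for example integer) columns. The index name is made up for illustration.

```sql
-- Hypothetical index name; covers the expression md5(a) plus the three small columns.
-- On a 13-million-row table, building it takes a while and blocks writes
-- unless CREATE INDEX CONCURRENTLY is used instead.
CREATE INDEX t_md5_a_b_u_t_idx ON t (md5(a), b, u, t);

-- Refresh planner statistics so the new expression index is costed sensibly.
ANALYZE t;
```

Whether the planner actually picks this index depends on your data distribution and settings, so compare plans with EXPLAIN ANALYZE before and after.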

Your desired end result:

SELECT DISTINCT ON (md5(t.a), t.b, t.u, t.t)
       t.a, t.b, t.u, t.t, a.z
FROM  (
   SELECT md5(a) AS md5_a, count(DISTINCT b) AS z
   FROM   t
   GROUP  BY 1
   HAVING count(DISTINCT b) > 2
   ) a
JOIN   t ON md5(t.a) = a.md5_a
ORDER  BY md5(t.a), t.b, t.u, t.t;  -- optional; if present, must start with the DISTINCT ON expressions

Or, probably faster still:

SELECT a, b, u, t, z
FROM  (
   SELECT DISTINCT ON (1, 2, 3, 4)
          md5(t.a) AS md5_a, t.b, t.u, t.t, t.a
   FROM   t
   ) t
JOIN  (
   SELECT md5(a) AS md5_a, count(DISTINCT b) AS z
   FROM   t
   GROUP  BY 1
   HAVING count(DISTINCT b) > 2
   ) z USING (md5_a)
ORDER  BY 1, 2, 3, 4;  -- optional
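To try the queries on the question's sample rows, a throwaway table could be set up roughly as follows. The column types and the id column name are assumptions, since the question only shows the data and mentions a SERIAL primary key.

```sql
-- Scratch table mirroring the question's sample; types are assumed.
CREATE TABLE t (
   id serial PRIMARY KEY,
   a  text,
   b  integer,
   u  integer,
   t  integer
);

INSERT INTO t (a, b, u, t) VALUES
   ('foo', 1,  1, 10), ('foo', 1,  2, 11), ('foo', 1,  2, 11),
   ('foo', 2,  4,  1), ('foo', 3,  5,  2),
   ('bar', 1,  6,  2), ('bar', 2,  7,  2), ('bar', 2,  8,  3),
   ('bar', 3,  9,  4), ('bar', 4, 10,  5), ('bar', 5, 11,  6),
   ('baz', 1, 12,  1), ('baz', 1, 13,  2), ('baz', 1, 13,  2),
   ('baz', 1, 13,  3);

-- Sanity check of the z filter, grouping on a directly for readability.
-- On these rows: foo has 3 distinct b values, bar has 5, baz has 1 (filtered out).
SELECT a, count(DISTINCT b) AS z
FROM   t
GROUP  BY a
HAVING count(DISTINCT b) > 2;
```

With this data, the full queries should return the ten rows listed in the question's desired output.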

Detailed explanation for DISTINCT ON:

Select first row in each GROUP BY group?

solution1
3 ACCEPTED 2015-03-02 09:50:26