I'm using PostgreSQL 9.4, and I have a table with 13 million rows and with data roughly as follows:
a | b | u | t
-----+---+----+----
foo | 1 | 1 | 10
foo | 1 | 2 | 11
foo | 1 | 2 | 11
foo | 2 | 4 | 1
foo | 3 | 5 | 2
bar | 1 | 6 | 2
bar | 2 | 7 | 2
bar | 2 | 8 | 3
bar | 3 | 9 | 4
bar | 4 | 10 | 5
bar | 5 | 11 | 6
baz | 1 | 12 | 1
baz | 1 | 13 | 2
baz | 1 | 13 | 2
baz | 1 | 13 | 3
There are indices on md5(a)
, on b
, and on (md5(a), b)
. (In reality, a
may contain values longer than 4k chars.) There is also a primary key column of type SERIAL
which I have omitted above.
I'm trying to build a query which will return the following results:
a | b | u | t | z
-----+---+----+----+---
foo | 1 | 1 | 10 | 3
foo | 1 | 2 | 11 | 3
foo | 2 | 4 | 1 | 3
foo | 3 | 5 | 2 | 3
bar | 1 | 6 | 2 | 5
bar | 2 | 7 | 2 | 5
bar | 2 | 8 | 3 | 5
bar | 3 | 9 | 4 | 5
bar | 4 | 10 | 5 | 5
bar | 5 | 11 | 6 | 5
In these results, all rows are deduplicated as if GROUP BY a, b, u, t
were applied, z
is a count of distinct values of b
for every partition over a
, and only rows with a z
value greater than 2 are included.
I can get just the z
filter working as follows:
SELECT a, COUNT(b) AS z from (SELECT DISTINCT a, b FROM t) AS foo GROUP BY a
HAVING COUNT(b) > 2;
However, I'm stumped on combining this with the rest of the data in the table.
What's the most efficient way to do this?
Your first step can be simpler already:
SELECT md5(a) AS md5_a, count(DISTINCT b) AS z
FROM t
GROUP BY 1
HAVING count(DISTINCT b) > 2;
Working with md5(a)
in place of a
, since a
can obviously be very long, and you already have an index on md5(a)
etc.
Since your table is big , you need an efficient query. This should be among the fastest possible solutions - with adequate index support. Your index on (md5(a), b)
is instrumental but - assuming b
, u
, and t
are small columns - an index on (md5(a), b, u, t)
would be even better for the second step of the query (the lateral join).
Your desired end result:
SELECT DISTINCT ON (md5(t.a), b, u, t)
t.a, t.b, t.u, t.t, a.z
FROM (
SELECT md5(a) AS md5_a, count(DISTINCT b) AS z
FROM t
GROUP BY 1
HAVING count(DISTINCT b) > 2
) a
JOIN t ON md5(t.a) = md5_a
ORDER BY 1, 2, 3, 4; -- optional
Or probably faster, yet:
SELECT a, b, u, t, z
FROM (
SELECT DISTINCT ON (1, 2, 3, 4)
md5(t.a) AS md5_a, t.b, t.u, t.t, t.a
FROM t
) t
JOIN (
SELECT md5(a) AS md5_a, count(DISTINCT b) AS z
FROM t
GROUP BY 1
HAVING count(DISTINCT b) > 2
) z USING (md5_a)
ORDER BY 1, 2, 3, 4; -- optional
Detailed explanation for DISTINCT ON
:
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.