How can I use PostgreSQL's DISTINCT ON clause to also return a count of the duplicates?

Question

Suppose I have a table like this:

+--------+--------+------+--------+---------+
|   A    |   B    |  C   |   g    |    h    |
+--------+--------+------+--------+---------+
| cat    | dog    | bird | 34.223 |  54.223 |
| cat    | pigeon | goat |  23.23 |  54.948 |
| cat    | dog    | bird | 17.386 |  26.398 |
| gopher | pigeon | bird | 23.552 |  89.223 |
+--------+--------+------+--------+---------+

but with many more fields to the right (i, j, k, ...).

I need a resulting table that looks like:

+-----+--------+------+-----+-----+-----+-----+-------+
|  A  |   B    |  C   |  g  |  h  | ... |  z  | count |
+-----+--------+------+-----+-----+-----+-----+-------+
| cat | dog    | bird | xxx | xxx |     | xxx |    23 |
| cat | pigeon | goat | xxx | xxx |     | xxx |    78 |
+-----+--------+------+-----+-----+-----+-----+-------+

I would normally use a GROUP BY, but I don't want to have to repeat all of the column names (g, h, i, ... z).

I can currently get the result I want using a CTE and a correlated subquery combined with DISTINCT ON, but the query is very slow (500k+ records) and involves a lot of duplication:

WITH temp AS (
    SELECT a, b, c, COUNT(*)
    FROM my_table
    GROUP BY a, b, c
)
SELECT DISTINCT ON (a, b, c) *, (
    SELECT count
    FROM temp
    WHERE 
        temp.a = t.a 
        AND temp.b = t.b 
        AND temp.c = t.c
) as count
FROM my_table as t
ORDER BY a, b, c, x, y;

Is there a more efficient way to get the count of the rows that were eliminated by DISTINCT ON? Something like:

SELECT DISTINCT ON (a, b, c)
    *, COUNT(*)
FROM my_table
ORDER BY a, b, c, count;

Or am I taking the wrong approach to begin with?

Answer 1

Use COUNT(*) as a window function with PARTITION BY:

SELECT DISTINCT ON (a, b, c) *, COUNT(*) OVER (PARTITION BY a, b, c) AS count
FROM my_table;

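You should also add an ORDER BY clause if you care about the values in the remaining columns. Without it, the row that DISTINCT ON keeps for each (a, b, c) group is unpredictable, so the values shown in those columns may be inconsistent between runs. The ORDER BY must start with the DISTINCT ON expressions (a, b, c); any columns after them decide which row of each group is kept.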
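
To make the above concrete, here is a minimal sketch that can be pasted into psql. It assumes a temporary stand-in for my_table holding only the four sample rows from the question, with columns a, b, c, g, h; the numeric column types, the tie-breaker column g, and the alias dup_count are illustrative choices, not part of the original question.

```sql
-- Hypothetical stand-in for my_table, loaded with the sample rows from the question.
CREATE TEMP TABLE my_table (a text, b text, c text, g numeric, h numeric);

INSERT INTO my_table (a, b, c, g, h) VALUES
    ('cat',    'dog',    'bird', 34.223, 54.223),
    ('cat',    'pigeon', 'goat', 23.23,  54.948),
    ('cat',    'dog',    'bird', 17.386, 26.398),
    ('gopher', 'pigeon', 'bird', 23.552, 89.223);

-- Window functions are evaluated before DISTINCT ON discards rows,
-- so dup_count is the size of the whole (a, b, c) group.
-- ORDER BY starts with the DISTINCT ON columns; the trailing g makes
-- the kept row deterministic (here: the row with the smallest g).
SELECT DISTINCT ON (a, b, c)
       *,
       COUNT(*) OVER (PARTITION BY a, b, c) AS dup_count
FROM my_table
ORDER BY a, b, c, g;

-- Result for the sample rows:
--    a    |   b    |  c   |   g    |   h    | dup_count
-- --------+--------+------+--------+--------+-----------
--  cat    | dog    | bird | 17.386 | 26.398 |         2
--  cat    | pigeon | goat |  23.23 | 54.948 |         1
--  gopher | pigeon | bird | 23.552 | 89.223 |         1
```

Note that dup_count is the total number of rows in each group, including the row that is kept. If you want only the number of rows that were eliminated, subtract 1 from the window count.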

solution1
2 ACCEPTED 2018-11-29 19:05:44