
PostgreSQL COUNT(DISTINCT ...) very slow

I have a very simple SQL query:

SELECT COUNT(DISTINCT x) FROM table;

My table has about 1.5 million rows. This query is running pretty slowly; it takes about 7.5s, compared to

 SELECT COUNT(x) FROM table;

which takes about 435ms. Is there any way to change my query to improve performance? I've tried grouping and doing a regular count, as well as putting an index on x; both have the same 7.5s execution time.

You can use this:

SELECT COUNT(*) FROM (SELECT DISTINCT column_name FROM table_name) AS temp;

This is much faster than:

COUNT(DISTINCT column_name)
Here is a benchmark comparing the variants:

-- My default settings (this is basically a single-session machine, so work_mem is pretty high)
SET effective_cache_size='2048MB';
SET work_mem='16MB';

\echo original
EXPLAIN ANALYZE
SELECT
        COUNT (distinct val) as aantal
FROM one
        ;

\echo group by+count(*)
EXPLAIN ANALYZE
SELECT
        distinct val  -- DISTINCT on top of GROUP BY is redundant;
       -- , COUNT(*)  -- it shows up as the extra HashAggregate in the plan
FROM one
GROUP BY val;

\echo with CTE
EXPLAIN ANALYZE
WITH agg AS (
    SELECT distinct val
    FROM one
    GROUP BY val
    )
SELECT COUNT (*) as aantal
FROM agg
        ;

Results:

original
                                                      QUERY PLAN
----------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=36448.06..36448.07 rows=1 width=4) (actual time=1766.472..1766.472 rows=1 loops=1)
   ->  Seq Scan on one  (cost=0.00..32698.45 rows=1499845 width=4) (actual time=31.371..185.914 rows=1499845 loops=1)
 Total runtime: 1766.642 ms
(3 rows)

group by+count(*)
                                                         QUERY PLAN                                                         
----------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=36464.31..36477.31 rows=1300 width=4) (actual time=412.470..412.598 rows=1300 loops=1)
   ->  HashAggregate  (cost=36448.06..36461.06 rows=1300 width=4) (actual time=412.066..412.203 rows=1300 loops=1)
         ->  Seq Scan on one  (cost=0.00..32698.45 rows=1499845 width=4) (actual time=26.134..166.846 rows=1499845 loops=1)
 Total runtime: 412.686 ms
(4 rows)

with CTE
                                                             QUERY PLAN                                                             
------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=36506.56..36506.57 rows=1 width=0) (actual time=408.239..408.239 rows=1 loops=1)
   CTE agg
     ->  HashAggregate  (cost=36464.31..36477.31 rows=1300 width=4) (actual time=407.704..407.847 rows=1300 loops=1)
           ->  HashAggregate  (cost=36448.06..36461.06 rows=1300 width=4) (actual time=407.320..407.467 rows=1300 loops=1)
                 ->  Seq Scan on one  (cost=0.00..32698.45 rows=1499845 width=4) (actual time=24.321..165.256 rows=1499845 loops=1)
   ->  CTE Scan on agg  (cost=0.00..26.00 rows=1300 width=0) (actual time=407.707..408.154 rows=1300 loops=1)
 Total runtime: 408.300 ms
(7 rows)

The same plan as for the CTE could probably also be produced by other methods, e.g. window functions.
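
For example, a window-function variant could look like this (a sketch against the same test table, not benchmarked here): GROUP BY collapses the rows to the distinct values first, COUNT(*) OVER () then counts those groups, and DISTINCT leaves a single result row.

-- Sketch: count the GROUP BY groups with a window function
SELECT DISTINCT COUNT(*) OVER () AS aantal
FROM one
GROUP BY val;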

If your count(DISTINCT x) is significantly slower than count(x), then you can speed this query up by maintaining x value counts in a separate table, for example table_name_x_counts (x integer not null, x_count int not null), updated by triggers. But your write performance will suffer, and if you update multiple x values in a single transaction, you need to do so in some explicit order (e.g. always in ascending x order) to avoid deadlocks. A minimal sketch of the trigger setup follows.
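
A minimal sketch, assuming the base table is table_name with a NOT NULL integer column x as in the example above (INSERT ... ON CONFLICT needs PostgreSQL 9.5+, and EXECUTE FUNCTION needs 11+; use EXECUTE PROCEDURE on older versions):

-- Side table holding one row per distinct x, maintained by triggers.
CREATE TABLE table_name_x_counts (
    x       integer NOT NULL PRIMARY KEY,
    x_count integer NOT NULL
);

CREATE FUNCTION maintain_x_counts() RETURNS trigger AS $$
BEGIN
    IF TG_OP IN ('INSERT', 'UPDATE') THEN
        INSERT INTO table_name_x_counts AS c (x, x_count)
        VALUES (NEW.x, 1)
        ON CONFLICT (x) DO UPDATE SET x_count = c.x_count + 1;
    END IF;
    IF TG_OP IN ('DELETE', 'UPDATE') THEN
        UPDATE table_name_x_counts SET x_count = x_count - 1 WHERE x = OLD.x;
    END IF;
    RETURN NULL;  -- AFTER ROW trigger: the return value is ignored
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER table_name_x_counts_trg
AFTER INSERT OR DELETE OR UPDATE OF x ON table_name
FOR EACH ROW EXECUTE FUNCTION maintain_x_counts();

-- COUNT(DISTINCT x) then becomes a scan of the small counts table:
SELECT COUNT(*) FROM table_name_x_counts WHERE x_count > 0;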

I was searching for the same answer, because at some point I needed a total_count together with distinct values and limit/offset.

That combination is a little tricky: it is usually hard to get a total count in the same query that applies limit/offset. Finally I found a way to do it:

SELECT DISTINCT COUNT(*) OVER() as total_count, * FROM table_name limit 2 offset 0;

Query performance is good, too.

I had a similar problem, but with multiple columns I wanted to count, so I tried these two queries.

Count Distinct:

SELECT
       to_char(action_date, 'YYYY-MM') as "Month",
       count(*) as "Count",
       count(distinct batch_id)
FROM transactions t
         JOIN batches b on t.batch_id = b.id
GROUP BY to_char(action_date, 'YYYY-MM')
ORDER BY to_char(action_date, 'YYYY-MM');

Sub-Query:

WITH batch_counts AS (
    SELECT to_char(action_date, 'YYYY-MM') as "Month",
           COUNT(*)                        as t_count
    FROM transactions t
             JOIN batches b on t.batch_id = b.id
    GROUP BY "Month", b.id  -- one row per batch within each month
)
SELECT "Month",
       SUM(t_count) as "Transactions",
       COUNT(*)     as "Batches"
FROM batch_counts
GROUP BY "Month"
ORDER BY "Month";

I ran both of these queries multiple times on my test data of about 100k rows; the sub-query approach ran in ~90ms on average, while the count distinct approach took about ~200ms on average.

