
Oracle SQL : distinct count with group by cube

My objective:

I want a distinct count for every possible combination of the grouping columns by using GROUP BY CUBE.

The query I use:

select         col_3,
                col_4, 
                col_5, 
                col_6,
                col_7, 
                col_8,
                col_9,
                count(distinct col_1)    count_distinct
from            tmp_test_data
group           by cube (
                         col_3,
                         col_4,
                         col_5,
                         col_6,
                         col_7,
                         col_8,
                         col_9
                        )
order by        col_3,
                col_4,
                col_5,
                col_6,
                col_7,
                col_8,
                col_9
                ;

Example result (the complete table has 100K rows):

col3 col4 col5 col6 col7 col8 col9 count_distinct   
2    3    1    1    1    1    1    12
2    3    1    1    1    1         12
2    3    1    1    1    2    1    1
2    3    1    1    1    2    2    8
2    3    1    1    1    2         9
2    3    1    1    1         1    13
2    3    1    1    1         2    8
2    3    1    1    1              21
...

The problem I am facing: using count(distinct col_1) hurts the performance of the query (~10 minutes), whereas count(col_1) is pretty fast (~10 seconds). Checking the explain plan shows that the distinct count forces 64 'sort group by rollup' operations.
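As a back-of-the-envelope check (a Python sketch of the combinatorics, not of Oracle internals): a CUBE over 7 columns expands to 2^7 = 128 grouping sets, one per subset of the column list, which is what the plan has to cover with its many separate ROLLUP passes.

```python
from itertools import combinations

def cube_grouping_sets(columns):
    """Enumerate the grouping sets that GROUP BY CUBE generates:
    one per subset of the column list, including the empty (grand total) set."""
    return [subset
            for r in range(len(columns) + 1)
            for subset in combinations(columns, r)]

cols = ["col_3", "col_4", "col_5", "col_6", "col_7", "col_8", "col_9"]
sets = cube_grouping_sets(cols)
print(len(sets))  # 2**7 = 128 grouping sets for a 7-column CUBE
```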

Explain plan:

count(col_1)

Plan hash value: 3126999781
--------------------------------------------------------------------------------------
| Id  | Operation            | Name          | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT     |               |   288 |  8640 |  1316   (3)| 00:00:01 |
|   1 |  SORT GROUP BY       |               |   288 |  8640 |  1316   (3)| 00:00:01 |
|   2 |   GENERATE CUBE      |               |   288 |  8640 |  1316   (3)| 00:00:01 |
|   3 |    SORT GROUP BY     |               |   288 |  8640 |  1316   (3)| 00:00:01 |
|   4 |     TABLE ACCESS FULL| TMP_TEST_DATA |   668K|    19M|  1296   (1)| 00:00:01 |
--------------------------------------------------------------------------------------

count(distinct col_1)

Plan hash value: 1939696204

---------------------------------------------------------------------------------------------------------
| Id  | Operation                  | Name                       | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT           |                            |   288 | 29952 | 50234   (4)| 00:00:02 |
|   1 |  TEMP TABLE TRANSFORMATION |                            |       |       |            |          |
|   2 |   LOAD AS SELECT           | SYS_TEMP_0FD9E9C98_2ACFFE0 |       |       |            |          |
|   3 |    TABLE ACCESS FULL       | TMP_TEST_DATA              |   668K|    19M|  1296   (1)| 00:00:01 |
|   4 |   LOAD AS SELECT           | SYS_TEMP_0FD9E9C9A_2ACFFE0 |       |       |            |          |
|   5 |    SORT GROUP BY ROLLUP    |                            |   288 |  8640 |   765   (4)| 00:00:01 |
|   6 |     TABLE ACCESS FULL      | SYS_TEMP_0FD9E9C98_2ACFFE0 |   668K|    19M|   745   (1)| 00:00:01 |
|   7 |   LOAD AS SELECT           | SYS_TEMP_0FD9E9C9A_2ACFFE0 |       |       |            |          |
|   8 |    SORT GROUP BY ROLLUP    |                            |   204 |  6120 |   765   (4)| 00:00:01 |
|   9 |     TABLE ACCESS FULL      | SYS_TEMP_0FD9E9C98_2ACFFE0 |   668K|    19M|   745   (1)| 00:00:01 |
...
| 190 |   LOAD AS SELECT           | SYS_TEMP_0FD9E9C9A_2ACFFE0 |       |       |            |          |
| 191 |    SORT GROUP BY ROLLUP    |                            |     3 |    90 |   765   (4)| 00:00:01 |
| 192 |     TABLE ACCESS FULL      | SYS_TEMP_0FD9E9C98_2ACFFE0 |   668K|    19M|   745   (1)| 00:00:01 |
| 193 |   LOAD AS SELECT           | SYS_TEMP_0FD9E9C9A_2ACFFE0 |       |       |            |          |
| 194 |    SORT GROUP BY ROLLUP    |                            |     2 |    60 |   765   (4)| 00:00:01 |
| 195 |     TABLE ACCESS FULL      | SYS_TEMP_0FD9E9C98_2ACFFE0 |   668K|    19M|   745   (1)| 00:00:01 |
| 196 |   SORT ORDER BY            |                            |   288 | 29952 |     3  (34)| 00:00:01 |
| 197 |    VIEW                    |                            |   288 | 29952 |     2   (0)| 00:00:01 |
| 198 |     TABLE ACCESS FULL      | SYS_TEMP_0FD9E9C9A_2ACFFE0 |   288 |  8640 |     2   (0)| 00:00:01 |
---------------------------------------------------------------------------------------------------------

Is there a way to improve this ?

Short Answer

No, I do not see a way to improve this if you really need exact distinct counts.

If you can live with an approximation, then using the function APPROX_COUNT_DISTINCT might be an option.

select          col_3,
                col_4, 
                col_5, 
                col_6,
                col_7, 
                col_8,
                col_9,
                approx_count_distinct(col_1)    approx_count_distinct
from            t
group           by cube (
                         col_3,
                         col_4,
                         col_5,
                         col_6,
                         col_7,
                         col_8,
                         col_9
                        )
order by        col_3,
                col_4,
                col_5,
                col_6,
                col_7,
                col_8,
                col_9
                ;

Longer Answer

I created this test table

CREATE TABLE t AS
SELECT round(abs(dbms_random.normal)*10,0) AS col_1,
       round(dbms_random.VALUE(2,3),0) AS col_3,
       round(dbms_random.VALUE(3,4),0) AS col_4,
       round(dbms_random.VALUE(1,2),0) AS col_5,
       round(dbms_random.VALUE(1,2),0) AS col_6,
       round(dbms_random.VALUE(1,2),0) AS col_7,
       round(dbms_random.VALUE(1,2),0) AS col_8,
       round(dbms_random.VALUE(1,2),0) AS col_9
  FROM xmltable('1 to 20000');

set the statistics_level to ALL to gather detailed execution plan statistics

ALTER SESSION SET statistics_level = 'ALL';

executed the original query on table t instead of tmp_test_data

select          col_3,
                col_4, 
                col_5, 
                col_6,
                col_7, 
                col_8,
                col_9,
                count(distinct col_1)    count_distinct
from            t
group           by cube (
                         col_3,
                         col_4,
                         col_5,
                         col_6,
                         col_7,
                         col_8,
                         col_9
                        )
order by        col_3,
                col_4,
                col_5,
                col_6,
                col_7,
                col_8,
                col_9
                ;

which produced this result

COL_3      COL_4      COL_5      COL_6      COL_7      COL_8      COL_9 COUNT_DISTINCT
---------- ---------- ---------- ---------- ---------- ---------- ---------- --------------
         2          3          1          1          1          1          1             27
         2          3          1          1          1          1          2             24
         2          3          1          1          1          1                        31
...
                                                                           2             40
                                                                                         41

2.187 rows selected.

and this execution plan.

---------------------------------------------------------------------------------------------------
| Id  | Operation                                |Starts | E-Rows | A-Rows |   A-Time   | Buffers |
---------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                         |     1 |        |   2187 |00:00:01.85 |      87 |
|   1 |  TEMP TABLE TRANSFORMATION               |     1 |        |   2187 |00:00:01.85 |      87 |
|   2 |   LOAD AS SELECT (CURSOR DURATION MEMORY)|     1 |        |      0 |00:00:00.07 |      86 |
|   3 |    HASH GROUP BY                         |     1 |    464 |   3224 |00:00:00.02 |      85 |
|   4 |     TABLE ACCESS FULL                    |     1 |  20000 |  20000 |00:00:00.01 |      85 |
|   5 |   SORT ORDER BY                          |     1 |     16 |   2187 |00:00:01.77 |       0 |
|   6 |    VIEW                                  |     1 |    408 |   2187 |00:00:01.75 |       0 |
|   7 |     VIEW                                 |     1 |    408 |   2187 |00:00:01.73 |       0 |
|   8 |      UNION-ALL                           |     1 |        |   2187 |00:00:01.72 |       0 |
|   9 |       SORT GROUP BY ROLLUP               |     1 |     16 |    192 |00:00:00.03 |       0 |
...
| 133 |       SORT GROUP BY ROLLUP               |     1 |      3 |      6 |00:00:00.03 |       0 |
| 134 |        TABLE ACCESS FULL                 |     1 |    464 |   3224 |00:00:00.01 |       0 |
| 135 |       SORT GROUP BY ROLLUP               |     1 |      2 |      3 |00:00:00.02 |       0 |
| 136 |        TABLE ACCESS FULL                 |     1 |    464 |   3224 |00:00:00.01 |       0 |
---------------------------------------------------------------------------------------------------

The interesting columns are A-Rows (actual number of rows), A-Time (actual time spent) and Buffers (number of logical reads). We see that the query took 1.85 seconds and 87 logical I/Os. The 64 SORT GROUP BY ROLLUP operations together took 1.75 seconds, about 0.03 seconds per operation. Oracle needs to evaluate the number of distinct values of col_1 for each group-by combination. There is no shortcut as with COUNT(col_1). That's why it is costly.
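The reason there is no shortcut can be shown with a toy Python illustration (assumed data, not from the tables above): plain counts of subgroups are additive, so a parent total can be derived from already-computed subtotals, while distinct counts are not additive and force a re-aggregation of the raw values for every grouping set.

```python
# Two subgroups of col_1 values (hypothetical data for illustration).
group_a = ["x", "y", "y"]   # e.g. the rows where col_9 = 1
group_b = ["y", "z"]        # e.g. the rows where col_9 = 2

# COUNT is additive: the parent total is just the sum of the subtotals.
assert len(group_a) + len(group_b) == len(group_a + group_b)  # 3 + 2 == 5

# COUNT(DISTINCT) is NOT additive: "y" appears in both subgroups,
# so summing the distinct subtotals overcounts the parent.
print(len(set(group_a)) + len(set(group_b)))  # 4
print(len(set(group_a + group_b)))            # 3
```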

However, we could easily come up with an alternative query

WITH
   combi AS (
      SELECT col_3, 
             col_4, 
             col_5, 
             col_6, 
             col_7, 
             col_8, 
             col_9
        FROM t
       GROUP BY CUBE (
                   col_3, 
                   col_4, 
                   col_5, 
                   col_6, 
                   col_7, 
                   col_8, 
                   col_9
                )
   ),
   fullset AS (
      SELECT t.col_1,
             combi.col_3, 
             combi.col_4, 
             combi.col_5, 
             combi.col_6, 
             combi.col_7, 
             combi.col_8, 
             combi.col_9
        FROM combi
        JOIN t
          ON     (t.col_3 = combi.col_3 or combi.col_3 is null)
             AND (t.col_4 = combi.col_4 or combi.col_4 is null)
             AND (t.col_5 = combi.col_5 or combi.col_5 is null)
             AND (t.col_6 = combi.col_6 or combi.col_6 is null)
             AND (t.col_7 = combi.col_7 or combi.col_7 is null)
             AND (t.col_8 = combi.col_8 or combi.col_8 is null)
             AND (t.col_9 = combi.col_9 or combi.col_9 is null)
   )
SELECT col_3, 
       col_4, 
       col_5, 
       col_6, 
       col_7, 
       col_8, 
       col_9,
       COUNT(DISTINCT col_1) as count_distinct_col_1
  FROM fullset
 GROUP BY col_3, 
          col_4, 
          col_5, 
          col_6, 
          col_7, 
          col_8, 
          col_9
 ORDER BY col_3, 
          col_4, 
          col_5, 
          col_6, 
          col_7, 
          col_8, 
          col_9;

producing the same result

COL_3      COL_4      COL_5      COL_6      COL_7      COL_8      COL_9 COUNT_DISTINCT_COL_1
---------- ---------- ---------- ---------- ---------- ---------- ---------- --------------------
         2          3          1          1          1          1          1                   27
         2          3          1          1          1          1          2                   24
         2          3          1          1          1          1                              31
...
                                                                           2                   40
                                                                                               41

2.187 rows selected.

with fewer lines in the execution plan.

-------------------------------------------------------------------------------------------------
| Id  | Operation                 | Name      | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
-------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT          |           |      1 |        |   2187 |00:00:41.58 |     185K|
|   1 |  SORT GROUP BY            |           |      1 |     16 |   2187 |00:00:41.58 |     185K|
|   2 |   VIEW                    | VM_NWVW_1 |      1 |    464 |  67812 |00:00:41.54 |     185K|
|   3 |    HASH GROUP BY          |           |      1 |    464 |  67812 |00:00:41.54 |     185K|
|   4 |     NESTED LOOPS          |           |      1 |   2500 |   2560K|00:00:31.77 |     185K|
|   5 |      VIEW                 |           |      1 |     16 |   2187 |00:00:00.37 |      85 |
|   6 |       SORT GROUP BY       |           |      1 |     16 |   2187 |00:00:00.36 |      85 |
|   7 |        GENERATE CUBE      |           |      1 |     16 |  16384 |00:00:00.27 |      85 |
|   8 |         SORT GROUP BY     |           |      1 |     16 |    128 |00:00:00.20 |      85 |
|   9 |          TABLE ACCESS FULL| T         |      1 |  20000 |  20000 |00:00:00.10 |      85 |
|* 10 |      TABLE ACCESS FULL    | T         |   2187 |    156 |   2560K|00:00:13.09 |     185K|
-------------------------------------------------------------------------------------------------

Let's look at operation 5. We produce all 2187 combinations within 0.37 seconds and need 85 logical I/Os to read the full table t. Then we access the full table t again for each of these 2187 combinations (see operations 4 and 10). The complete join takes 31.77 seconds. The remaining group-by operations take 9.77 seconds and the final sort just 0.04 seconds.

This alternative query looks simple, but it is much slower due to the additional I/O required to join each CUBE combination from combi back to the table t.
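The row-visit arithmetic behind that nested loop can be sketched as follows (numbers taken from the plan above; Starts = 2187 on operation 10 means one full scan of t per combination):

```python
# Rough cost model of the alternative query's nested loop join.
combinations_produced = 2_187   # A-Rows of operation 5: the CUBE combinations
rows_in_t = 20_000              # rows read by each full scan of t (operation 10)

rows_visited = combinations_produced * rows_in_t
print(rows_visited)  # 43740000 row visits before the join predicates filter
```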

The original query is better in terms of I/O and runtime. Granted, the execution plan looks extensive, but it is efficient. In the end, the DISTINCT in COUNT(DISTINCT col_1) is driving the complexity. It is just one word, but a completely different algorithm. Hence, I do not see how to improve the original query if accurate results are important. However, if an approximation is good enough, then the function APPROX_COUNT_DISTINCT might be an option.

select          col_3,
                col_4, 
                col_5, 
                col_6,
                col_7, 
                col_8,
                col_9,
                approx_count_distinct(col_1)    approx_count_distinct
from            t
group           by cube (
                         col_3,
                         col_4,
                         col_5,
                         col_6,
                         col_7,
                         col_8,
                         col_9
                        )
order by        col_3,
                col_4,
                col_5,
                col_6,
                col_7,
                col_8,
                col_9
                ;

The results are similar

COL_3      COL_4      COL_5      COL_6      COL_7      COL_8      COL_9 APPROX_COUNT_DISTINCT
---------- ---------- ---------- ---------- ---------- ---------- ---------- ---------------------
         2          3          1          1          1          1          1                    27
         2          3          1          1          1          1          2                    24
         2          3          1          1          1          1                               31
...
                                                                           2                    40
                                                                                                41

2.187 rows selected.

but the execution plan is even more complex.

----------------------------------------------------------------------------------------------------
| Id  | Operation                                | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
----------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                         |      1 |        |   2187 |00:00:09.88 |      87 |
|   1 |  TEMP TABLE TRANSFORMATION               |      1 |        |   2187 |00:00:09.88 |      87 |
|   2 |   LOAD AS SELECT (CURSOR DURATION MEMORY)|      1 |        |      0 |00:00:00.33 |      86 |
|   3 |    TABLE ACCESS FULL                     |      1 |  20000 |  20000 |00:00:00.08 |      85 |
|   4 |   LOAD AS SELECT (CURSOR DURATION MEMORY)|      1 |        |      0 |00:00:00.16 |       0 |
|   5 |    SORT GROUP BY ROLLUP APPROX           |      1 |     16 |    192 |00:00:00.16 |       0 |
|   6 |     TABLE ACCESS FULL                    |      1 |  20000 |  20000 |00:00:00.07 |       0 |
...
| 190 |   LOAD AS SELECT (CURSOR DURATION MEMORY)|      1 |        |      0 |00:00:00.14 |       0 |
| 191 |    SORT GROUP BY ROLLUP APPROX           |      1 |      3 |      6 |00:00:00.14 |       0 |
| 192 |     TABLE ACCESS FULL                    |      1 |  20000 |  20000 |00:00:00.07 |       0 |
| 193 |   LOAD AS SELECT (CURSOR DURATION MEMORY)|      1 |        |      0 |00:00:00.14 |       0 |
| 194 |    SORT GROUP BY ROLLUP APPROX           |      1 |      2 |      3 |00:00:00.14 |       0 |
| 195 |     TABLE ACCESS FULL                    |      1 |  20000 |  20000 |00:00:00.07 |       0 |
| 196 |   SORT ORDER BY                          |      1 |     16 |   2187 |00:00:00.01 |       0 |
| 197 |    VIEW                                  |      1 |     16 |   2187 |00:00:00.01 |       0 |
| 198 |     TABLE ACCESS FULL                    |      1 |     16 |   2187 |00:00:00.01 |       0 |
----------------------------------------------------------------------------------------------------

and on this small data set the query is slower than the original one. It is expected to be faster on large data sets, though. Therefore, I suggest trying APPROX_COUNT_DISTINCT if 100% accuracy is not required.
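Why can an approximate distinct count scale better? A common technique is to replace the exact value set with a small fixed-size sketch. The toy Python estimator below is a drastically simplified HyperLogLog-style sketch (Oracle does not document the internals of APPROX_COUNT_DISTINCT, so this is only an illustration of the general idea, with assumed register counts and constants): two sketches merge with an element-wise max, so subtotal sketches can be combined into parent totals without re-reading any rows, unlike exact COUNT(DISTINCT) state.

```python
import hashlib

M = 64  # number of registers (real implementations tune this for accuracy)

def add(registers, value):
    """Record one value in the sketch via its hash's trailing-zero run."""
    h = int(hashlib.sha256(str(value).encode()).hexdigest(), 16)
    idx, rest = h % M, h // M
    rank = 1
    while rest % 2 == 0 and rank < 64:  # count trailing zero bits
        rank += 1
        rest //= 2
    registers[idx] = max(registers[idx], rank)

def merge(a, b):
    """Combine two group sketches into the sketch of their union."""
    return [max(x, y) for x, y in zip(a, b)]

def estimate(registers):
    """Harmonic-mean estimate of the distinct count."""
    alpha = 0.709  # bias-correction constant for M = 64
    return alpha * M * M / sum(2.0 ** -r for r in registers)

sketch_a, sketch_b = [0] * M, [0] * M
for v in range(800):         # "group A" sees values 0..799
    add(sketch_a, v)
for v in range(400, 1200):   # "group B" sees values 400..1199
    add(sketch_b, v)

# Merging the two group sketches estimates the parent's distinct count
# (the union holds 1200 distinct values) without touching the raw rows.
print(round(estimate(merge(sketch_a, sketch_b))))  # roughly 1200, not exact
```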

Runtime Overhead of STATISTICS_LEVEL = ALL

To get the actual number of rows and the actual time spent into the execution plan, I ran all queries with statistics_level set to ALL. This led to a significant performance overhead (that's expected, see also Jonathan Lewis' blog about gather_plan_statistics). When setting statistics_level to TYPICAL, all queries run faster. Here are the runtimes in seconds, including the time to print the result on the client:

Query                  Runtime with 'ALL'  Runtime with 'TYPICAL' 
----------------       ------------------  ----------------------
Original (good)                     2.615                   0.977
Alternative (bad)                  41.773                   4.991
Approx_Count_Distinct              10.600                   1.113
