My objective:
I want to compute a distinct count for every possible combination of columns by using a GROUP BY CUBE.
The query I use:
select col_3,
col_4,
col_5,
col_6,
col_7,
col_8,
col_9,
count(distinct col_1) count_distinct
from tmp_test_data
group by cube (
col_3,
col_4,
col_5,
col_6,
col_7,
col_8,
col_9
)
order by col_3,
col_4,
col_5,
col_6,
col_7,
col_8,
col_9
;
Example data (the complete table has 100K rows):
col_3 col_4 col_5 col_6 col_7 col_8 col_9 count_distinct
    2     3     1     1     1     1     1             12
    2     3     1     1     1     1                   12
    2     3     1     1     1     2     1              1
    2     3     1     1     1     2     2              8
    2     3     1     1     1     2                    9
    2     3     1     1     1           1             13
    2     3     1     1     1           2              8
    2     3     1     1     1                         21
...
The problem I am facing: using count(distinct col_1) hurts the performance of the query (~10 minutes), whereas count(col_1) is pretty fast (~10 seconds). Checking the explain plan shows that the distinct count forces 64 SORT GROUP BY ROLLUP operations.
Explain plan:
count(col_1)
Plan hash value: 3126999781
------------------------------------------------------------------------------------
| Id  | Operation             | Name          | Rows | Bytes | Cost (%CPU)| Time     |
------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT      |               |  288 |  8640 |  1316  (3)| 00:00:01 |
|   1 |  SORT GROUP BY        |               |  288 |  8640 |  1316  (3)| 00:00:01 |
|   2 |   GENERATE CUBE       |               |  288 |  8640 |  1316  (3)| 00:00:01 |
|   3 |    SORT GROUP BY      |               |  288 |  8640 |  1316  (3)| 00:00:01 |
|   4 |     TABLE ACCESS FULL | TMP_TEST_DATA | 668K |   19M |  1296  (1)| 00:00:01 |
------------------------------------------------------------------------------------
count(distinct col_1)
Plan hash value: 1939696204
---------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
---------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 288 | 29952 | 50234 (4)| 00:00:02 |
| 1 | TEMP TABLE TRANSFORMATION | | | | | |
| 2 | LOAD AS SELECT | SYS_TEMP_0FD9E9C98_2ACFFE0 | | | | |
| 3 | TABLE ACCESS FULL | TMP_TEST_DATA | 668K| 19M| 1296 (1)| 00:00:01 |
| 4 | LOAD AS SELECT | SYS_TEMP_0FD9E9C9A_2ACFFE0 | | | | |
| 5 | SORT GROUP BY ROLLUP | | 288 | 8640 | 765 (4)| 00:00:01 |
| 6 | TABLE ACCESS FULL | SYS_TEMP_0FD9E9C98_2ACFFE0 | 668K| 19M| 745 (1)| 00:00:01 |
| 7 | LOAD AS SELECT | SYS_TEMP_0FD9E9C9A_2ACFFE0 | | | | |
| 8 | SORT GROUP BY ROLLUP | | 204 | 6120 | 765 (4)| 00:00:01 |
| 9 | TABLE ACCESS FULL | SYS_TEMP_0FD9E9C98_2ACFFE0 | 668K| 19M| 745 (1)| 00:00:01 |
...
| 190 | LOAD AS SELECT | SYS_TEMP_0FD9E9C9A_2ACFFE0 | | | | |
| 191 | SORT GROUP BY ROLLUP | | 3 | 90 | 765 (4)| 00:00:01 |
| 192 | TABLE ACCESS FULL | SYS_TEMP_0FD9E9C98_2ACFFE0 | 668K| 19M| 745 (1)| 00:00:01 |
| 193 | LOAD AS SELECT | SYS_TEMP_0FD9E9C9A_2ACFFE0 | | | | |
| 194 | SORT GROUP BY ROLLUP | | 2 | 60 | 765 (4)| 00:00:01 |
| 195 | TABLE ACCESS FULL | SYS_TEMP_0FD9E9C98_2ACFFE0 | 668K| 19M| 745 (1)| 00:00:01 |
| 196 | SORT ORDER BY | | 288 | 29952 | 3 (34)| 00:00:01 |
| 197 | VIEW | | 288 | 29952 | 2 (0)| 00:00:01 |
| 198 | TABLE ACCESS FULL | SYS_TEMP_0FD9E9C9A_2ACFFE0 | 288 | 8640 | 2 (0)| 00:00:01 |
---------------------------------------------------------------------------------------------------------
Is there a way to improve this?
No, I do not see a way to improve this if you really need exact count_distinct results.
If you can live with an approximation, then the function APPROX_COUNT_DISTINCT might be an option.
select col_3,
col_4,
col_5,
col_6,
col_7,
col_8,
col_9,
approx_count_distinct(col_1) approx_count_distinct
from t
group by cube (
col_3,
col_4,
col_5,
col_6,
col_7,
col_8,
col_9
)
order by col_3,
col_4,
col_5,
col_6,
col_7,
col_8,
col_9
;
I created this test table:
CREATE TABLE t AS
SELECT round(abs(dbms_random.normal)*10,0) AS col_1,
round(dbms_random.VALUE(2,3),0) AS col_3,
round(dbms_random.VALUE(3,4),0) AS col_4,
round(dbms_random.VALUE(1,2),0) AS col_5,
round(dbms_random.VALUE(1,2),0) AS col_6,
round(dbms_random.VALUE(1,2),0) AS col_7,
round(dbms_random.VALUE(1,2),0) AS col_8,
round(dbms_random.VALUE(1,2),0) AS col_9
FROM xmltable('1 to 20000');
Then I set statistics_level to ALL to gather detailed execution plan statistics:
ALTER SESSION SET statistics_level = 'ALL';
and executed the original query on table t instead of tmp_test_data:
select col_3,
col_4,
col_5,
col_6,
col_7,
col_8,
col_9,
count(distinct col_1) count_distinct
from t
group by cube (
col_3,
col_4,
col_5,
col_6,
col_7,
col_8,
col_9
)
order by col_3,
col_4,
col_5,
col_6,
col_7,
col_8,
col_9
;
It produced this result
     COL_3      COL_4      COL_5      COL_6      COL_7      COL_8      COL_9 COUNT_DISTINCT
---------- ---------- ---------- ---------- ---------- ---------- ---------- --------------
         2          3          1          1          1          1          1             27
         2          3          1          1          1          1          2             24
         2          3          1          1          1          1                        31
...
         2                                                                               40
                                                                                         41

2187 rows selected.
and this execution plan.
---------------------------------------------------------------------------------------------------
| Id  | Operation                               | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
---------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                        |      1 |        |   2187 |00:00:01.85 |      87 |
|   1 |  TEMP TABLE TRANSFORMATION              |      1 |        |   2187 |00:00:01.85 |      87 |
|   2 |   LOAD AS SELECT (CURSOR DURATION MEMORY)|     1 |        |      0 |00:00:00.07 |      86 |
|   3 |    HASH GROUP BY                        |      1 |    464 |   3224 |00:00:00.02 |      85 |
|   4 |     TABLE ACCESS FULL                   |      1 |  20000 |  20000 |00:00:00.01 |      85 |
|   5 |   SORT ORDER BY                         |      1 |     16 |   2187 |00:00:01.77 |       0 |
|   6 |    VIEW                                 |      1 |    408 |   2187 |00:00:01.75 |       0 |
|   7 |     VIEW                                |      1 |    408 |   2187 |00:00:01.73 |       0 |
|   8 |      UNION-ALL                          |      1 |        |   2187 |00:00:01.72 |       0 |
|   9 |       SORT GROUP BY ROLLUP              |      1 |     16 |    192 |00:00:00.03 |       0 |
...
| 133 |       SORT GROUP BY ROLLUP              |      1 |      3 |      6 |00:00:00.03 |       0 |
| 134 |        TABLE ACCESS FULL                |      1 |    464 |   3224 |00:00:00.01 |       0 |
| 135 |       SORT GROUP BY ROLLUP              |      1 |      2 |      3 |00:00:00.02 |       0 |
| 136 |        TABLE ACCESS FULL                |      1 |    464 |   3224 |00:00:00.01 |       0 |
---------------------------------------------------------------------------------------------------
The interesting columns are A-Rows (actual number of rows), A-Time (actual time spent) and Buffers (number of logical reads). We see that the query took 1.85 seconds for 87 logical I/Os. All 64 SORT GROUP BY ROLLUP operations together took 1.75 seconds, which is about 0.03 seconds per operation. Oracle needs to evaluate the number of distinct values of col_1 for each group-by combination. There is no shortcut as with COUNT(col_1). That's why it is costly.
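The missing shortcut can be illustrated outside the database. The following Python sketch (an illustration of the principle, not of Oracle internals) shows that plain counts of subgroups can simply be added to get a rollup total, while distinct counts cannot, because the same value may appear in several subgroups; that is why every grouping level has to be computed from the detail rows again.

```python
# Why COUNT can be rolled up from subtotals while COUNT(DISTINCT ...) cannot.
rows = [  # (group key, value) pairs, e.g. (col_8, col_1)
    (1, 'a'), (1, 'b'), (2, 'b'), (2, 'c'),
]

# Subtotals per group
count_g1 = sum(1 for g, _ in rows if g == 1)        # plain count of group 1
count_g2 = sum(1 for g, _ in rows if g == 2)        # plain count of group 2
distinct_g1 = len({v for g, v in rows if g == 1})   # {'a', 'b'} -> 2
distinct_g2 = len({v for g, v in rows if g == 2})   # {'b', 'c'} -> 2

# Rollup total: plain counts are additive ...
assert count_g1 + count_g2 == len(rows)

# ... but distinct counts are not: 'b' appears in both groups.
total_distinct = len({v for _, v in rows})          # {'a', 'b', 'c'} -> 3
assert distinct_g1 + distinct_g2 != total_distinct
```

So a rollup of COUNT can reuse the subtotals of the finer grouping, whereas a rollup of COUNT(DISTINCT ...) has to go back to the detail rows.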
However, we could easily come up with an alternative query
WITH
combi AS (
SELECT col_3,
col_4,
col_5,
col_6,
col_7,
col_8,
col_9
FROM t
GROUP BY CUBE (
col_3,
col_4,
col_5,
col_6,
col_7,
col_8,
col_9
)
),
fullset AS (
SELECT t.col_1,
combi.col_3,
combi.col_4,
combi.col_5,
combi.col_6,
combi.col_7,
combi.col_8,
combi.col_9
FROM combi
JOIN t
ON (t.col_3 = combi.col_3 or combi.col_3 is null)
AND (t.col_4 = combi.col_4 or combi.col_4 is null)
AND (t.col_5 = combi.col_5 or combi.col_5 is null)
AND (t.col_6 = combi.col_6 or combi.col_6 is null)
AND (t.col_7 = combi.col_7 or combi.col_7 is null)
AND (t.col_8 = combi.col_8 or combi.col_8 is null)
AND (t.col_9 = combi.col_9 or combi.col_9 is null)
)
SELECT col_3,
col_4,
col_5,
col_6,
col_7,
col_8,
col_9,
COUNT(DISTINCT col_1) as count_distinct_col_1
FROM fullset
GROUP BY col_3,
col_4,
col_5,
col_6,
col_7,
col_8,
col_9
ORDER BY col_3,
col_4,
col_5,
col_6,
col_7,
col_8,
col_9;
producing the same result
     COL_3      COL_4      COL_5      COL_6      COL_7      COL_8      COL_9 COUNT_DISTINCT_COL_1
---------- ---------- ---------- ---------- ---------- ---------- ---------- --------------------
         2          3          1          1          1          1          1                   27
         2          3          1          1          1          1          2                   24
         2          3          1          1          1          1                              31
...
         2                                                                                     40
                                                                                               41

2187 rows selected.
with fewer lines in the execution plan.
-------------------------------------------------------------------------------------------------
| Id  | Operation                 | Name      | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
-------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT          |           |      1 |        |   2187 |00:00:41.58 |    185K |
|   1 |  SORT GROUP BY            |           |      1 |     16 |   2187 |00:00:41.58 |    185K |
|   2 |   VIEW                    | VM_NWVW_1 |      1 |    464 |  67812 |00:00:41.54 |    185K |
|   3 |    HASH GROUP BY          |           |      1 |    464 |  67812 |00:00:41.54 |    185K |
|   4 |     NESTED LOOPS          |           |      1 |   2500 |  2560K |00:00:31.77 |    185K |
|   5 |      VIEW                 |           |      1 |     16 |   2187 |00:00:00.37 |      85 |
|   6 |       SORT GROUP BY       |           |      1 |     16 |   2187 |00:00:00.36 |      85 |
|   7 |        GENERATE CUBE      |           |      1 |     16 |  16384 |00:00:00.27 |      85 |
|   8 |         SORT GROUP BY     |           |      1 |     16 |    128 |00:00:00.20 |      85 |
|   9 |          TABLE ACCESS FULL| T         |      1 |  20000 |  20000 |00:00:00.10 |      85 |
|* 10 |      TABLE ACCESS FULL    | T         |   2187 |    156 |  2560K |00:00:13.09 |    185K |
-------------------------------------------------------------------------------------------------
Let's look at operation 5. We produce all 2187 combinations within 0.37 seconds and need 85 logical I/Os to read the full table t. Then we access the full table t again for each of these 2187 combinations (see operations 4 and 10). The complete join takes 31.77 seconds, the remaining group-by operations take 9.77 seconds, and the final sort just 0.04 seconds.
This alternative query looks simple, but it is much slower due to the additional I/O operations necessary for the join of the named queries combi and fullset.
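The ~2560K A-Rows at the NESTED LOOPS step can be sanity-checked with a quick back-of-the-envelope calculation (my own arithmetic, not part of the original queries): for each of the 7 cube columns, a detail row matches both the combination carrying its own value and the one carrying NULL, so every row of t joins to 2**7 = 128 combinations.

```python
# Back-of-the-envelope check of the join blow-up seen in the plan:
# each row matches 2 choices (own value or NULL) per cube column.
rows_in_t = 20_000
cube_columns = 7
joined_rows = rows_in_t * 2 ** cube_columns
print(joined_rows)  # 2560000, matching the ~2560K A-Rows of the join
```
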
The original query is better in terms of I/O and runtime. Granted, the execution plan looks extensive, but it is efficient. In the end, the DISTINCT in COUNT(DISTINCT col_1) is what drives the complexity. It's just one word, but a completely different algorithm. Hence, I do not see how to improve the original query if accurate results are important. However, if an approximation is good enough, then the function APPROX_COUNT_DISTINCT might be an option.
select col_3,
col_4,
col_5,
col_6,
col_7,
col_8,
col_9,
approx_count_distinct(col_1) approx_count_distinct
from t
group by cube (
col_3,
col_4,
col_5,
col_6,
col_7,
col_8,
col_9
)
order by col_3,
col_4,
col_5,
col_6,
col_7,
col_8,
col_9
;
The results are similar
     COL_3      COL_4      COL_5      COL_6      COL_7      COL_8      COL_9 APPROX_COUNT_DISTINCT
---------- ---------- ---------- ---------- ---------- ---------- ---------- ---------------------
         2          3          1          1          1          1          1                    27
         2          3          1          1          1          1          2                    24
         2          3          1          1          1          1                               31
...
         2                                                                                      40
                                                                                                41

2187 rows selected.
but the execution plan is even more complex
----------------------------------------------------------------------------------------------------
| Id  | Operation                               | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
----------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                        |      1 |        |   2187 |00:00:09.88 |      87 |
|   1 |  TEMP TABLE TRANSFORMATION              |      1 |        |   2187 |00:00:09.88 |      87 |
|   2 |   LOAD AS SELECT (CURSOR DURATION MEMORY)|     1 |        |      0 |00:00:00.33 |      86 |
|   3 |    TABLE ACCESS FULL                    |      1 |  20000 |  20000 |00:00:00.08 |      85 |
|   4 |   LOAD AS SELECT (CURSOR DURATION MEMORY)|     1 |        |      0 |00:00:00.16 |       0 |
|   5 |    SORT GROUP BY ROLLUP APPROX          |      1 |     16 |    192 |00:00:00.16 |       0 |
|   6 |     TABLE ACCESS FULL                   |      1 |  20000 |  20000 |00:00:00.07 |       0 |
...
| 190 |   LOAD AS SELECT (CURSOR DURATION MEMORY)|     1 |        |      0 |00:00:00.14 |       0 |
| 191 |    SORT GROUP BY ROLLUP APPROX          |      1 |      3 |      6 |00:00:00.14 |       0 |
| 192 |     TABLE ACCESS FULL                   |      1 |  20000 |  20000 |00:00:00.07 |       0 |
| 193 |   LOAD AS SELECT (CURSOR DURATION MEMORY)|     1 |        |      0 |00:00:00.14 |       0 |
| 194 |    SORT GROUP BY ROLLUP APPROX          |      1 |      2 |      3 |00:00:00.14 |       0 |
| 195 |     TABLE ACCESS FULL                   |      1 |  20000 |  20000 |00:00:00.07 |       0 |
| 196 |  SORT ORDER BY                          |      1 |     16 |   2187 |00:00:00.01 |       0 |
| 197 |   VIEW                                  |      1 |     16 |   2187 |00:00:00.01 |       0 |
| 198 |    TABLE ACCESS FULL                    |      1 |     16 |   2187 |00:00:00.01 |       0 |
----------------------------------------------------------------------------------------------------
and the query is slower than the original one on this small data set. It is expected to be faster on large data sets. Therefore, I suggest trying APPROX_COUNT_DISTINCT if 100% accuracy is not required.
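Why an approximate distinct count rolls up cheaply can be illustrated with a toy estimator. The sketch below is a simplified Flajolet-Martin style counter written by me for illustration; Oracle's APPROX_COUNT_DISTINCT uses a HyperLogLog-based algorithm, but it shares the key property shown here: the per-group state is a tiny, mergeable summary, so subgroup sketches can be combined without rescanning the detail rows.

```python
import hashlib

def trailing_zeros(n: int) -> int:
    """Number of trailing zero bits of n (64 for n == 0)."""
    return (n & -n).bit_length() - 1 if n else 64

def sketch(values) -> int:
    """Toy Flajolet-Martin sketch: the max trailing-zero count
    over the hashes of the values. Duplicates change nothing."""
    def h(v):
        return int.from_bytes(hashlib.sha256(str(v).encode()).digest()[:8], 'big')
    return max((trailing_zeros(h(v)) for v in values), default=0)

group_a = sketch(range(0, 600))
group_b = sketch(range(400, 1000))   # overlaps with group_a

# Merging two sketches is a cheap max() and equals the sketch of the union,
# so a rollup never needs to revisit the detail rows.
assert max(group_a, group_b) == sketch(range(0, 1000))

# The (very rough) cardinality estimate is 2**sketch.
estimate = 2 ** max(group_a, group_b)
```

An exact COUNT(DISTINCT ...) must keep every distinct value per group; the sketch keeps a few bytes per group, which is why the approximate rollups in the plan above are so cheap individually.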
To get the actual number of rows and the actual time spent in the execution plan, I ran all queries with statistics_level set to ALL. This leads to a significant performance overhead (that's expected; see also Jonathan Lewis' blog about gather_plan_statistics). With statistics_level set to TYPICAL, all queries run faster. Here are the runtimes in seconds, including the time for printing the result on the client:
Query                  Runtime with 'ALL'  Runtime with 'TYPICAL'
---------------------  ------------------  ----------------------
Original (good)                     2.615                   0.977
Alternative (bad)                  41.773                   4.991
Approx_Count_Distinct              10.600                   1.113
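For reference, the row-source statistics (Starts, A-Rows, A-Time, Buffers) shown in the plans above can be fetched right after running a query in the same session with DBMS_XPLAN; this is a standard Oracle technique, shown here as a sketch:

```sql
-- Run the query first with statistics_level = ALL (or the
-- /*+ gather_plan_statistics */ hint), then in the same session:
SELECT *
FROM table(dbms_xplan.display_cursor(format => 'ALLSTATS LAST'));
```
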