Oracle SQL: distinct count with group by cube

My goal:

I want a distinct count for every possible combination of the grouping columns, using group by cube.

The query I use:

select         col_3,
                col_4, 
                col_5, 
                col_6,
                col_7, 
                col_8,
                col_9,
                count(distinct col_1)    count_distinct
from            tmp_test_data
group           by cube (
                         col_3,
                         col_4,
                         col_5,
                         col_6,
                         col_7,
                         col_8,
                         col_9
                        )
order by        col_3,
                col_4,
                col_5,
                col_6,
                col_7,
                col_8,
                col_9
                ;

Sample data (the full table has 100k rows):

col3 col4 col5 col6 col7 col8 col9 count_distinct   
2    3    1    1    1    1    1    12
2    3    1    1    1    1         12
2    3    1    1    1    2    1    1
2    3    1    1    1    2    2    8
2    3    1    1    1    2         9
2    3    1    1    1         1    13
2    3    1    1    1         2    8
2    3    1    1    1              21
...

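The problem I am facing: using count(distinct col_1) hurts the performance of the query (about 10 minutes), whereas count(col_1) is fairly fast (about 10 seconds). Looking at the explain plan, it seems that the distinct count forces 64 "group by rollup" operations.

Explain plans: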

COUNT(col_1)

Plan hash value: 3126999781
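
--------------------------------------------------------------------------------------
| Id  | Operation            | Name          | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT     |               |   288 |  8640 |  1316   (3)| 00:00:01 |
|   1 |  SORT GROUP BY       |               |   288 |  8640 |  1316   (3)| 00:00:01 |
|   2 |   GENERATE CUBE      |               |   288 |  8640 |  1316   (3)| 00:00:01 |
|   3 |    SORT GROUP BY     |               |   288 |  8640 |  1316   (3)| 00:00:01 |
|   4 |     TABLE ACCESS FULL| TMP_TEST_DATA |   668K|    19M|  1296   (1)| 00:00:01 |
--------------------------------------------------------------------------------------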

COUNT(DISTINCT col_1)

Plan hash value: 1939696204

---------------------------------------------------------------------------------------------------------
| Id  | Operation                  | Name                       | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT           |                            |   288 | 29952 | 50234   (4)| 00:00:02 |
|   1 |  TEMP TABLE TRANSFORMATION |                            |       |       |            |          |
|   2 |   LOAD AS SELECT           | SYS_TEMP_0FD9E9C98_2ACFFE0 |       |       |            |          |
|   3 |    TABLE ACCESS FULL       | TMP_TEST_DATA              |   668K|    19M|  1296   (1)| 00:00:01 |
|   4 |   LOAD AS SELECT           | SYS_TEMP_0FD9E9C9A_2ACFFE0 |       |       |            |          |
|   5 |    SORT GROUP BY ROLLUP    |                            |   288 |  8640 |   765   (4)| 00:00:01 |
|   6 |     TABLE ACCESS FULL      | SYS_TEMP_0FD9E9C98_2ACFFE0 |   668K|    19M|   745   (1)| 00:00:01 |
|   7 |   LOAD AS SELECT           | SYS_TEMP_0FD9E9C9A_2ACFFE0 |       |       |            |          |
|   8 |    SORT GROUP BY ROLLUP    |                            |   204 |  6120 |   765   (4)| 00:00:01 |
|   9 |     TABLE ACCESS FULL      | SYS_TEMP_0FD9E9C98_2ACFFE0 |   668K|    19M|   745   (1)| 00:00:01 |
...
| 190 |   LOAD AS SELECT           | SYS_TEMP_0FD9E9C9A_2ACFFE0 |       |       |            |          |
| 191 |    SORT GROUP BY ROLLUP    |                            |     3 |    90 |   765   (4)| 00:00:01 |
| 192 |     TABLE ACCESS FULL      | SYS_TEMP_0FD9E9C98_2ACFFE0 |   668K|    19M|   745   (1)| 00:00:01 |
| 193 |   LOAD AS SELECT           | SYS_TEMP_0FD9E9C9A_2ACFFE0 |       |       |            |          |
| 194 |    SORT GROUP BY ROLLUP    |                            |     2 |    60 |   765   (4)| 00:00:01 |
| 195 |     TABLE ACCESS FULL      | SYS_TEMP_0FD9E9C98_2ACFFE0 |   668K|    19M|   745   (1)| 00:00:01 |
| 196 |   SORT ORDER BY            |                            |   288 | 29952 |     3  (34)| 00:00:01 |
| 197 |    VIEW                    |                            |   288 | 29952 |     2   (0)| 00:00:01 |
| 198 |     TABLE ACCESS FULL      | SYS_TEMP_0FD9E9C9A_2ACFFE0 |   288 |  8640 |     2   (0)| 00:00:01 |
---------------------------------------------------------------------------------------------------------

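Is there a way to improve this?

Short answer

No. If you really need exact count_distinct results, I do not see any way to improve this.

If an approximation is acceptable, then using the function APPROX_COUNT_DISTINCT might be an option: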

select          col_3,
                col_4, 
                col_5, 
                col_6,
                col_7, 
                col_8,
                col_9,
                approx_count_distinct(col_1)    approx_count_distinct
from            t
group           by cube (
                         col_3,
                         col_4,
                         col_5,
                         col_6,
                         col_7,
                         col_8,
                         col_9
                        )
order by        col_3,
                col_4,
                col_5,
                col_6,
                col_7,
                col_8,
                col_9
                ;

Longer answer

I created this test table:

CREATE TABLE t AS
SELECT round(abs(dbms_random.normal)*10,0) AS col_1,
       round(dbms_random.VALUE(2,3),0) AS col_3,
       round(dbms_random.VALUE(3,4),0) AS col_4,
       round(dbms_random.VALUE(1,2),0) AS col_5,
       round(dbms_random.VALUE(1,2),0) AS col_6,
       round(dbms_random.VALUE(1,2),0) AS col_7,
       round(dbms_random.VALUE(1,2),0) AS col_8,
       round(dbms_random.VALUE(1,2),0) AS col_9
  FROM xmltable('1 to 20000');

I set statistics_level to ALL to gather detailed execution plan statistics:

ALTER SESSION SET statistics_level = 'ALL';
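
The execution plans in this answer carry the Starts, E-Rows, A-Rows, A-Time and Buffers columns that DBMS_XPLAN reports when runtime statistics are available. The exact command used to print them is not stated, so the following is only a sketch of one common way to obtain such a plan: run the statement, then read back the last cursor of the session with the ALLSTATS LAST format.

```sql
-- Sketch, assuming it runs in the same session right after the query of interest.
-- With sql_id and cursor_child_no left NULL, DISPLAY_CURSOR reports the
-- last statement executed by this session.
SELECT *
  FROM TABLE(dbms_xplan.display_cursor(
                sql_id          => NULL,
                cursor_child_no => NULL,
                format          => 'ALLSTATS LAST'
             ));
```

In SQL*Plus, SERVEROUTPUT usually has to be OFF for this to work, because otherwise the last cursor of the session is the internal DBMS_OUTPUT call rather than the query.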

Then I ran the original query against table t instead of tmp_test_data:

select          col_3,
                col_4, 
                col_5, 
                col_6,
                col_7, 
                col_8,
                col_9,
                count(distinct col_1)    count_distinct
from            t
group           by cube (
                         col_3,
                         col_4,
                         col_5,
                         col_6,
                         col_7,
                         col_8,
                         col_9
                        )
order by        col_3,
                col_4,
                col_5,
                col_6,
                col_7,
                col_8,
                col_9
                ;

This produces the following result:

COL_3      COL_4      COL_5      COL_6      COL_7      COL_8      COL_9 COUNT_DISTINCT
---------- ---------- ---------- ---------- ---------- ---------- ---------- --------------
         2          3          1          1          1          1          1             27
         2          3          1          1          1          1          2             24
         2          3          1          1          1          1                        31
...
                                                                           2             40
                                                                                         41

2.187 rows selected.
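
The blank cells in this output are the NULLs that CUBE generates for subtotal levels. In the test table every grouping column holds exactly two values, so each of the seven columns appears either as one of its two values or as a subtotal NULL. That gives 3^7 = 2187 rows, spread over 2^7 = 128 grouping sets. A minimal sketch to check this, using GROUPING_ID to label the grouping set each row belongs to (the names cube_rows, gid, result_rows and grouping_sets are made up for this illustration):

```sql
WITH cube_rows AS (
   -- gid is a bit vector: a bit is 1 when the corresponding column is
   -- aggregated away (subtotal NULL) in that row
   SELECT GROUPING_ID(col_3, col_4, col_5, col_6, col_7, col_8, col_9) AS gid,
          COUNT(DISTINCT col_1) AS count_distinct
     FROM t
    GROUP BY CUBE (col_3, col_4, col_5, col_6, col_7, col_8, col_9)
)
SELECT COUNT(*)            AS result_rows,
       COUNT(DISTINCT gid) AS grouping_sets
  FROM cube_rows;
```

For the test table t this should report 2187 result rows in 128 grouping sets. Selecting gid alongside the grouping columns is also a way to tell subtotal NULLs apart from NULLs stored in the data.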

and this execution plan:

---------------------------------------------------------------------------------------------------
| Id  | Operation                                |Starts | E-Rows | A-Rows |   A-Time   | Buffers |
---------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                         |     1 |        |   2187 |00:00:01.85 |      87 |
|   1 |  TEMP TABLE TRANSFORMATION               |     1 |        |   2187 |00:00:01.85 |      87 |
|   2 |   LOAD AS SELECT (CURSOR DURATION MEMORY)|     1 |        |      0 |00:00:00.07 |      86 |
|   3 |    HASH GROUP BY                         |     1 |    464 |   3224 |00:00:00.02 |      85 |
|   4 |     TABLE ACCESS FULL                    |     1 |  20000 |  20000 |00:00:00.01 |      85 |
|   5 |   SORT ORDER BY                          |     1 |     16 |   2187 |00:00:01.77 |       0 |
|   6 |    VIEW                                  |     1 |    408 |   2187 |00:00:01.75 |       0 |
|   7 |     VIEW                                 |     1 |    408 |   2187 |00:00:01.73 |       0 |
|   8 |      UNION-ALL                           |     1 |        |   2187 |00:00:01.72 |       0 |
|   9 |       SORT GROUP BY ROLLUP               |     1 |     16 |    192 |00:00:00.03 |       0 |
...
| 133 |       SORT GROUP BY ROLLUP               |     1 |      3 |      6 |00:00:00.03 |       0 |
| 134 |        TABLE ACCESS FULL                 |     1 |    464 |   3224 |00:00:00.01 |       0 |
| 135 |       SORT GROUP BY ROLLUP               |     1 |      2 |      3 |00:00:00.02 |       0 |
| 136 |        TABLE ACCESS FULL                 |     1 |    464 |   3224 |00:00:00.01 |       0 |
---------------------------------------------------------------------------------------------------

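The interesting columns are A-Rows (actual number of rows), A-Time (actual time spent) and Buffers (number of logical reads). The query took 1.85 seconds for 87 logical I/Os. All 64 SORT GROUP BY ROLLUP operations together took 1.75 seconds, which is about 0.03 seconds per operation. Oracle has to evaluate the number of distinct col_1 values for every combination. There is no shortcut as there is for COUNT(col_1). That is why it is expensive.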
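
Why no shortcut exists: COUNT is additive, so a subtotal can be derived by summing the counts of finer groups, which is presumably what lets the COUNT(col_1) plan get away with one aggregation pass followed by GENERATE CUBE. COUNT(DISTINCT) is not additive, because the same col_1 value can occur in several finer groups. The following sketch against the test table t illustrates this with just two grouping columns; the names pre, s and d and all column aliases are made up for the illustration.

```sql
WITH pre AS (
   -- finest level: one row per (col_3, col_4)
   SELECT col_3,
          col_4,
          COUNT(col_1)          AS cnt,
          COUNT(DISTINCT col_1) AS cnt_distinct
     FROM t
    GROUP BY col_3, col_4
)
SELECT s.cnt_from_groups,       -- equals d.cnt_direct: counts can be summed up
       d.cnt_direct,
       s.distinct_from_groups,  -- too high whenever a col_1 value occurs in more than one group
       d.distinct_direct
  FROM (SELECT SUM(cnt)          AS cnt_from_groups,
               SUM(cnt_distinct) AS distinct_from_groups
          FROM pre) s
 CROSS JOIN
       (SELECT COUNT(col_1)          AS cnt_direct,
               COUNT(DISTINCT col_1) AS distinct_direct
          FROM t) d;
```

Because the distinct counts cannot be rolled up from finer levels, each subtotal level has to look at the underlying col_1 values again.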

However, we can easily come up with an alternative query:

WITH
   combi AS (
      SELECT col_3, 
             col_4, 
             col_5, 
             col_6, 
             col_7, 
             col_8, 
             col_9
        FROM t
       GROUP BY CUBE (
                   col_3, 
                   col_4, 
                   col_5, 
                   col_6, 
                   col_7, 
                   col_8, 
                   col_9
                )
   ),
   fullset AS (
      SELECT t.col_1,
             combi.col_3, 
             combi.col_4, 
             combi.col_5, 
             combi.col_6, 
             combi.col_7, 
             combi.col_8, 
             combi.col_9
        FROM combi
        JOIN t
          ON     (t.col_3 = combi.col_3 or combi.col_3 is null)
             AND (t.col_4 = combi.col_4 or combi.col_4 is null)
             AND (t.col_5 = combi.col_5 or combi.col_5 is null)
             AND (t.col_6 = combi.col_6 or combi.col_6 is null)
             AND (t.col_7 = combi.col_7 or combi.col_7 is null)
             AND (t.col_8 = combi.col_8 or combi.col_8 is null)
             AND (t.col_9 = combi.col_9 or combi.col_9 is null)
   )
SELECT col_3, 
       col_4, 
       col_5, 
       col_6, 
       col_7, 
       col_8, 
       col_9,
       COUNT(DISTINCT col_1) as count_distinct_col_1
  FROM fullset
 GROUP BY col_3, 
          col_4, 
          col_5, 
          col_6, 
          col_7, 
          col_8, 
          col_9
 ORDER BY col_3, 
          col_4, 
          col_5, 
          col_6, 
          col_7, 
          col_8, 
          col_9;

It produces the same result:

COL_3      COL_4      COL_5      COL_6      COL_7      COL_8      COL_9 COUNT_DISTINCT_COL_1
---------- ---------- ---------- ---------- ---------- ---------- ---------- --------------------
         2          3          1          1          1          1          1                   27
         2          3          1          1          1          1          2                   24
         2          3          1          1          1          1                              31
...
                                                                           2                   40
                                                                                               41

2.187 rows selected.

with fewer lines in the execution plan:

-------------------------------------------------------------------------------------------------
| Id  | Operation                 | Name      | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
-------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT          |           |      1 |        |   2187 |00:00:41.58 |     185K|
|   1 |  SORT GROUP BY            |           |      1 |     16 |   2187 |00:00:41.58 |     185K|
|   2 |   VIEW                    | VM_NWVW_1 |      1 |    464 |  67812 |00:00:41.54 |     185K|
|   3 |    HASH GROUP BY          |           |      1 |    464 |  67812 |00:00:41.54 |     185K|
|   4 |     NESTED LOOPS          |           |      1 |   2500 |   2560K|00:00:31.77 |     185K|
|   5 |      VIEW                 |           |      1 |     16 |   2187 |00:00:00.37 |      85 |
|   6 |       SORT GROUP BY       |           |      1 |     16 |   2187 |00:00:00.36 |      85 |
|   7 |        GENERATE CUBE      |           |      1 |     16 |  16384 |00:00:00.27 |      85 |
|   8 |         SORT GROUP BY     |           |      1 |     16 |    128 |00:00:00.20 |      85 |
|   9 |          TABLE ACCESS FULL| T         |      1 |  20000 |  20000 |00:00:00.10 |      85 |
|* 10 |      TABLE ACCESS FULL    | T         |   2187 |    156 |   2560K|00:00:13.09 |     185K|
-------------------------------------------------------------------------------------------------

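Let's look at operation 5. All 2187 combinations are produced in 0.37 seconds, and 85 logical I/Os are needed to read the full table t. Then the full table t is accessed again for each of these 2187 combinations (see operations 4 and 10). The complete join takes 31.77 seconds. The remaining group by operations take 9.77 seconds, and the final sort takes just 0.04 seconds.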

This alternative query looks simple, but it is much slower because of the additional I/O operations required to join combi with t in the named query fullset.

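The original query is better in terms of I/O and runtime. Granted, its execution plan looks extensive, but it is efficient. In the end, the DISTINCT in COUNT(DISTINCT col_1) is what drives the complexity. It is just one word, but a completely different algorithm. Hence, if exact results are important, I do not see how the original query can be improved. However, if an approximation is good enough, then using the function APPROX_COUNT_DISTINCT might be an option: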

select          col_3,
                col_4, 
                col_5, 
                col_6,
                col_7, 
                col_8,
                col_9,
                approx_count_distinct(col_1)    approx_count_distinct
from            t
group           by cube (
                         col_3,
                         col_4,
                         col_5,
                         col_6,
                         col_7,
                         col_8,
                         col_9
                        )
order by        col_3,
                col_4,
                col_5,
                col_6,
                col_7,
                col_8,
                col_9
                ;

The results are similar:

COL_3      COL_4      COL_5      COL_6      COL_7      COL_8      COL_9 APPROX_COUNT_DISTINCT
---------- ---------- ---------- ---------- ---------- ---------- ---------- ---------------------
         2          3          1          1          1          1          1                    27
         2          3          1          1          1          1          2                    24
         2          3          1          1          1          1                               31
...
                                                                           2                    40
                                                                                                41

2.187 rows selected.
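
To judge whether the approximation is good enough for a given data set, the exact and approximate counts can be put side by side. A minimal sketch on two grouping columns of the test table t; the aliases e, a, exact_cnt, approx_cnt and diff are made up here, and it assumes a release that provides APPROX_COUNT_DISTINCT (to my knowledge 12.1.0.2 or later):

```sql
SELECT e.col_3,
       e.col_4,
       e.exact_cnt,
       a.approx_cnt,
       a.approx_cnt - e.exact_cnt AS diff
  FROM (SELECT col_3, col_4, COUNT(DISTINCT col_1) AS exact_cnt
          FROM t
         GROUP BY col_3, col_4) e
  JOIN (SELECT col_3, col_4, APPROX_COUNT_DISTINCT(col_1) AS approx_cnt
          FROM t
         GROUP BY col_3, col_4) a
    ON a.col_3 = e.col_3
   AND a.col_4 = e.col_4
 ORDER BY e.col_3, e.col_4;
```

The join on col_3 and col_4 only works here because neither column contains NULLs in the test table. With so few distinct col_1 values the difference is likely to be zero or close to it; the error is more relevant for columns with many distinct values.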

However, the execution plan is more complex:

----------------------------------------------------------------------------------------------------
| Id  | Operation                                | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
----------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                         |      1 |        |   2187 |00:00:09.88 |      87 |
|   1 |  TEMP TABLE TRANSFORMATION               |      1 |        |   2187 |00:00:09.88 |      87 |
|   2 |   LOAD AS SELECT (CURSOR DURATION MEMORY)|      1 |        |      0 |00:00:00.33 |      86 |
|   3 |    TABLE ACCESS FULL                     |      1 |  20000 |  20000 |00:00:00.08 |      85 |
|   4 |   LOAD AS SELECT (CURSOR DURATION MEMORY)|      1 |        |      0 |00:00:00.16 |       0 |
|   5 |    SORT GROUP BY ROLLUP APPROX           |      1 |     16 |    192 |00:00:00.16 |       0 |
|   6 |     TABLE ACCESS FULL                    |      1 |  20000 |  20000 |00:00:00.07 |       0 |
...
| 190 |   LOAD AS SELECT (CURSOR DURATION MEMORY)|      1 |        |      0 |00:00:00.14 |       0 |
| 191 |    SORT GROUP BY ROLLUP APPROX           |      1 |      3 |      6 |00:00:00.14 |       0 |
| 192 |     TABLE ACCESS FULL                    |      1 |  20000 |  20000 |00:00:00.07 |       0 |
| 193 |   LOAD AS SELECT (CURSOR DURATION MEMORY)|      1 |        |      0 |00:00:00.14 |       0 |
| 194 |    SORT GROUP BY ROLLUP APPROX           |      1 |      2 |      3 |00:00:00.14 |       0 |
| 195 |     TABLE ACCESS FULL                    |      1 |  20000 |  20000 |00:00:00.07 |       0 |
| 196 |   SORT ORDER BY                          |      1 |     16 |   2187 |00:00:00.01 |       0 |
| 197 |    VIEW                                  |      1 |     16 |   2187 |00:00:00.01 |       0 |
| 198 |     TABLE ACCESS FULL                    |      1 |     16 |   2187 |00:00:00.01 |       0 |
----------------------------------------------------------------------------------------------------

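And on this small test table the query is slower than the original one (9.88 versus 1.85 seconds). It is expected to be faster on large data sets. So, if 100% accuracy is not required, I suggest giving APPROX_COUNT_DISTINCT a try.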

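Runtime overhead of statistics_level ALL

To get the actual number of rows and the actual time spent in the execution plan, I ran all queries with statistics_level ALL. This causes a significant performance overhead (which is to be expected; see also Jonathan Lewis's blog post about gather_plan_statistics). With statistics_level set to TYPICAL, all queries run faster. Here are the runtimes in seconds, including the time to print the results on the client: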

Query                  Runtime with 'ALL'  Runtime with 'TYPICAL' 
----------------       ------------------  ----------------------
Original (good)                     2.615                   0.977
Alternative (bad)                  41.773                   4.991
Approx_Count_Distinct              10.600                   1.113
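
If the overhead of statistics_level ALL for the whole session is a concern, the gather_plan_statistics hint collects row source statistics for a single statement only. A sketch, assuming the session is otherwise left at TYPICAL and using a reduced two-column cube on the test table t purely for illustration:

```sql
ALTER SESSION SET statistics_level = TYPICAL;

-- Only this statement collects A-Rows, A-Time and Buffers per plan operation
SELECT /*+ gather_plan_statistics */
       col_3,
       col_4,
       COUNT(DISTINCT col_1) AS count_distinct
  FROM t
 GROUP BY CUBE (col_3, col_4)
 ORDER BY col_3, col_4;

-- Read back the plan of the statement above, including the actual figures
SELECT *
  FROM TABLE(dbms_xplan.display_cursor(format => 'ALLSTATS LAST'));
```

The hint adds overhead to the hinted statement as well, so its timings are still inflated compared with an unhinted run.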
