How Hash Group By and Sort Group By internally work in Oracle?

I tried searching for how the GROUP BY clause works internally and found mentions of Hash Group By and Sort Group By, but nothing about their internals. My question is: how do they work internally, and what is the fundamental difference between them? What data structures and algorithms do they use?

Oracle's SQL Tuning Guide says in the PLAN_TABLE reference:

HASH GROUP BY: Operation hashing a set of rows into groups for a query with a GROUP BY clause.

SORT GROUP BY: Operation sorting a set of rows into groups for a query with a GROUP BY clause.

Hmm. Not very enlightening. Sort merge join and hash join are documented in Doc Id 41954.1, and I believe that the sort and hash parts of GROUP BY work similarly to their JOIN counterparts.

Algorithm

Given an example table:

CREATE TABLE t2 (id VARCHAR2(30), amount NUMBER);
INSERT INTO t2 VALUES ('A', 10);
INSERT INTO t2 VALUES ('C',  5);
INSERT INTO t2 VALUES ('B',  1);
INSERT INTO t2 VALUES ('B',  2);
INSERT INTO t2 VALUES ('A',  3);
INSERT INTO t2 VALUES ('C',  1);
INSERT INTO t2 VALUES ('A',  7);

SELECT id, sum(amount)
  FROM t2
 GROUP BY id;

My understanding is that a SORT GROUP BY would sort the whole table (in memory or on disk) and then run the aggregate function, for instance SUM(amount):

ID   Amount SUM
A    10
A     3
A     7     10+3+7=20

B     1
B     2     1+2=3

C     5
C     1     5+1=6
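The sort-then-aggregate idea illustrated above can be sketched in a few lines of Python. This is only a model of the technique under the stated understanding, not Oracle's actual implementation; the function name sort_group_by and the in-memory list are assumptions made for illustration.

```python
from itertools import groupby
from operator import itemgetter

# Same rows as table t2 in the example (id, amount).
rows = [("A", 10), ("C", 5), ("B", 1), ("B", 2), ("A", 3), ("C", 1), ("A", 7)]

def sort_group_by(rows):
    """Sort rows by the grouping key, then sum each run of equal keys."""
    ordered = sorted(rows, key=itemgetter(0))  # the sort step
    # After sorting, all rows of a group are adjacent, so one pass suffices.
    return [(key, sum(amount for _, amount in run))
            for key, run in groupby(ordered, key=itemgetter(0))]

print(sort_group_by(rows))
```

In this model the cost is dominated by the sort; the aggregation itself is a single sequential pass that only ever holds one running total.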

Whereas a HASH GROUP BY would scan the table once, compute a hash value of the grouping key for each row, and put the row into a bucket (in memory or on disk):

SELECT id, ora_hash(id, 4), amount from t2;

ID Bucket  Amount  Hash table
A     2      10    Bucket#2: A=10
C     4       5    Bucket#4: C=5
B     2       1    Bucket#2: A=10, B=1
B     2       2    Bucket#2: A=10, B=1+2
A     2       3    Bucket#2: A=10+3, B=1+2
C     4       1    Bucket#4: C=5+1
A     2       7    Bucket#2: A=10+3+7, B=1+2

After putting all the values into buckets, it needs to scan the hash table to calculate the aggregate:

Bucket#2: A=10+3+7, B=1+2
Bucket#4: C=5+1
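The bucket-based approach can be modelled the same way. The sketch below is an assumption-laden illustration, not Oracle's algorithm: NUM_BUCKETS and the toy bucket_of function are hypothetical stand-ins for whatever hash function and table size the database really uses, and each bucket keeps a running total per key instead of the raw rows.

```python
# Same rows as table t2 in the example (id, amount).
rows = [("A", 10), ("C", 5), ("B", 1), ("B", 2), ("A", 3), ("C", 1), ("A", 7)]

NUM_BUCKETS = 5  # hypothetical bucket count, chosen only for the example

def bucket_of(key):
    """Toy hash: byte sum of the key modulo the bucket count (an assumption)."""
    return sum(key.encode()) % NUM_BUCKETS

def hash_group_by(rows):
    """Scan rows once, keeping a running sum per key inside its bucket."""
    buckets = [dict() for _ in range(NUM_BUCKETS)]
    for key, amount in rows:
        slot = buckets[bucket_of(key)]
        slot[key] = slot.get(key, 0) + amount  # colliding keys share a bucket
    # Final scan over the hash table to emit one row per group.
    return {key: total for slot in buckets for key, total in slot.items()}

print(hash_group_by(rows))
```

Note that the output order of this sketch follows the bucket layout rather than the key order, so it should not be relied on for sorted results.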

Performance

We need a bigger table to measure the performance:

CREATE TABLE t AS 
SELECT RPAD(object_type, 3000, 'x') as gby, o.* 
  FROM all_objects o WHERE rownum <= 50000; COMMIT;
INSERT INTO t SELECT * FROM t; COMMIT;
EXEC dbms_stats.gather_table_stats(user, 't');

You can ask for a HASH GROUP BY with the hint USE_HASH_AGGREGATION:

SELECT /*+ USE_HASH_AGGREGATION */ gby, count(*)
  FROM t
 GROUP BY gby; 

Likewise for a SORT GROUP BY with the hint NO_USE_HASH_AGGREGATION:

SELECT /*+ NO_USE_HASH_AGGREGATION */ gby, count(*)
  FROM t
 GROUP BY gby; 

If you dig out the SQL_IDs, you can inspect the amount of memory needed by each operation:

SELECT * FROM v$sql WHERE sql_text LIKE '%USE_HASH_AGGREGATION%'; 

SELECT * FROM v$sql_workarea WHERE sql_id IN ('663t56n1tdr59','fp5z7z1fyz42p');

OPERATION_TYPE  EST_OPT_SIZE LAST_MEM_USED ACTIVE_TIME MAX_TEMP
GROUP BY (HASH)       697344       1519616      325145        -
GROUP BY (SORT)       145408        129024      460975        -

So, GROUP BY HASH needed 1519616 bytes of memory and ran in 0.325145 seconds, while GROUP BY SORT used less than a tenth of that memory but ran slightly longer (0.460975 seconds). Both ran completely in memory.

If the work area doesn't fit in memory and spills to disk (which we can force here by artificially lowering the memory limit), the column max_tempseg_size is filled:

ALTER SESSION SET workarea_size_policy = MANUAL;
ALTER SESSION SET sort_area_size = 10000;

OPERATION_TYPE  EST_OPT_SIZE LAST_MEM_USED ACTIVE_TIME  MAX_TEMP
GROUP BY (HASH)       697344        623616    22756184 268435456
GROUP BY (SORT)       103424         43008     1064479   4194304

So, while spilling to disk, GROUP BY HASH needed 256 MB of disk and ran in about 22.8 seconds, while GROUP BY SORT needed only 4 MB of disk and ran in about 1.1 seconds.
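For reference, the conversion behind these figures can be reproduced with a small sketch. It assumes, as the interpretation above does, that ACTIVE_TIME is in microseconds and MAX_TEMP is in bytes; the helper names to_seconds and to_mib are made up for this example.

```python
def to_seconds(microseconds):
    """Convert a microsecond count to seconds (unit is an assumption)."""
    return microseconds / 1_000_000

def to_mib(num_bytes):
    """Convert a byte count to binary megabytes (1 MiB = 2**20 bytes)."""
    return num_bytes / 2**20

# (ACTIVE_TIME, MAX_TEMP) copied from the spilled-to-disk run above.
spilled = {
    "GROUP BY (HASH)": (22756184, 268435456),
    "GROUP BY (SORT)": (1064479, 4194304),
}

for operation, (active_time, max_temp) in spilled.items():
    print(f"{operation}: {to_seconds(active_time):.1f} s, "
          f"{to_mib(max_temp):.0f} MiB temp")
```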
