
Put many columns in GROUP BY clause in Oracle SQL

In an Oracle 11g database, suppose we have two tables, CUSTOMER and PAYMENT, as follows:

Customer

CUSTOMER_ID | CUSTOMER_NAME | CUSTOMER_AGE | CUSTOMER_CREATION_DATE
--------------------------------------------------------------------
001         | John          | 30           | 1 Jan 2017
002         | Jack          | 10           | 2 Jan 2017
003         | Jim           | 50           | 3 Jan 2017

Payment

CUSTOMER_ID | PAYMENT_ID | PAYMENT_AMOUNT
-------------------------------------------
001         | 900        | 100.00
001         | 901        | 200.00
001         | 902        | 300.00
003         | 903        | 999.00
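For anyone who wants to reproduce the queries in this thread, here is a minimal setup sketch. The column types and constraints are assumptions, since the question only shows sample values. CUSTOMER_ID is stored as a string so the leading zeros survive.

```sql
-- Hypothetical DDL: types and keys are guesses based on the sample data above.
CREATE TABLE CUSTOMER (
  CUSTOMER_ID            VARCHAR2(3) PRIMARY KEY,
  CUSTOMER_NAME          VARCHAR2(50),
  CUSTOMER_AGE           NUMBER(3),
  CUSTOMER_CREATION_DATE DATE
);

CREATE TABLE PAYMENT (
  CUSTOMER_ID    VARCHAR2(3) REFERENCES CUSTOMER (CUSTOMER_ID),
  PAYMENT_ID     NUMBER PRIMARY KEY,
  PAYMENT_AMOUNT NUMBER(10,2)
);

-- Sample rows from the question (one INSERT per row, which works on 11g).
INSERT INTO CUSTOMER VALUES ('001', 'John', 30, DATE '2017-01-01');
INSERT INTO CUSTOMER VALUES ('002', 'Jack', 10, DATE '2017-01-02');
INSERT INTO CUSTOMER VALUES ('003', 'Jim',  50, DATE '2017-01-03');

INSERT INTO PAYMENT VALUES ('001', 900, 100.00);
INSERT INTO PAYMENT VALUES ('001', 901, 200.00);
INSERT INTO PAYMENT VALUES ('001', 902, 300.00);
INSERT INTO PAYMENT VALUES ('003', 903, 999.00);

COMMIT;
```

Note that customer 002 has no payments, which matters when choosing between an inner join and an outer join.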

We want to write a SQL query that returns all columns from table CUSTOMER together with the sum of all payments for each customer. There are many possible ways to do this, but I would like to ask which of the following is better.

Solution 1

SELECT C.CUSTOMER_ID
, MAX(C.CUSTOMER_NAME) CUSTOMER_NAME
, MAX(C.CUSTOMER_AGE) CUSTOMER_AGE
, MAX(C.CUSTOMER_CREATION_DATE) CUSTOMER_CREATION_DATE
, SUM(P.PAYMENT_AMOUNT) TOTAL_PAYMENT_AMOUNT
FROM CUSTOMER C
JOIN PAYMENT P ON (P.CUSTOMER_ID = C.CUSTOMER_ID)
GROUP BY C.CUSTOMER_ID;

Solution 2

SELECT C.CUSTOMER_ID
, C.CUSTOMER_NAME
, C.CUSTOMER_AGE
, C.CUSTOMER_CREATION_DATE
, SUM(P.PAYMENT_AMOUNT) PAYMENT_AMOUNT
FROM CUSTOMER C
JOIN PAYMENT P ON (P.CUSTOMER_ID = C.CUSTOMER_ID)
GROUP BY C.CUSTOMER_ID, C.CUSTOMER_NAME, C.CUSTOMER_AGE, C.CUSTOMER_CREATION_DATE

Please notice that in Solution 1 I use MAX not because I actually want the maximum value, but because I want exactly one value from columns that I know are identical for all rows with the same CUSTOMER_ID.

In Solution 2, I avoid the misleading MAX in the SELECT part by putting the columns in the GROUP BY part instead.

With my current knowledge, I prefer Solution 1, because I think it is more important for the GROUP BY part to be easy to understand than the SELECT part. I would put only the set of unique keys there to express the intention of the query, so that the application can infer the expected number of rows. But I don't know about the performance.

I ask this question because I am reviewing a code change to a big SQL query that puts 50 columns in the GROUP BY clause, because the author of the change wants to avoid the MAX function in the SELECT part. I know we could refactor the query in some way to avoid putting the irrelevant columns in both the GROUP BY and SELECT parts, but please disregard that option, because it would affect the application logic and require more time for testing.


Update

As everyone suggested, I have just tested both versions of my big query. The query is complex: it has 69 lines involving more than 20 tables, and the execution plan is more than 190 lines long, so I think this is not the place to show it.

My production data is quite small at the moment. It has about 4000 customers, and the query was run against the whole database. Only table CUSTOMER and a few reference tables show TABLE ACCESS FULL in the execution plan; the other tables are accessed through indexes. The execution plans for the two versions differ slightly in the aggregation operation (HASH GROUP BY vs SORT AGGREGATE) in some parts.

Both versions take about 13 minutes, with no significant difference.

I have also tested simplified versions similar to the SQL in the question. Both versions have exactly the same execution plan and elapsed time.

With the current information, I think the most reasonable answer is that the relative quality of the two versions cannot be predicted without testing, because the optimizer decides how the query is actually executed. I would very much appreciate any information that supports or refutes this idea.

Another option is:

SELECT C.CUSTOMER_ID
, C.CUSTOMER_NAME
, C.CUSTOMER_AGE
, C.CUSTOMER_CREATION_DATE
, P.PAYMENT_AMOUNT
FROM CUSTOMER C
JOIN (
 SELECT CUSTOMER_ID, SUM(PAYMENT_AMOUNT) PAYMENT_AMOUNT
 FROM PAYMENT 
 GROUP BY CUSTOMER_ID
) P ON (P.CUSTOMER_ID = C.CUSTOMER_ID)

To decide which of the three is better, just test them and look at the execution plans.
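As a sketch of how the plans can be compared in Oracle, `EXPLAIN PLAN` together with `DBMS_XPLAN.DISPLAY` prints the optimizer's estimated plan for a statement. The example assumes the CUSTOMER and PAYMENT tables from the question and a usable default `PLAN_TABLE`; repeat it once per candidate query and compare the output.

```sql
-- Estimated plan for the MAX-based version (Solution 1 in the question).
EXPLAIN PLAN FOR
SELECT C.CUSTOMER_ID
, MAX(C.CUSTOMER_NAME) CUSTOMER_NAME
, MAX(C.CUSTOMER_AGE) CUSTOMER_AGE
, MAX(C.CUSTOMER_CREATION_DATE) CUSTOMER_CREATION_DATE
, SUM(P.PAYMENT_AMOUNT) TOTAL_PAYMENT_AMOUNT
FROM CUSTOMER C
JOIN PAYMENT P ON (P.CUSTOMER_ID = C.CUSTOMER_ID)
GROUP BY C.CUSTOMER_ID;

-- Show the plan that was just stored in PLAN_TABLE.
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);

-- Estimated plan for the version that aggregates PAYMENT first.
EXPLAIN PLAN FOR
SELECT C.CUSTOMER_ID
, C.CUSTOMER_NAME
, C.CUSTOMER_AGE
, C.CUSTOMER_CREATION_DATE
, P.PAYMENT_AMOUNT
FROM CUSTOMER C
JOIN (
 SELECT CUSTOMER_ID, SUM(PAYMENT_AMOUNT) PAYMENT_AMOUNT
 FROM PAYMENT
 GROUP BY CUSTOMER_ID
) P ON (P.CUSTOMER_ID = C.CUSTOMER_ID);

SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);
```

These are estimates only. Measuring the elapsed time on realistic data volumes is still needed to confirm which version behaves better.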

Neither. Do the sum on payment first, then join the results.

select C.*, p.total_payment -- c.* gets all columns from table alias c without typing them all out
from Customer C
left join -- I've used left in case you want to include customers with no orders
(
select customer_id, sum(payment_amount) as total_payment
from Payment
group by customer_id
) p
on p.customer_id = c.customer_id
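With the left join, a customer without any payments (002 in the sample data) comes back with a NULL total. If a zero is preferred, one possible variation is to wrap the total in `NVL`. This is only a sketch on top of the query above, not part of the original answer.

```sql
-- Same derived-table approach, but customers without payments show 0 instead of NULL.
SELECT c.*, NVL(p.total_payment, 0) AS total_payment
FROM customer c
LEFT JOIN (
  SELECT customer_id, SUM(payment_amount) AS total_payment
  FROM payment
  GROUP BY customer_id
) p ON p.customer_id = c.customer_id
ORDER BY c.customer_id;
```

On the sample data this returns 600 for customer 001, 0 for 002, and 999 for 003. The inner-join versions in the question omit customer 002 entirely.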

Solution 1 is costly.

Even if the optimizer can avoid the unnecessary sorting, at some point you will be forced to add indexes or constraints on irrelevant columns to improve performance. That is not good practice in the long term.

Solution 2 is the Oracle way.

The Oracle documentation states that:

In a query with a GROUP BY clause, the select list must contain only aggregates or grouping columns

Oracle engineers had valid reasons for this. It does not apply to some other RDBMSs, where you can simply write GROUP BY c.customerID and all will be fine.
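To make the restriction concrete, here is a sketch of what happens when a column is selected without being aggregated or grouped. It assumes the CUSTOMER and PAYMENT tables from the question.

```sql
-- Expected to fail in Oracle with "ORA-00979: not a GROUP BY expression",
-- because CUSTOMER_NAME is neither aggregated nor listed in GROUP BY.
SELECT C.CUSTOMER_ID
, C.CUSTOMER_NAME
, SUM(P.PAYMENT_AMOUNT) TOTAL_PAYMENT_AMOUNT
FROM CUSTOMER C
JOIN PAYMENT P ON (P.CUSTOMER_ID = C.CUSTOMER_ID)
GROUP BY C.CUSTOMER_ID;
```

Wrapping the column in an aggregate such as MAX, or adding it to the GROUP BY list, are the two fixes discussed in the question.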

If the goal is code readability, a -- comment would be cheaper.

In general, not embracing a platform's principles has a cost: more code, weird code, memory, disk space, performance, etc.

In Solution 1 the query repeats the MAX function for each column. I don't know exactly how the MAX function works, but I assume that it sorts all the elements in the column and then picks the first one (best-case scenario). It is a kind of time bomb: when your table gets bigger, this query will get worse very fast. So if you are concerned about performance, you should pick Solution 2. It looks messier but will be better for the application.
