
Select first row in each GROUP BY group?

As the title suggests, I'd like to select the first row of each set of rows grouped with a GROUP BY.

Specifically, if I've got a purchases table that looks like this:

SELECT * FROM purchases;

My output:

 id | customer | total
----+----------+-------
  1 | Joe      |     5
  2 | Sally    |     3
  3 | Joe      |     2
  4 | Sally    |     1

I'd like to query for the id of the largest purchase (total) made by each customer. Something like this:

SELECT FIRST(id), customer, FIRST(total)
FROM  purchases
GROUP BY customer
ORDER BY total DESC;

Expected output:

 FIRST(id) | customer | FIRST(total)
-----------+----------+--------------
         1 | Joe      |            5
         2 | Sally    |            3

DISTINCT ON is typically simplest and fastest for this in PostgreSQL.
(For performance optimization for certain workloads, see below.)

SELECT DISTINCT ON (customer)
       id, customer, total
FROM   purchases
ORDER  BY customer, total DESC, id;

Or shorter (if not as clear) with ordinal numbers of output columns:

SELECT DISTINCT ON (2)
       id, customer, total
FROM   purchases
ORDER  BY 2, 3 DESC, 1;

If total can be NULL, add NULLS LAST:

...
ORDER  BY customer, total DESC NULLS LAST, id;

Works either way, but you'll want to match existing indexes.

db<>fiddle here

Major points

DISTINCT ON is a PostgreSQL extension of the standard, where only DISTINCT on the whole SELECT list is defined.

List any number of expressions in the DISTINCT ON clause; the combined row value defines duplicates. The manual:

Obviously, two rows are considered distinct if they differ in at least one column value. Null values are considered equal in this comparison.

Bold emphasis mine.

DISTINCT ON can be combined with ORDER BY. Leading expressions in ORDER BY must be in the set of expressions in DISTINCT ON, but you can rearrange order among those freely. Example.
You can add additional expressions to ORDER BY to pick a particular row from each group of peers. Or, as the manual puts it:

The DISTINCT ON expression(s) must match the leftmost ORDER BY expression(s). The ORDER BY clause will normally contain additional expression(s) that determine the desired precedence of rows within each DISTINCT ON group.

I added id as the last item to break ties:
"Pick the row with the smallest id from each group sharing the highest total."
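To see that pick order concretely, here is a minimal plain-Python sketch (not Postgres) of the semantics: sort exactly like the ORDER BY, then keep the first row seen per customer.

```python
rows = [  # (id, customer, total) - the sample data from the question
    (1, "Joe", 5),
    (2, "Sally", 3),
    (3, "Joe", 2),
    (4, "Sally", 1),
]

def first_per_group(rows):
    picked = {}
    # Sort like ORDER BY customer, total DESC, id.
    for r in sorted(rows, key=lambda r: (r[1], -r[2], r[0])):
        picked.setdefault(r[1], r)  # keep only the first row per customer
    return sorted(picked.values())

print(first_per_group(rows))  # [(1, 'Joe', 5), (2, 'Sally', 3)]
```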

To order results in a way that disagrees with the sort order determining the first per group, you can nest the above query in an outer query with another ORDER BY. Example.

If total can be NULL, you most probably want the row with the greatest non-null value. Add NULLS LAST as demonstrated.

The SELECT list is not constrained by expressions in DISTINCT ON or ORDER BY in any way:

  • You don't have to include any of the expressions in DISTINCT ON or ORDER BY.

  • You can include any other expression in the SELECT list. This is instrumental for replacing complex subqueries and aggregate / window functions.

I tested with Postgres versions 8.3 – 15. But the feature has been there at least since version 7.1, so basically always.

Index

The perfect index for the above query would be a multi-column index spanning all three columns in matching sequence and with matching sort order:

CREATE INDEX purchases_3c_idx ON purchases (customer, total DESC, id);

May be too specialized. But use it if read performance for the particular query is crucial. If you have DESC NULLS LAST in the query, use the same in the index so that sort order matches and the index is perfectly applicable.

Effectiveness / Performance optimization

Weigh cost and benefit before creating tailored indexes for each query. The potential of the above index largely depends on data distribution.

The index is used because it delivers pre-sorted data. In Postgres 9.2 or later the query can also benefit from an index-only scan if the index is smaller than the underlying table. The index has to be scanned in its entirety, though. Example.

For few rows per customer (high cardinality in column customer), this is very efficient. Even more so if you need sorted output anyway. The benefit shrinks with a growing number of rows per customer.
Ideally, you have enough work_mem to process the involved sort step in RAM and not spill to disk. But generally, setting work_mem too high can have adverse effects. Consider SET LOCAL for exceptionally big queries. Find how much you need with EXPLAIN ANALYZE. Mention of "Disk:" in the sort step indicates the need for more.

For many rows per customer (low cardinality in column customer), a loose index scan (aka "skip scan") would be (much) more efficient, but that's not implemented up to Postgres 14. (An implementation for index-only scans is in development for Postgres 15. See here and here.)
For now, there are faster query techniques to substitute for this. In particular if you have a separate table holding unique customers, which is the typical use case. But also if you don't:
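As an illustration only, here is a hypothetical plain-Python sketch of the loose-index-scan idea (this is what the rCTE technique below emulates in SQL): read the best entry for one customer, then jump straight past that customer's remaining entries instead of scanning them.

```python
import bisect

# Model the index as a sorted list of (customer, -total, id) tuples,
# so the first entry per customer is its highest total.
index = sorted([
    ("Joe", -5, 1), ("Joe", -2, 3),
    ("Sally", -3, 2), ("Sally", -1, 4),
])

def skip_scan(index):
    out, pos = [], 0
    while pos < len(index):
        customer, neg_total, id_ = index[pos]  # best row for this customer
        out.append((id_, customer, -neg_total))
        # Jump past the remaining entries of this customer in O(log n).
        pos = bisect.bisect_right(index, (customer, float("inf"), 0))
    return out

print(skip_scan(index))  # [(1, 'Joe', 5), (2, 'Sally', 3)]
```

With few rows per customer the jump saves nothing; with thousands of rows per customer it skips almost the whole index, which is why the data distribution matters so much here.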

Benchmarks

See separate answer.

On databases that support CTEs and window functions:

WITH summary AS (
    SELECT p.id, 
           p.customer, 
           p.total, 
           ROW_NUMBER() OVER(PARTITION BY p.customer 
                                 ORDER BY p.total DESC) AS rank
      FROM PURCHASES p)
 SELECT *
   FROM summary
 WHERE rank = 1
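Assuming SQLite 3.25+ (which added window functions) as a stand-in engine, the same technique can be exercised end to end with Python's built-in sqlite3 module on the question's sample data. This is a sketch, not Postgres:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE purchases (id INTEGER, customer TEXT, total INTEGER);
    INSERT INTO purchases VALUES (1,'Joe',5),(2,'Sally',3),(3,'Joe',2),(4,'Sally',1);
""")
rows = conn.execute("""
    WITH cte AS (
       SELECT id, customer, total
            , row_number() OVER (PARTITION BY customer ORDER BY total DESC) AS rn
       FROM   purchases
    )
    SELECT id, customer, total FROM cte WHERE rn = 1 ORDER BY customer;
""").fetchall()
print(rows)  # [(1, 'Joe', 5), (2, 'Sally', 3)]
```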

Supported by any database:

But you need to add logic to break ties:

  SELECT MIN(x.id),  -- change to MAX if you want the highest
         x.customer, 
         x.total
    FROM PURCHASES x
    JOIN (SELECT p.customer,
                 MAX(total) AS max_total
            FROM PURCHASES p
        GROUP BY p.customer) y ON y.customer = x.customer
                              AND y.max_total = x.total
GROUP BY x.customer, x.total
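A quick check of the tie-breaking join, again sketched on SQLite through Python's sqlite3 (any engine with subqueries behaves the same way):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE purchases (id INTEGER, customer TEXT, total INTEGER);
    INSERT INTO purchases VALUES (1,'Joe',5),(2,'Sally',3),(3,'Joe',2),(4,'Sally',1);
""")
winners = conn.execute("""
    SELECT MIN(x.id), x.customer, x.total
    FROM   purchases x
    JOIN  (SELECT customer, MAX(total) AS max_total
           FROM   purchases
           GROUP  BY customer) y
           ON y.customer = x.customer AND y.max_total = x.total
    GROUP  BY x.customer, x.total
    ORDER  BY x.customer;
""").fetchall()
print(winners)  # [(1, 'Joe', 5), (2, 'Sally', 3)]
```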

Benchmarks

I tested the most interesting candidates:

  • Initially with Postgres 9.4 and 9.5.
  • Added accented tests for Postgres 13 later.

Basic test setup

Main table: purchases:

CREATE TABLE purchases (
  id          serial  -- PK constraint added below
, customer_id int     -- REFERENCES customer
, total       int     -- could be amount of money in Cent
, some_column text    -- to make the row bigger, more realistic
);

Dummy data (with some dead tuples), PK, index:

INSERT INTO purchases (customer_id, total, some_column)    -- 200k rows
SELECT (random() * 10000)::int             AS customer_id  -- 10k distinct customers
     , (random() * random() * 100000)::int AS total     
     , 'note: ' || repeat('x', (random()^2 * random() * random() * 500)::int)
FROM   generate_series(1,200000) g;

ALTER TABLE purchases ADD CONSTRAINT purchases_id_pkey PRIMARY KEY (id);

DELETE FROM purchases WHERE random() > 0.9;  -- some dead rows

INSERT INTO purchases (customer_id, total, some_column)
SELECT (random() * 10000)::int             AS customer_id  -- 10k customers
     , (random() * random() * 100000)::int AS total     
     , 'note: ' || repeat('x', (random()^2 * random() * random() * 500)::int)
FROM   generate_series(1,20000) g;  -- add 20k to make it ~ 200k

CREATE INDEX purchases_3c_idx ON purchases (customer_id, total DESC, id);

VACUUM ANALYZE purchases;

customer table - used for the optimized query:

CREATE TABLE customer AS
SELECT customer_id, 'customer_' || customer_id AS customer
FROM   purchases
GROUP  BY 1
ORDER  BY 1;

ALTER TABLE customer ADD CONSTRAINT customer_customer_id_pkey PRIMARY KEY (customer_id);

VACUUM ANALYZE customer;

In my second test for 9.5 I used the same setup, but with 100000 distinct customer_id to get few rows per customer_id.

Object sizes for table purchases

Basic setup: 200k rows in purchases, 10k distinct customer_id, avg. 20 rows per customer.
For Postgres 9.5 I added a 2nd test with 86446 distinct customers - avg. 2.3 rows per customer.

Generated with a query taken from here:

Gathered for Postgres 9.5:

               what                | bytes/ct | bytes_pretty | bytes_per_row
-----------------------------------+----------+--------------+---------------
 core_relation_size                | 20496384 | 20 MB        |           102
 visibility_map                    |        0 | 0 bytes      |             0
 free_space_map                    |    24576 | 24 kB        |             0
 table_size_incl_toast             | 20529152 | 20 MB        |           102
 indexes_size                      | 10977280 | 10 MB        |            54
 total_size_incl_toast_and_indexes | 31506432 | 30 MB        |           157
 live_rows_in_text_representation  | 13729802 | 13 MB        |            68
 ------------------------------    |          |              |
 row_count                         |   200045 |              |
 live_tuples                       |   200045 |              |
 dead_tuples                       |    19955 |              |

Queries

1. row_number() in CTE (see other answer)

WITH cte AS (
   SELECT id, customer_id, total
        , row_number() OVER (PARTITION BY customer_id ORDER BY total DESC) AS rn
   FROM   purchases
   )
SELECT id, customer_id, total
FROM   cte
WHERE  rn = 1;

2. row_number() in subquery (my optimization)

SELECT id, customer_id, total
FROM   (
   SELECT id, customer_id, total
        , row_number() OVER (PARTITION BY customer_id ORDER BY total DESC) AS rn
   FROM   purchases
   ) sub
WHERE  rn = 1;

3. DISTINCT ON (see other answer)

SELECT DISTINCT ON (customer_id)
       id, customer_id, total
FROM   purchases
ORDER  BY customer_id, total DESC, id;

4. rCTE with LATERAL subquery (see here)

WITH RECURSIVE cte AS (
   (  -- parentheses required
   SELECT id, customer_id, total
   FROM   purchases
   ORDER  BY customer_id, total DESC
   LIMIT  1
   )
   UNION ALL
   SELECT u.*
   FROM   cte c
   ,      LATERAL (
      SELECT id, customer_id, total
      FROM   purchases
      WHERE  customer_id > c.customer_id  -- lateral reference
      ORDER  BY customer_id, total DESC
      LIMIT  1
      ) u
   )
SELECT id, customer_id, total
FROM   cte
ORDER  BY customer_id;

5. customer table with LATERAL (see here)

SELECT l.*
FROM   customer c
,      LATERAL (
   SELECT id, customer_id, total
   FROM   purchases
   WHERE  customer_id = c.customer_id  -- lateral reference
   ORDER  BY total DESC
   LIMIT  1
   ) l;

6. array_agg() with ORDER BY (see other answer)

SELECT (array_agg(id ORDER BY total DESC))[1] AS id
     , customer_id
     , max(total) AS total
FROM   purchases
GROUP  BY customer_id;

Results

Execution time for the above queries with EXPLAIN (ANALYZE, TIMING OFF, COSTS OFF), best of 5 runs, to compare with warm cache.

All queries used an Index Only Scan on purchases2_3c_idx (among other steps). Some only to benefit from the smaller size of the index, others more effectively.

A. Postgres 9.4 with 200k rows and ~ 20 per customer_id

1. 273.274 ms  
2. 194.572 ms  
3. 111.067 ms  
4.  92.922 ms  -- !
5.  37.679 ms  -- winner
6. 189.495 ms

B. Same as A., with Postgres 9.5

1. 288.006 ms
2. 223.032 ms  
3. 107.074 ms  
4.  78.032 ms  -- !
5.  33.944 ms  -- winner
6. 211.540 ms  

C. Same as B., but with ~ 2.3 rows per customer_id

1. 381.573 ms
2. 311.976 ms
3. 124.074 ms  -- winner
4. 710.631 ms
5. 311.976 ms
6. 421.679 ms

Retest with Postgres 13 on 2021-08-11

Simplified test setup: no deleted rows, because VACUUM ANALYZE cleans the table completely for the simple case.

Important changes for Postgres:

  • General performance improvements.
  • CTEs can be inlined since Postgres 12, so queries 1. and 2. now perform mostly identically (same query plan).

D. Like B., ~ 20 rows per customer_id

1. 103 ms
2. 103 ms  
3.  23 ms  -- winner  
4.  71 ms  
5.  22 ms  -- winner
6.  81 ms  

db<>fiddle here

E. Like C., ~ 2.3 rows per customer_id

1. 127 ms
2. 126 ms  
3.  36 ms  -- winner  
4. 620 ms  
5. 145 ms
6. 203 ms  

db<>fiddle here

Accented tests with Postgres 13

1M rows, 10,000 vs. 100 vs. 1.6 rows per customer.

F. With ~ 10,000 rows per customer

1. 526 ms
2. 527 ms  
3. 127 ms
4.   2 ms  -- winner !
5.   1 ms  -- winner !
6. 356 ms  

db<>fiddle here

G. With ~ 100 rows per customer

1. 535 ms
2. 529 ms  
3. 132 ms
4. 108 ms  -- !
5.  71 ms  -- winner
6. 376 ms  

db<>fiddle here

H. With ~ 1.6 rows per customer

1.  691 ms
2.  684 ms  
3.  234 ms  -- winner
4. 4669 ms
5. 1089 ms
6. 1264 ms  

db<>fiddle here

Conclusions

  • DISTINCT ON uses the index effectively and typically performs best for few rows per group. And it performs decently even with many rows per group.

  • For many rows per group, emulating an index skip scan with an rCTE performs best - second only to the query technique with a separate lookup table (if that's available).

  • The row_number() technique demonstrated in the currently accepted answer never wins any performance test. Not then, not now. It never comes even close to DISTINCT ON, not even when the data distribution is unfavorable for the latter. The only good thing about row_number(): it does not scale terribly, just mediocre.

More benchmarks

Benchmark by "ogr" with 10M rows and 60k unique "customers" on Postgres 11.5. Results are in line with what we have seen so far:

Original (outdated) benchmark from 2011

I ran three tests with PostgreSQL 9.1 on a real-life table of 65579 rows and single-column btree indexes on each of the three columns involved, and took the best execution time of 5 runs.
Comparing @OMGPonies' first query (A) to the above DISTINCT ON solution (B):

  1. Select the whole table, resulting in 5958 rows in this case.
A: 567.218 ms
B: 386.673 ms
  2. Use the condition WHERE customer BETWEEN x AND y, resulting in 1000 rows.
A: 249.136 ms
B:  55.111 ms
  3. Select a single customer with WHERE customer = x.
A:   0.143 ms
B:   0.072 ms

The same test repeated with the index described in the other answer:

CREATE INDEX purchases_3c_idx ON purchases (customer, total DESC, id);

1A: 277.953 ms  
1B: 193.547 ms

2A: 249.796 ms -- special index not used  
2B:  28.679 ms

3A:   0.120 ms  
3B:   0.048 ms

This is a common problem, which already has well-tested and highly optimized solutions. Personally I prefer the left join solution by Bill Karwin (the original post with lots of other solutions).

Note that a bunch of solutions to this common problem can surprisingly be found in one of the most official sources, the MySQL manual! See Examples of Common Queries :: The Rows Holding the Group-wise Maximum of a Certain Column.

In Postgres you can use array_agg like this:

SELECT  customer,
        (array_agg(id ORDER BY total DESC))[1],
        max(total)
FROM purchases
GROUP BY customer

This will give you the id of each customer's largest purchase.

Some things to note:

  • array_agg is an aggregate function, so it works with GROUP BY.
  • array_agg lets you specify an ordering scoped to just itself, so it doesn't constrain the structure of the whole query. There is also syntax for how you sort NULLs, if you need to do something different from the default.
  • Once we build the array, we take the first element. (Postgres arrays are 1-indexed, not 0-indexed.)
  • You could use array_agg in a similar way for your third output column, but max(total) is simpler.
  • Unlike DISTINCT ON, using array_agg lets you keep your GROUP BY, in case you want that for other reasons.

As pointed out by Erwin, this solution is not very efficient because of the presence of subqueries:

select * from purchases p1 where total in
(select max(total) from purchases where p1.customer=customer) order by total desc;

The query:

SELECT purchases.*
FROM purchases
LEFT JOIN purchases as p 
ON 
  p.customer = purchases.customer 
  AND 
  purchases.total < p.total
WHERE p.total IS NULL

HOW DOES THAT WORK? (I've been there.)

We want to make sure that we only have the highest total for each purchase.


Some theoretical stuff (skip this part if you only want to understand the query)

Let total be a function T(customer, id) that returns the total for a given customer and id. To prove that a given total T(customer, id) is the highest, we want to prove either:

  • ∀x T(customer, id) > T(customer, x) (this total is higher than all other totals for that customer)

OR

  • ¬∃x T(customer, id) < T(customer, x) (there exists no higher total for that customer)

The first approach would require us to fetch all the records for that name, which I do not really like.

The second one needs a smart way to say there can be no record higher than this one.


Back to SQL

If we left join the table to itself on the customer name, with the original row's total less than the joined row's:

LEFT JOIN purchases as p 
ON 
p.customer = purchases.customer 
AND 
purchases.total < p.total

we make sure that every record that has another record with a higher total for the same customer gets joined:

+--------------+---------------------+-----------------+------+------------+---------+
| purchases.id |  purchases.customer | purchases.total | p.id | p.customer | p.total |
+--------------+---------------------+-----------------+------+------------+---------+
|            1 | Tom                 |             200 |    2 | Tom        |     300 |
|            2 | Tom                 |             300 |      |            |         |
|            3 | Bob                 |             400 |    4 | Bob        |     500 |
|            4 | Bob                 |             500 |      |            |         |
|            5 | Alice               |             600 |    6 | Alice      |     700 |
|            6 | Alice               |             700 |      |            |         |
+--------------+---------------------+-----------------+------+------------+---------+

That will help us filter for the highest total for each purchase, with no grouping needed:

WHERE p.total IS NULL
    
+--------------+----------------+-----------------+------+--------+---------+
| purchases.id | purchases.name | purchases.total | p.id | p.name | p.total |
+--------------+----------------+-----------------+------+--------+---------+
|            2 | Tom            |             300 |      |        |         |
|            4 | Bob            |             500 |      |        |         |
|            6 | Alice          |             700 |      |        |         |
+--------------+----------------+-----------------+------+--------+---------+

And that's the answer we need.
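The whole walkthrough above can be reproduced with Python's built-in sqlite3 (a sketch; the anti-join technique itself is engine-agnostic):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE purchases (id INTEGER, customer TEXT, total INTEGER);
    INSERT INTO purchases VALUES
      (1,'Tom',200),(2,'Tom',300),(3,'Bob',400),
      (4,'Bob',500),(5,'Alice',600),(6,'Alice',700);
""")
top = conn.execute("""
    SELECT purchases.*
    FROM   purchases
    LEFT   JOIN purchases AS p
           ON  p.customer = purchases.customer
           AND purchases.total < p.total
    WHERE  p.total IS NULL     -- no other row beat this one
    ORDER  BY purchases.id;
""").fetchall()
print(top)  # [(2, 'Tom', 300), (4, 'Bob', 500), (6, 'Alice', 700)]
```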

I use this way (PostgreSQL only): https://wiki.postgresql.org/wiki/First/last_%28aggregate%29

-- Create a function that always returns the first non-NULL item
CREATE OR REPLACE FUNCTION public.first_agg ( anyelement, anyelement )
RETURNS anyelement LANGUAGE sql IMMUTABLE STRICT AS $$
        SELECT $1;
$$;

-- And then wrap an aggregate around it
CREATE AGGREGATE public.first (
        sfunc    = public.first_agg,
        basetype = anyelement,
        stype    = anyelement
);

-- Create a function that always returns the last non-NULL item
CREATE OR REPLACE FUNCTION public.last_agg ( anyelement, anyelement )
RETURNS anyelement LANGUAGE sql IMMUTABLE STRICT AS $$
        SELECT $2;
$$;

-- And then wrap an aggregate around it
CREATE AGGREGATE public.last (
        sfunc    = public.last_agg,
        basetype = anyelement,
        stype    = anyelement
);

Then your example should work almost as is:

SELECT FIRST(id), customer, FIRST(total)
FROM  purchases
GROUP BY customer
ORDER BY FIRST(total) DESC;

CAVEAT: It ignores NULL rows.
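SQLite has no CREATE AGGREGATE statement, but Python's sqlite3 can register a custom aggregate, so the mechanism can be sketched without the Postgres setup above. Note that SQLite does not guarantee the order in which rows reach an aggregate after GROUP BY, so treat this as an illustration of the idea, not a production recipe:

```python
import sqlite3

class First:
    """Aggregate that keeps the first value it sees."""
    def __init__(self):
        self.value, self.seen = None, False
    def step(self, v):
        if not self.seen:
            self.value, self.seen = v, True
    def finalize(self):
        return self.value

conn = sqlite3.connect(":memory:")
conn.create_aggregate("first", 1, First)
conn.executescript("""
    CREATE TABLE purchases (id INTEGER, customer TEXT, total INTEGER);
    INSERT INTO purchases VALUES (1,'Joe',5),(2,'Sally',3),(3,'Joe',2),(4,'Sally',1);
""")
rows = conn.execute("""
    SELECT first(id), customer, first(total)
    FROM  (SELECT * FROM purchases ORDER BY customer, total DESC)
    GROUP BY customer;
""").fetchall()
print(rows)  # one row per customer; row order into first() is not guaranteed
```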


Edit 1 - Use the Postgres extension instead

Now I use this way: http://pgxn.org/dist/first_last_agg/

To install on Ubuntu 14.04:

apt-get install postgresql-server-dev-9.3 git build-essential -y
git clone git://github.com/wulczer/first_last_agg.git
cd first_last_agg
make && sudo make install
psql -c 'create extension first_last_agg'

It's a Postgres extension that gives you first and last functions; apparently faster than the above way.


Edit 2 - Ordering and filtering

If you use aggregate functions (like these), you can order the results without needing the data to be ordered already:

http://www.postgresql.org/docs/current/static/sql-expressions.html#SYNTAX-AGGREGATES

So the equivalent example, with ordering, would be something like:

SELECT first(id order by id), customer, first(total order by id)
  FROM purchases
 GROUP BY customer
 ORDER BY first(total);

Of course you can order and filter as you deem fit within the aggregate; it's very powerful syntax.

Use the ARRAY_AGG function in PostgreSQL, U-SQL, IBM DB2, and Google BigQuery SQL:

SELECT customer, (ARRAY_AGG(id ORDER BY total DESC))[1], MAX(total)
FROM purchases
GROUP BY customer

In SQL Server you can do this:

SELECT *
FROM  (
    SELECT ROW_NUMBER() OVER (PARTITION BY customer
                              ORDER BY total DESC) AS StRank, *
    FROM   Purchases) n
WHERE StRank = 1

Explanation: the rows are partitioned by customer and ordered by total descending; each row within a partition gets a sequential number, StRank, and we keep only the rows whose StRank is 1.

Very fast solution

SELECT a.* 
FROM
    purchases a 
    JOIN ( 
        SELECT customer, min( id ) as id 
        FROM purchases 
        GROUP BY customer 
    ) b USING ( id );

and really very fast if the table is indexed by id:

create index purchases_id on purchases (id);

Snowflake/Teradata support the QUALIFY clause, which works like HAVING for window functions:

SELECT id, customer, total
FROM PURCHASES p
QUALIFY ROW_NUMBER() OVER(PARTITION BY p.customer ORDER BY p.total DESC) = 1

In PostgreSQL, another possibility is to use the first_value window function in combination with SELECT DISTINCT:

select distinct customer_id,
                first_value(row(id, total)) over(partition by customer_id order by total desc, id)
from            purchases;

I created a composite (id, total), so both values are returned by the same aggregate. You can of course always apply first_value() twice.
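Without Postgres's composite row() type, the two-call variant can be sketched on SQLite (3.28+ for the named WINDOW clause) through Python's sqlite3:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE purchases (id INTEGER, customer TEXT, total INTEGER);
    INSERT INTO purchases VALUES (1,'Joe',5),(2,'Sally',3),(3,'Joe',2),(4,'Sally',1);
""")
rows = conn.execute("""
    SELECT DISTINCT customer
         , first_value(id)    OVER w AS id
         , first_value(total) OVER w AS total
    FROM   purchases
    WINDOW w AS (PARTITION BY customer ORDER BY total DESC, id)
    ORDER  BY customer;
""").fetchall()
print(rows)  # [('Joe', 1, 5), ('Sally', 2, 3)]
```

Every row in a partition gets the same first_value() results, so DISTINCT collapses each partition to one row.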

This way it works for me:

SELECT article, dealer, price
FROM   shop s1
WHERE  price=(SELECT MAX(s2.price)
              FROM shop s2
              WHERE s1.article = s2.article
              GROUP BY s2.article)
ORDER BY article;

Select the highest price for each article.

This is how we can achieve this by using a window function:

    create table purchases (id int4, customer varchar(10), total integer);
    insert into purchases values (1, 'Joe', 5);
    insert into purchases values (2, 'Sally', 3);
    insert into purchases values (3, 'Joe', 2);
    insert into purchases values (4, 'Sally', 1);
    
    select ID, CUSTOMER, TOTAL from (
    select ID, CUSTOMER, TOTAL,
    row_number () over (partition by CUSTOMER order by TOTAL desc) RN
    from purchases) A where RN = 1;


The accepted OMG Ponies' "Supported by any database" solution has good speed in my test.

Here I provide a same-approach, but more complete and clean, any-database solution. Ties are considered (assuming the desire to get only one row for each customer, even if there are multiple records for the max total per customer), and other purchase fields (e.g. purchase_payment_id) will be selected for the real matching rows in the purchase table.

Supported by any database:

select * from purchase
join (
    select min(id) as id from purchase
    join (
        select customer, max(total) as total from purchase
        group by customer
    ) t1 using (customer, total)
    group by customer
) t2 using (id)
order by customer

This query is reasonably fast, especially when there is a composite index like (customer, total) on the purchase table.

Remarks:

  1. t1, t2 are subquery aliases which could be removed, depending on the database.

  2. Caveat: the using (...) clause is currently not supported in MS SQL and Oracle as of this edit in Jan 2017. You have to expand it yourself, e.g. to on t2.id = purchase.id etc. The USING syntax works in SQLite, MySQL and PostgreSQL.

  • If you want to select an arbitrary row (by some specific condition of yours) from the set of aggregated rows.

  • If you want to use another aggregation function (sum/avg) in addition to max/min. In that case you cannot use the trick with DISTINCT ON.

Then you can use the following subquery:

SELECT
    (
       SELECT id FROM t2
       WHERE id = ANY ( ARRAY_AGG( tf.id ) ) AND amount = MAX( tf.amount )
    ) id,
    name,
    MAX(amount) ma,
    SUM( ratio )
FROM t2 tf
GROUP BY name

You can replace amount = MAX( tf.amount ) with any condition you want, with one restriction: the subquery must not return more than one row.

But if you want to do such things, you are probably looking for window functions.

For SQL Server the most efficient way is:

with
ids as ( --condition for split table into groups
    select i from (values (9),(12),(17),(18),(19),(20),(22),(21),(23),(10)) as v(i) 
) 
,src as ( 
    select * from yourTable where  <condition> --use this as filter for other conditions
)
,joined as (
    select tops.* from ids 
    cross apply -- it's like a for-each over the rows
    (
        select top(1) * 
        from src
        where CommodityId = ids.i 
    ) as tops
)
select * from joined

and don't forget to create a clustered index for the columns used

My approach via window functions (dbfiddle):

  1. Assign row_number within each group: row_number() over (partition by agreement_id, order_id) as nrow
  2. Take only the first row of each group: filter (where nrow = 1)
with intermediate as (select 
 *,
 row_number() over ( partition by agreement_id, order_id ) as nrow,
 (sum( suma ) over ( partition by agreement_id, order_id ))::numeric( 10, 2) as order_suma
from <your table>)

select 
  *,
  sum( order_suma ) filter (where nrow = 1) over (partition by agreement_id)
from intermediate

This can be achieved easily by using the MAX function on total, with GROUP BY id and customer.

SELECT id, customer, MAX(total) FROM  purchases GROUP BY id, customer
ORDER BY total DESC;
