SQL UNION ALL to eliminate duplicates

Question

I found this sample interview question and answer posted on Toptal, reproduced here, but I don't really understand the code. How can a UNION ALL turn into a UNION (distinct) like that? Also, why is this code faster?

QUESTION

Write a SQL query using UNION ALL (not UNION) that uses the WHERE clause to eliminate duplicates. Why might you want to do this?

ANSWER

You can avoid duplicates using UNION ALL and still run much faster than UNION DISTINCT (which is actually the same as UNION) by running a query like this:

SELECT * FROM mytable WHERE a=X
UNION ALL
SELECT * FROM mytable WHERE b=Y AND a!=X

The key is the AND a!=X part. This gives you the benefits of the UNION (aka, UNION DISTINCT) command, while avoiding much of its performance hit.

Answer 1

But in the example, the first query has a condition on column a, whereas the second query has a condition on column b. This probably came from a query that's hard to optimize:

SELECT * FROM mytable WHERE a=X OR b=Y

This query is hard to optimize with simple B-tree indexing. Does the engine search an index on column a? Or on column b? Either way, searching the other term requires a table scan.

Hence the trick of using UNION to split the search into two queries, one per term. Each subquery can use the best index for its own search term. Then combine the results using UNION.
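To see the per-term index choice concretely, here is a minimal sketch using Python's built-in sqlite3 module. The table layout and the index names idx_a and idx_b are assumptions made for illustration, and the literals 1 and 2 stand in for X and Y. The exact plan text varies between engines and versions, so treat the printed output as indicative only.

```python
import sqlite3

# In-memory database with one single-column index per search term.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE mytable (id INTEGER PRIMARY KEY, a INTEGER, b INTEGER);
    CREATE INDEX idx_a ON mytable (a);
    CREATE INDEX idx_b ON mytable (b);
""")

def plan(sql):
    """Return the human-readable 'detail' column of EXPLAIN QUERY PLAN."""
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]

# Each half of the split query is planned on its own.
plan_a = plan("SELECT * FROM mytable WHERE a = 1")
plan_b = plan("SELECT * FROM mytable WHERE b = 2")

print(plan_a)
print(plan_b)
```

Each plan should mention the index that matches its own term. Some engines can also combine two indexes for a single OR query, so it is worth checking the plan of the original query before rewriting it.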

But the two subsets may overlap, because some rows where b=Y may also have a=X, in which case such rows occur in both subsets. Therefore you have to do duplicate elimination, or else see some rows twice in the final result.

SELECT * FROM mytable WHERE a=X 
UNION DISTINCT
SELECT * FROM mytable WHERE b=Y
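The overlap is easy to reproduce. The sketch below uses Python's sqlite3 module with a made-up four-row table, where 1 and 2 stand in for X and Y. SQLite spells the operator UNION rather than UNION DISTINCT, but the meaning is the same.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE mytable (id INTEGER PRIMARY KEY, a INTEGER, b INTEGER);
    INSERT INTO mytable (id, a, b) VALUES
        (1, 1, 2),  -- matches a=1 and b=2: falls in both subsets
        (2, 1, 9),  -- matches a=1 only
        (3, 7, 2),  -- matches b=2 only
        (4, 7, 9);  -- matches neither
""")

# Plain UNION ALL concatenates the two subsets, overlap included.
union_all = conn.execute("""
    SELECT * FROM mytable WHERE a = 1
    UNION ALL
    SELECT * FROM mytable WHERE b = 2
""").fetchall()

# UNION removes the repeated row, at the cost of a deduplication step.
union_distinct = conn.execute("""
    SELECT * FROM mytable WHERE a = 1
    UNION
    SELECT * FROM mytable WHERE b = 2
""").fetchall()

print(sorted(union_all))       # row 1 appears twice
print(sorted(union_distinct))  # row 1 appears once
```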

UNION DISTINCT is expensive because typical implementations sort the rows to find duplicates, just as they do for SELECT DISTINCT.

It also feels like even more "wasted" work when the two subsets of rows you are unioning have many rows in common. That's a lot of rows to eliminate.

But there's no need to eliminate duplicates if you can guarantee that the two sets of rows are already distinct, that is, if you can guarantee there is no overlap. If you can rely on that, then eliminating duplicates would always be a no-op, so the query can skip that step and with it the costly sorting.

If you change the queries so that they are guaranteed to select non-overlapping subsets of rows, that's a win.

SELECT * FROM mytable WHERE a=X 
UNION ALL 
SELECT * FROM mytable WHERE b=Y AND a!=X

These two sets are guaranteed to have no overlap. If the first set has only rows where a=X and the second set has only rows where a!=X, then no row can be in both sets.

The second query therefore only catches some of the rows where b=Y, but any row where a=X AND b=Y is already included in the first set.
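As a quick check of that reasoning, the following sketch (Python's sqlite3 module, a made-up table with a primary key and NOT NULL columns, 1 and 2 standing in for X and Y) compares the rewritten query against the original OR query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE mytable (
        id INTEGER PRIMARY KEY,
        a  INTEGER NOT NULL,
        b  INTEGER NOT NULL
    );
    INSERT INTO mytable (id, a, b) VALUES
        (1, 1, 2), (2, 1, 9), (3, 7, 2), (4, 7, 9);
""")

# The original single query with two OR terms.
or_rows = conn.execute(
    "SELECT * FROM mytable WHERE a = 1 OR b = 2"
).fetchall()

# The rewrite: the second branch excludes rows the first already returned.
union_all_rows = conn.execute("""
    SELECT * FROM mytable WHERE a = 1
    UNION ALL
    SELECT * FROM mytable WHERE b = 2 AND a != 1
""").fetchall()

print(sorted(or_rows))         # [(1, 1, 2), (2, 1, 9), (3, 7, 2)]
print(sorted(union_all_rows))  # [(1, 1, 2), (2, 1, 9), (3, 7, 2)]
```

Row 1 satisfies both terms but is returned only by the first branch, so no deduplication step is needed.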

So the query achieves an optimized search for two OR terms, without producing duplicates and without requiring a UNION DISTINCT operation.
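One caveat the answer does not spell out: the rewrite matches the OR query only if column a is never NULL. In SQL, a!=X evaluates to NULL (not true) when a is NULL, so a row with a NULL and b=Y is dropped by the second branch even though the OR query returns it. A minimal sketch with Python's sqlite3 module, using a made-up table and 1 and 2 in place of X and Y:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE mytable (id INTEGER PRIMARY KEY, a INTEGER, b INTEGER);
    INSERT INTO mytable (id, a, b) VALUES
        (1, 1, 2), (2, 7, 2), (3, NULL, 2);
""")

def ids(sql):
    """Run a query that selects id and return the ids in sorted order."""
    return sorted(row[0] for row in conn.execute(sql))

expected = ids("SELECT id FROM mytable WHERE a = 1 OR b = 2")

# The rewrite as given: silently loses the row where a IS NULL.
naive = ids("""
    SELECT id FROM mytable WHERE a = 1
    UNION ALL
    SELECT id FROM mytable WHERE b = 2 AND a != 1
""")

# A NULL-safe variant of the extra filter.
null_safe = ids("""
    SELECT id FROM mytable WHERE a = 1
    UNION ALL
    SELECT id FROM mytable WHERE b = 2 AND (a != 1 OR a IS NULL)
""")

print(expected)   # [1, 2, 3]
print(naive)      # [1, 2]
print(null_safe)  # [1, 2, 3]
```

If a is declared NOT NULL, the simpler a!=X filter is sufficient.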

Answer 2

The approach in the question is correct if the table has a unique identifier (a primary key). Otherwise each SELECT on its own can return many identical rows.
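To illustrate the point, here is a small sketch using Python's sqlite3 module and a hypothetical keyless table that contains two identical rows. Without a key, UNION collapses the identical rows inside a single branch, while the UNION ALL rewrite keeps them, so the two forms no longer return the same result.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE mytable (a INTEGER NOT NULL, b INTEGER NOT NULL);  -- no key
    INSERT INTO mytable (a, b) VALUES (1, 2), (1, 2), (7, 2);
""")

# UNION deduplicates the whole result, including the two identical rows
# that both come from the first branch.
union_rows = conn.execute("""
    SELECT a, b FROM mytable WHERE a = 1
    UNION
    SELECT a, b FROM mytable WHERE b = 2
""").fetchall()

# UNION ALL with the extra filter only prevents overlap between branches.
union_all_rows = conn.execute("""
    SELECT a, b FROM mytable WHERE a = 1
    UNION ALL
    SELECT a, b FROM mytable WHERE b = 2 AND a != 1
""").fetchall()

print(sorted(union_rows))      # [(1, 2), (7, 2)]
print(sorted(union_all_rows))  # [(1, 2), (1, 2), (7, 2)]
```

Which of the two results is wanted depends on the use case. The UNION ALL form matches what WHERE a=X OR b=Y would return.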

To understand why it can be faster, let's look at how a database executes UNION ALL and UNION.

The first simply concatenates the results of two independent queries. These queries can be processed in parallel and their results passed to the client one after the other.

The second is concatenation plus duplicate removal. To deduplicate the records from the two queries, the database needs to hold all of them in memory or, if memory is not enough, store them in a temporary table and then select the unique ones. This is where performance can degrade. Databases are pretty smart and deduplication algorithms are well developed, but for large result sets it can still be a problem.

UNION ALL plus an additional WHERE condition can be faster if an index is used for the filtering. That is where the performance magic comes from.

Answer 3

I guess this will work:

select col1 From (
  select row_number() over (partition by col1 order by col1) as b, col1
  from (
    select col1 From u1
    union all
    select col1 From u2
  ) a
) x
where x.b = 1

Answer 4

This will also do the same trick:

select * from (
select * from table1
union all 
select * from table2
) a group by 
columns
having count(*) >= 1

or

select * from table1 
union all
select * from table2 b 
where not exists (select 1 from table1 a where a.col1 = b.col1)
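The NOT EXISTS variant can be tried out with the sketch below, which uses Python's sqlite3 module and made-up single-column tables. Note, as an observation rather than something the answer states, that this form only removes rows of table2 that also appear in table1. Duplicates inside either table survive, so the result can differ from a plain UNION.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (col1 INTEGER);
    CREATE TABLE table2 (col1 INTEGER);
    INSERT INTO table1 (col1) VALUES (1), (2), (2);
    INSERT INTO table2 (col1) VALUES (2), (3), (3);
""")

# Keep everything from table1, plus table2 rows that table1 lacks.
not_exists_rows = conn.execute("""
    SELECT col1 FROM table1
    UNION ALL
    SELECT col1 FROM table2 b
    WHERE NOT EXISTS (SELECT 1 FROM table1 a WHERE a.col1 = b.col1)
""").fetchall()

# For comparison: UNION deduplicates the whole combined result.
union_rows = conn.execute("""
    SELECT col1 FROM table1
    UNION
    SELECT col1 FROM table2
""").fetchall()

print(sorted(not_exists_rows))  # [(1,), (2,), (2,), (3,), (3,)]
print(sorted(union_rows))       # [(1,), (2,), (3,)]
```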

Answer 5

The simplest way is like this, especially if you have many columns:

SELECT *
  INTO table2
  FROM table1
  UNION
SELECT *
  FROM table1
  ORDER BY column1

Answer 6

I guess this is right (Oracle):

select distinct * from (
select * from test_a
union all
select * from test_b
);

Answer 1: score 10, accepted, 2017-01-18 22:45:58
Answer 2: score 0, 2017-01-18 21:10:43
Answer 3: score 0, 2017-08-09 13:01:52
Answer 4: score 0, 2020-04-20 23:25:27
Answer 5: score 0, 2021-03-01 16:24:48
Answer 6: score 0, 2021-05-29 15:09:11