(Postgres) SQL: How to supply all missing pairs?

Given a table that contains pairs of 'factors' and an exists flag:

create table pairs (
  factor_1  text,
  factor_2  text,
  exists    boolean
  );

and the following data (separators added for readability):

 factor_1 | factor_2 | exists
----------+----------+--------
foo       | one      | t
foo       | two      | t
-----------------------------
bar       | three    | t
-----------------------------
baz       | four     | t
baz       | five     | t

how can I create a view that will show all possible pairs within the set of given factors:

 factor_1 | factor_2 | exists
----------+----------+--------
foo       | one      | t
foo       | two      | t
foo       | three    | f
foo       | four     | f
foo       | five     | f
-----------------------------
bar       | one      | f
bar       | two      | f
bar       | three    | t
bar       | four     | f
bar       | five     | f
-----------------------------
baz       | one      | f
baz       | two      | f
baz       | three    | f
baz       | four     | t
baz       | five     | t

I guess it would be possible to define a CTE / view that contains all distinct values of factor_1, another that contains all distinct values of factor_2, then take the cross product and set exists to true for all pairs that are found in table pairs. Is there a more elegant / efficient / idiomatic way of achieving the same?

EDIT: discussion of solutions

In the short time between asking the question and getting two answers, I went and implemented the solution I jotted down above. This is what it looks like; it has 3 CTEs and an implicit cross join:

with
  p1 as ( select distinct factor_1 from pairs  ),
  p2 as ( select distinct factor_2 from pairs  ),
  p3 as ( select *                 from p1, p2 )
  select
      p3.factor_1 as factor_1,
      p3.factor_2 as factor_2,
      ( case when p.exists then true else false end ) as exists
    from p3
    left join pairs as p on ( p3.factor_1 = p.factor_1 and p3.factor_2 = p.factor_2 )
    order by p3.factor_1, p3.factor_2;

Now let's compare that to the answers. I did a bit of reformatting and renaming so that the solutions differ only where it matters.

Solution A by Gordon Linoff is quite a bit shorter and makes do without CTEs:

select
    p1.factor_1                 as factor_1,
    p2.factor_2                 as factor_2,
    coalesce( p.exists, false ) as exists
  from        ( select distinct factor_1 from pairs ) as p1
  cross join  ( select distinct factor_2 from pairs ) as p2
  left  join  pairs p
    on p.factor_1 = p1.factor_1 and p.factor_2 = p2.factor_2
    order by p1.factor_1, p2.factor_2;

Solution B by Valli is even a tad shorter; its insight is that it is the combinations from the cross join that should be unique, so the distinct keyword may be factored out to the top select:

select distinct
    p1.factor_1                 as factor_1,
    p2.factor_2                 as factor_2,
    coalesce( p.exists, false ) as exists
  from        pairs as p1
  cross join  pairs as p2
  left  join  pairs as p
    on p1.factor_1 = p.factor_1 and p2.factor_2 = p.factor_2
    order by p1.factor_1, p2.factor_2;

My concern here is that the DB planner has to work harder to keep the cross join from getting inflated by too many repetitive pairs that then get filtered out. So I ran explain analyze on all three solutions (Note: I deleted the order by clauses); it turns out the results are somewhat contradictory. My solution with CTEs gets bad marks because of the CTEs. I use them a lot in my SQL because they're so handy, but they are also known to be optimization islands in PostgreSQL (akin to separate views), and it shows.

                                                       QUERY PLAN                                                        
-------------------------------------------------------------------------------------------------------------------------
 Merge Left Join  (cost=4770.47..5085.69 rows=40000 width=65) (actual time=0.167..0.189 rows=15 loops=1)
   Merge Cond: ((v3.factor_1 = p.factor_1) AND (v3.factor_2 = p.factor_2))
   CTE v1
     ->  HashAggregate  (cost=20.88..22.88 rows=200 width=32) (actual time=0.026..0.028 rows=3 loops=1)
           Group Key: pairs.factor_1
           ->  Seq Scan on pairs  (cost=0.00..18.70 rows=870 width=32) (actual time=0.010..0.012 rows=5 loops=1)
   CTE v2
     ->  HashAggregate  (cost=20.88..22.88 rows=200 width=32) (actual time=0.011..0.012 rows=5 loops=1)
           Group Key: pairs_1.factor_2
           ->  Seq Scan on pairs pairs_1  (cost=0.00..18.70 rows=870 width=32) (actual time=0.003..0.005 rows=5 loops=1)
   CTE v3
     ->  Nested Loop  (cost=0.00..806.00 rows=40000 width=64) (actual time=0.044..0.062 rows=15 loops=1)
           ->  CTE Scan on v1  (cost=0.00..4.00 rows=200 width=32) (actual time=0.028..0.030 rows=3 loops=1)
           ->  CTE Scan on v2  (cost=0.00..4.00 rows=200 width=32) (actual time=0.005..0.007 rows=5 loops=3)
   ->  Sort  (cost=3857.54..3957.54 rows=40000 width=64) (actual time=0.118..0.123 rows=15 loops=1)
         Sort Key: v3.factor_1, v3.factor_2
         Sort Method: quicksort  Memory: 25kB
         ->  CTE Scan on v3  (cost=0.00..800.00 rows=40000 width=64) (actual time=0.046..0.074 rows=15 loops=1)
   ->  Sort  (cost=61.18..63.35 rows=870 width=65) (actual time=0.042..0.042 rows=5 loops=1)
         Sort Key: p.factor_1, p.factor_2
         Sort Method: quicksort  Memory: 25kB
         ->  Seq Scan on pairs p  (cost=0.00..18.70 rows=870 width=65) (actual time=0.005..0.008 rows=5 loops=1)
 Planning time: 0.368 ms
 Execution time: 0.421 ms
(24 rows)

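Observe that there are two Sort nodes in this plan even with the order by clause removed; both feed the Merge Left Join. (The CTEs appear as v1, v2 and v3 in the plan; they correspond to p1, p2 and p3 in the query above.)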
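A note on the CTE behaviour seen in this plan: every CTE is computed on its own and then read through a CTE Scan node, which is how PostgreSQL treated all WITH queries up to version 11. From PostgreSQL 12 on, a non-recursive, side-effect-free CTE that is referenced only once is folded into the parent query by default, and the MATERIALIZED / NOT MATERIALIZED keywords override that choice. The following is a minimal sketch for comparing the two behaviours, assuming PostgreSQL 12 or later and a fresh session (the temporary pairs table is only there so the example runs on its own). The resulting plans are not reproduced here because they depend on server version and settings.

```sql
-- Assumption: PostgreSQL 12+, fresh session; sample data copied from the question.
create temp table pairs (
  factor_1  text,
  factor_2  text,
  exists    boolean
);

insert into pairs values
  ( 'foo', 'one',   true ),
  ( 'foo', 'two',   true ),
  ( 'bar', 'three', true ),
  ( 'baz', 'four',  true ),
  ( 'baz', 'five',  true );

-- Variant 1: force each CTE to be computed separately (the pre-12 behaviour).
explain
with
  p1 as materialized ( select distinct factor_1 from pairs  ),
  p2 as materialized ( select distinct factor_2 from pairs  ),
  p3 as materialized ( select *                 from p1, p2 )
select
    p3.factor_1,
    p3.factor_2,
    coalesce( p.exists, false ) as exists
  from p3
  left join pairs as p on ( p3.factor_1 = p.factor_1 and p3.factor_2 = p.factor_2 );

-- Variant 2: plain CTEs; each is referenced once, so on 12+ the planner
-- is allowed to inline them and plan the whole query as one unit.
explain
with
  p1 as ( select distinct factor_1 from pairs  ),
  p2 as ( select distinct factor_2 from pairs  ),
  p3 as ( select *                 from p1, p2 )
select
    p3.factor_1,
    p3.factor_2,
    coalesce( p.exists, false ) as exists
  from p3
  left join pairs as p on ( p3.factor_1 = p.factor_1 and p3.factor_2 = p.factor_2 );
```

If the second plan no longer contains CTE Scan nodes, the CTEs were inlined and the "optimization island" concern does not apply to that server version.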

Solution A gets a much shorter plan (and a curiously high execution time):

                                                               QUERY PLAN                                                                
-----------------------------------------------------------------------------------------------------------------------------------------
 Hash Right Join  (cost=1580.25..2499.00 rows=40000 width=65) (actual time=1.048..2.197 rows=15 loops=1)
   Hash Cond: ((p.factor_1 = pairs.factor_1) AND (p.factor_2 = pairs_1.factor_2))
   ->  Seq Scan on pairs p  (cost=0.00..18.70 rows=870 width=65) (actual time=0.010..0.015 rows=5 loops=1)
   ->  Hash  (cost=550.25..550.25 rows=40000 width=64) (actual time=0.649..0.649 rows=15 loops=1)
         Buckets: 65536  Batches: 2  Memory Usage: 513kB
         ->  Nested Loop  (cost=41.75..550.25 rows=40000 width=64) (actual time=0.058..0.077 rows=15 loops=1)
               ->  HashAggregate  (cost=20.88..22.88 rows=200 width=32) (actual time=0.033..0.036 rows=3 loops=1)
                     Group Key: pairs.factor_1
                     ->  Seq Scan on pairs  (cost=0.00..18.70 rows=870 width=32) (actual time=0.017..0.018 rows=5 loops=1)
               ->  Materialize  (cost=20.88..25.88 rows=200 width=32) (actual time=0.008..0.011 rows=5 loops=3)
                     ->  HashAggregate  (cost=20.88..22.88 rows=200 width=32) (actual time=0.013..0.016 rows=5 loops=1)
                           Group Key: pairs_1.factor_2
                           ->  Seq Scan on pairs pairs_1  (cost=0.00..18.70 rows=870 width=32) (actual time=0.004..0.006 rows=5 loops=1)
 Planning time: 0.258 ms
 Execution time: 2.342 ms
(15 rows)

Solution B's execution plan is much longer than solution A's, with several implicit sorts:

                                                                QUERY PLAN                                                                
------------------------------------------------------------------------------------------------------------------------------------------
 Unique  (cost=282354.48..289923.48 rows=80000 width=65) (actual time=0.230..0.251 rows=15 loops=1)
   ->  Sort  (cost=282354.48..284246.73 rows=756900 width=65) (actual time=0.229..0.233 rows=25 loops=1)
         Sort Key: p1.factor_1, p2.factor_2, (COALESCE(p."exists", false))
         Sort Method: quicksort  Memory: 26kB
         ->  Merge Left Join  (cost=140389.32..146354.17 rows=756900 width=65) (actual time=0.122..0.157 rows=25 loops=1)
               Merge Cond: ((p1.factor_1 = p.factor_1) AND (p2.factor_2 = p.factor_2))
               ->  Sort  (cost=140328.14..142220.39 rows=756900 width=64) (actual time=0.095..0.100 rows=25 loops=1)
                     Sort Key: p1.factor_1, p2.factor_2
                     Sort Method: quicksort  Memory: 26kB
                     ->  Nested Loop  (cost=0.00..9500.83 rows=756900 width=64) (actual time=0.027..0.043 rows=25 loops=1)
                           ->  Seq Scan on pairs p1  (cost=0.00..18.70 rows=870 width=32) (actual time=0.010..0.011 rows=5 loops=1)
                           ->  Materialize  (cost=0.00..23.05 rows=870 width=32) (actual time=0.003..0.005 rows=5 loops=5)
                                 ->  Seq Scan on pairs p2  (cost=0.00..18.70 rows=870 width=32) (actual time=0.005..0.008 rows=5 loops=1)
               ->  Sort  (cost=61.18..63.35 rows=870 width=65) (actual time=0.021..0.023 rows=8 loops=1)
                     Sort Key: p.factor_1, p.factor_2
                     Sort Method: quicksort  Memory: 25kB
                     ->  Seq Scan on pairs p  (cost=0.00..18.70 rows=870 width=65) (actual time=0.004..0.004 rows=5 loops=1)
 Planning time: 0.260 ms
 Execution time: 0.333 ms
(19 rows)

I think we can forget about execution times with this short sample without indexes; only with real data will we be able to tell for sure.
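One way to get past the five-row sample is to generate a larger, sparse set of pairs and collect statistics before comparing plans. The sketch below is only an illustration: the table size, the 10% sampling rate, the generated factor names and the index name pairs_f1_f2_idx are my own assumptions, not part of the original test, and no timings are claimed for it. As an aside, the plans above estimate rows=870 for a table that actually holds 5 rows, which suggests the planner was working from default estimates because the table had never been analyzed.

```sql
-- Assumption: fresh session; all sizes and names below are illustrative.
create temp table pairs (
  factor_1  text,
  factor_2  text,
  exists    boolean
);

-- 50 x 100 candidate combinations, of which roughly 10% are kept,
-- so that most pairs are genuinely missing from the table.
insert into pairs
select 'a' || a, 'b' || b, true
  from generate_series( 1,  50 ) as a,
       generate_series( 1, 100 ) as b
 where random() < 0.1;

create index pairs_f1_f2_idx on pairs ( factor_1, factor_2 );

-- Collect statistics so the planner's row estimates reflect the data.
analyze pairs;

-- Solution A: cross join of the distinct values.
explain analyze
select
    p1.factor_1,
    p2.factor_2,
    coalesce( p.exists, false ) as exists
  from        ( select distinct factor_1 from pairs ) as p1
  cross join  ( select distinct factor_2 from pairs ) as p2
  left  join  pairs as p
    on p.factor_1 = p1.factor_1 and p.factor_2 = p2.factor_2;

-- Solution B: cross join of the full table with itself, then distinct.
-- The intermediate result grows with the square of the row count,
-- so keep the table small when trying this variant.
explain analyze
select distinct
    p1.factor_1,
    p2.factor_2,
    coalesce( p.exists, false ) as exists
  from        pairs as p1
  cross join  pairs as p2
  left  join  pairs as p
    on p1.factor_1 = p.factor_1 and p2.factor_2 = p.factor_2;
```

Running each statement several times and varying the row counts should give a better picture than a single measurement on five rows.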

Based on these results, I prefer solution A by Gordon Linoff, because its SQL is rather short and its execution plan is the tersest of the three. I am a bit wary of the opportunities for bad performance in solution B's execution plan. My guess is also that, while it is elegant to factor out the distinct clause to the uppermost level, it is not necessarily the most precise way of expressing the intent: I do not want to do a cross join and then filter for unique pairs, I want to do a cross join on unique values. Needless to say, if the execution-time relation (A: 2.3 ms / B: 0.3 ms) should persist with realistic amounts of data, that would reverse my decision.

Use a cross join to get the rows and a left join to get the boolean expression:

select f1.factor_1, f2.factor_2, coalesce(p.exists, false) as exists
from (select distinct factor_1 from pairs) f1 cross join
     (select distinct factor_2 from pairs) f2 left join
     pairs p
     on p.factor_1 = f1.factor_1 and p.factor_2 = f2.factor_2;

Note: Although Postgres accepts exists as a column alias, I think it is a bad name because it conflicts with a SQL keyword.
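Since the question asks for a view, the query can be wrapped in one. This is a sketch rather than part of the original answer: the view name all_pairs and the column name pair_exists are made up for illustration (the latter chosen to sidestep the keyword issue mentioned in the note), and the temporary table is only there so the example runs on its own in a fresh session.

```sql
-- Assumption: fresh session; sample data copied from the question.
create temp table pairs (
  factor_1  text,
  factor_2  text,
  exists    boolean
);

insert into pairs values
  ( 'foo', 'one',   true ),
  ( 'foo', 'two',   true ),
  ( 'bar', 'three', true ),
  ( 'baz', 'four',  true ),
  ( 'baz', 'five',  true );

-- Hypothetical view name; temporary because it depends on a temporary table.
create temp view all_pairs as
select
    p1.factor_1,
    p2.factor_2,
    coalesce( p.exists, false ) as pair_exists
  from        ( select distinct factor_1 from pairs ) as p1
  cross join  ( select distinct factor_2 from pairs ) as p2
  left  join  pairs as p
    on p.factor_1 = p1.factor_1 and p.factor_2 = p2.factor_2;

-- All combinations, present or not.
select * from all_pairs order by factor_1, factor_2;

-- Only the pairs that are missing from the table.
select factor_1, factor_2
  from all_pairs
 where not pair_exists
 order by factor_1, factor_2;
```

With the sample data the view yields 3 x 5 = 15 rows, 5 of which have pair_exists set to true.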

We can use distinct at the top instead of filtering for distinct records in the from clause. Cross join the tables and then left join to fetch the exists column:

SELECT distinct p1.factor_1,
                p2.factor_2,
                coalesce(p.exists, false)
  FROM pairs p1 CROSS JOIN
       pairs p2 LEFT JOIN 
       pairs p ON
       p1.factor_1= p.factor_1 and
       p2.factor_2= p.factor_2

You don't need the LEFT JOIN + COALESCE, since EXISTS already yields a boolean value:


SELECT f1.factor_1, f2.factor_2
  , EXISTS ( SELECT * FROM pairs p
            WHERE p.factor_1 = f1.factor_1 AND p.factor_2 = f2.factor_2
            ) AS did_exist
FROM (SELECT DISTINCT factor_1 FROM pairs) f1
CROSS JOIN (SELECT DISTINCT factor_2 FROM pairs) f2 
    ;
