
(Postgres) SQL: How to supply all missing pairs?

Given a table that contains pairs of 'factors' and an exists flag:

create table pairs (
  factor_1  text,
  factor_2  text,
  exists    boolean
  );

and the following data (separators for readability):

 factor_1 | factor_2 | exists
----------+------------------
foo       | one      | t
foo       | two      | t
-----------------------------
bar       | three    | t
-----------------------------
baz       | four     | t
baz       | five     | t

how can I create a view that will show all possible pairs within the set of given factors:

 factor_1 | factor_2 | exists
----------+------------------
foo       | one      | t
foo       | two      | t
foo       | three    | f
foo       | four     | f
foo       | five     | f
-----------------------------
bar       | one      | f
bar       | two      | f
bar       | three    | t
bar       | four     | f
bar       | five     | f
-----------------------------
baz       | one      | f
baz       | two      | f
baz       | three    | f
baz       | four     | t
baz       | five     | t

I guess it will be possible to define a CTE / view that contains all distinct values of factor_1, another that contains all distinct values of factor_2, then take the cross product and set exists to true for all pairs that are found in table pairs. Is there a more elegant / efficient / idiomatic way of achieving the same?
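For anyone who wants to try the queries below, the sample data can be loaded with a short script along these lines. It is only a sketch of the setup described above; it assumes a scratch database where dropping an existing pairs table is harmless.

```sql
-- Recreate the table from the question (this drops any existing table named pairs).
drop table if exists pairs;

create table pairs (
  factor_1  text,
  factor_2  text,
  exists    boolean
  );

-- Values are listed in column order: factor_1, factor_2, exists.
insert into pairs values
  ( 'foo', 'one',   true ),
  ( 'foo', 'two',   true ),
  ( 'bar', 'three', true ),
  ( 'baz', 'four',  true ),
  ( 'baz', 'five',  true );
```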

EDIT discussion of solutions:

In the short time between asking the question and getting two answers to it, I went and implemented the solution I sketched above. This is what it looks like; it has 3 CTEs and an implicit cross join:

with
  p1 as ( select distinct factor_1 from pairs  ),
  p2 as ( select distinct factor_2 from pairs  ),
  p3 as ( select *                 from p1, p2 )
  select
      p3.factor_1 as factor_1,
      p3.factor_2 as factor_2,
      ( case when p.exists then true else false end ) as exists
    from p3
    left join pairs as p on ( p3.factor_1 = p.factor_1 and p3.factor_2 = p.factor_2 )
    order by p3.factor_1, p3.factor_2;

Now let's compare that to the answers. I do a bit of reformatting and renaming to make all solutions differ only where it matters.

Solution A by Gordon Linoff is quite a bit shorter and makes do without CTEs:

select
    p1.factor_1                 as factor_1,
    p2.factor_2                 as factor_2,
    coalesce( p.exists, false ) as exists
  from        ( select distinct factor_1 from pairs ) as p1
  cross join  ( select distinct factor_2 from pairs ) as p2
  left  join  pairs p
    on p.factor_1 = p1.factor_1 and p.factor_2 = p2.factor_2
    order by p1.factor_1, p2.factor_2;

Solution B by Valli is even a tad shorter; its insight is that it is the combinations from the cross join that should be unique, so the distinct keyword may be factored out to the top select:

select distinct
    p1.factor_1                 as factor_1,
    p2.factor_2                 as factor_2,
    coalesce( p.exists, false ) as exists
  from        pairs as p1
  cross join  pairs as p2
  left  join  pairs as p
    on p1.factor_1 = p.factor_1 and p2.factor_2 = p.factor_2
    order by p1.factor_1, p2.factor_2;

My concern here is that the DB planner has to work harder to keep the cross join from getting inflated by too many repetitive pairs that then get filtered out. So I ran explain analyze on all three solutions (note: I deleted the order by clauses); it turns out the results are somewhat contradictory. My solution with CTEs gets bad marks because of the CTEs. I use them a lot in my SQL because they're so handy, but they are also known to be optimization islands in PostgreSQL (akin to separate views), and it shows.

                                                       QUERY PLAN                                                        
-------------------------------------------------------------------------------------------------------------------------
 Merge Left Join  (cost=4770.47..5085.69 rows=40000 width=65) (actual time=0.167..0.189 rows=15 loops=1)
   Merge Cond: ((v3.factor_1 = p.factor_1) AND (v3.factor_2 = p.factor_2))
   CTE v1
     ->  HashAggregate  (cost=20.88..22.88 rows=200 width=32) (actual time=0.026..0.028 rows=3 loops=1)
           Group Key: pairs.factor_1
           ->  Seq Scan on pairs  (cost=0.00..18.70 rows=870 width=32) (actual time=0.010..0.012 rows=5 loops=1)
   CTE v2
     ->  HashAggregate  (cost=20.88..22.88 rows=200 width=32) (actual time=0.011..0.012 rows=5 loops=1)
           Group Key: pairs_1.factor_2
           ->  Seq Scan on pairs pairs_1  (cost=0.00..18.70 rows=870 width=32) (actual time=0.003..0.005 rows=5 loops=1)
   CTE v3
     ->  Nested Loop  (cost=0.00..806.00 rows=40000 width=64) (actual time=0.044..0.062 rows=15 loops=1)
           ->  CTE Scan on v1  (cost=0.00..4.00 rows=200 width=32) (actual time=0.028..0.030 rows=3 loops=1)
           ->  CTE Scan on v2  (cost=0.00..4.00 rows=200 width=32) (actual time=0.005..0.007 rows=5 loops=3)
   ->  Sort  (cost=3857.54..3957.54 rows=40000 width=64) (actual time=0.118..0.123 rows=15 loops=1)
         Sort Key: v3.factor_1, v3.factor_2
         Sort Method: quicksort  Memory: 25kB
         ->  CTE Scan on v3  (cost=0.00..800.00 rows=40000 width=64) (actual time=0.046..0.074 rows=15 loops=1)
   ->  Sort  (cost=61.18..63.35 rows=870 width=65) (actual time=0.042..0.042 rows=5 loops=1)
         Sort Key: p.factor_1, p.factor_2
         Sort Method: quicksort  Memory: 25kB
         ->  Seq Scan on pairs p  (cost=0.00..18.70 rows=870 width=65) (actual time=0.005..0.008 rows=5 loops=1)
 Planning time: 0.368 ms
 Execution time: 0.421 ms
(24 rows)

Observe there are two sorts in this plan. (The CTEs show up as v1, v2 and v3 here; they correspond to p1, p2 and p3 in the query above.)
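The "optimization island" behaviour applies to the PostgreSQL versions of that time. Since PostgreSQL 12, a non-recursive, side-effect-free CTE that is referenced only once is inlined into the outer query by default, and the NOT MATERIALIZED keyword requests this explicitly. A sketch of the same query with that keyword, assuming PostgreSQL 12 or later (the plan it produces was not measured here):

```sql
-- Assumes PostgreSQL 12+; on older versions "not materialized" is a syntax error.
explain analyze
with
  p1 as not materialized ( select distinct factor_1 from pairs  ),
  p2 as not materialized ( select distinct factor_2 from pairs  ),
  p3 as not materialized ( select *                 from p1, p2 )
  select
      p3.factor_1                 as factor_1,
      p3.factor_2                 as factor_2,
      coalesce( p.exists, false ) as exists
    from p3
    left join pairs as p on ( p3.factor_1 = p.factor_1 and p3.factor_2 = p.factor_2 );
```

With the CTEs inlined, the planner is free to treat them like the subqueries of solution A, so the difference between the two formulations may largely disappear on newer versions.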

Solution A gets a much shorter plan (and a curiously high execution time):

                                                               QUERY PLAN                                                                
-----------------------------------------------------------------------------------------------------------------------------------------
 Hash Right Join  (cost=1580.25..2499.00 rows=40000 width=65) (actual time=1.048..2.197 rows=15 loops=1)
   Hash Cond: ((p.factor_1 = pairs.factor_1) AND (p.factor_2 = pairs_1.factor_2))
   ->  Seq Scan on pairs p  (cost=0.00..18.70 rows=870 width=65) (actual time=0.010..0.015 rows=5 loops=1)
   ->  Hash  (cost=550.25..550.25 rows=40000 width=64) (actual time=0.649..0.649 rows=15 loops=1)
         Buckets: 65536  Batches: 2  Memory Usage: 513kB
         ->  Nested Loop  (cost=41.75..550.25 rows=40000 width=64) (actual time=0.058..0.077 rows=15 loops=1)
               ->  HashAggregate  (cost=20.88..22.88 rows=200 width=32) (actual time=0.033..0.036 rows=3 loops=1)
                     Group Key: pairs.factor_1
                     ->  Seq Scan on pairs  (cost=0.00..18.70 rows=870 width=32) (actual time=0.017..0.018 rows=5 loops=1)
               ->  Materialize  (cost=20.88..25.88 rows=200 width=32) (actual time=0.008..0.011 rows=5 loops=3)
                     ->  HashAggregate  (cost=20.88..22.88 rows=200 width=32) (actual time=0.013..0.016 rows=5 loops=1)
                           Group Key: pairs_1.factor_2
                           ->  Seq Scan on pairs pairs_1  (cost=0.00..18.70 rows=870 width=32) (actual time=0.004..0.006 rows=5 loops=1)
 Planning time: 0.258 ms
 Execution time: 2.342 ms
(15 rows)

Solution B's execution plan is much longer than solution A's, with several implicit sorts:

                                                                QUERY PLAN                                                                
------------------------------------------------------------------------------------------------------------------------------------------
 Unique  (cost=282354.48..289923.48 rows=80000 width=65) (actual time=0.230..0.251 rows=15 loops=1)
   ->  Sort  (cost=282354.48..284246.73 rows=756900 width=65) (actual time=0.229..0.233 rows=25 loops=1)
         Sort Key: p1.factor_1, p2.factor_2, (COALESCE(p."exists", false))
         Sort Method: quicksort  Memory: 26kB
         ->  Merge Left Join  (cost=140389.32..146354.17 rows=756900 width=65) (actual time=0.122..0.157 rows=25 loops=1)
               Merge Cond: ((p1.factor_1 = p.factor_1) AND (p2.factor_2 = p.factor_2))
               ->  Sort  (cost=140328.14..142220.39 rows=756900 width=64) (actual time=0.095..0.100 rows=25 loops=1)
                     Sort Key: p1.factor_1, p2.factor_2
                     Sort Method: quicksort  Memory: 26kB
                     ->  Nested Loop  (cost=0.00..9500.83 rows=756900 width=64) (actual time=0.027..0.043 rows=25 loops=1)
                           ->  Seq Scan on pairs p1  (cost=0.00..18.70 rows=870 width=32) (actual time=0.010..0.011 rows=5 loops=1)
                           ->  Materialize  (cost=0.00..23.05 rows=870 width=32) (actual time=0.003..0.005 rows=5 loops=5)
                                 ->  Seq Scan on pairs p2  (cost=0.00..18.70 rows=870 width=32) (actual time=0.005..0.008 rows=5 loops=1)
               ->  Sort  (cost=61.18..63.35 rows=870 width=65) (actual time=0.021..0.023 rows=8 loops=1)
                     Sort Key: p.factor_1, p.factor_2
                     Sort Method: quicksort  Memory: 25kB
                     ->  Seq Scan on pairs p  (cost=0.00..18.70 rows=870 width=65) (actual time=0.004..0.004 rows=5 loops=1)
 Planning time: 0.260 ms
 Execution time: 0.333 ms
(19 rows)
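The inflation of the cross join is visible in the plan above: the Nested Loop produces 25 rows, which the final Unique step reduces to 15. A small query along the following lines makes the same comparison directly; the column aliases are made up for illustration.

```sql
-- Compare the size of the cross join before any de-duplication.
select
    -- solution B: every row of pairs combined with every row of pairs
    ( select count(*)
        from       pairs as p1
        cross join pairs as p2 )                                    as rows_cross_all,
    -- solution A: distinct factor_1 values combined with distinct factor_2 values
    ( select count(*)
        from       ( select distinct factor_1 from pairs ) as f1
        cross join ( select distinct factor_2 from pairs ) as f2 )  as rows_cross_distinct;

-- With the five sample rows this yields 25 and 15.
```

For a table with n rows, the first count grows as n squared regardless of how many distinct values there are, while the second depends only on the number of distinct factors.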

I think we can forget about execution times with this short sample without indexes; only with real data will we be able to tell those for sure.

Based on these results, I prefer solution A by Gordon Linoff: its SQL is rather short and its execution plan is the tersest of the three. I am a bit wary of the opportunities for bad performance in solution B's execution plan. My guess is also that, while it is elegant to factor out the distinct clause to the uppermost level, it is not the most precise way of expressing the intent: I do not want to do a cross join and filter for unique pairs, I want to do a cross join on unique values. Needless to say, if the execution time relation (A: 2.3 ms / B: 0.3 ms) should persist with realistic amounts of data, that would reverse my decision.
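One way to get closer to realistic timings is to run the same explain analyze against a larger, indexed table. The following is a rough sketch only: the table name pairs_big, the value prefixes and the sizes are arbitrary assumptions, not part of the original setup. The row estimates of 870 for a five-row table in the plans above suggest that the planner had no statistics for pairs, so running analyze first seems worthwhile.

```sql
-- Hypothetical test table: up to 20000 distinct random pairs
-- drawn from 300 x 300 possible factor values.
create table pairs_big as
  select distinct
      'a' || floor( random() * 300 )::int::text  as factor_1,
      'b' || floor( random() * 300 )::int::text  as factor_2,
      true                                       as exists
    from generate_series( 1, 20000 );

create index on pairs_big ( factor_1, factor_2 );

-- Collect statistics so the planner's row estimates reflect the data.
analyze pairs_big;

-- Solution A against the larger table; swap in the other queries to compare.
explain analyze
select
    p1.factor_1                 as factor_1,
    p2.factor_2                 as factor_2,
    coalesce( p.exists, false ) as exists
  from        ( select distinct factor_1 from pairs_big ) as p1
  cross join  ( select distinct factor_2 from pairs_big ) as p2
  left  join  pairs_big as p
    on p.factor_1 = p1.factor_1 and p.factor_2 = p2.factor_2;
```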

Use a cross join to get the rows and a left join to get the boolean expression:

select f1.factor_1, f2.factor_2, coalesce(p.exists, false) as exists
from (select distinct factor_1 from pairs) f1 cross join
     (select distinct factor_2 from pairs) f2 left join
     pairs p
     on p.factor_1 = f1.factor_1 and p.factor_2 = f2.factor_2;

Note: Although Postgres accepts exists as a column alias, I think it is a bad name because it conflicts with a SQL keyword.
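Since the question asks for a view, the query can be wrapped accordingly. This is a minimal sketch; the view name all_pairs and the output column name pair_exists are placeholders chosen here to avoid the keyword.

```sql
-- Hypothetical view name and column alias; the query itself is the one above.
create view all_pairs as
  select
      f1.factor_1,
      f2.factor_2,
      coalesce( p.exists, false ) as pair_exists
    from        ( select distinct factor_1 from pairs ) as f1
    cross join  ( select distinct factor_2 from pairs ) as f2
    left  join  pairs as p
      on p.factor_1 = f1.factor_1 and p.factor_2 = f2.factor_2;

-- Usage: the ordering belongs in the query on the view, not in the view itself.
select * from all_pairs order by factor_1, factor_2;
```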

We can use distinct at the top instead of filtering for the distinct records in the from clause. Cross join the table with itself and then left join to fetch the exists column:

SELECT distinct p1.factor_1,
                p2.factor_2,
                coalesce(p.exists, false)
  FROM pairs p1 CROSS JOIN
       pairs p2 LEFT JOIN 
       pairs p ON
       p1.factor_1= p.factor_1 and
       p2.factor_2= p.factor_2

You don't need the LEFT JOIN + COALESCE, since EXISTS already yields a boolean value:


SELECT f1.factor_1, f2.factor_2
  , EXISTS ( SELECT * FROM pairs p
            WHERE p.factor_1 = f1.factor_1 AND p.factor_2 = f2.factor_2
            ) AS did_exist
FROM (SELECT DISTINCT factor_1 FROM pairs) f1
CROSS JOIN (SELECT DISTINCT factor_2 FROM pairs) f2 
    ;
