Update PostgreSQL table with values from self

Question

I am attempting to update multiple columns on a table with values from another row in the same table:

CREATE TEMP TABLE person (
  pid INT
 ,name VARCHAR(40)
 ,dob DATE
 ,younger_sibling_name VARCHAR(40)
 ,younger_sibling_dob DATE
);

INSERT INTO person (pid, name, dob) VALUES
(1, 'John', '1980-01-05'),
(2, 'Jimmy', '1975-04-25'),
(3, 'Sarah', '2004-02-10'),
(4, 'Frank', '1934-12-12');

The task is to populate younger_sibling_name and younger_sibling_dob with the name and date of birth of the person who is closest to them in age, but not older or the same age.

I can set the younger sibling dob easily because this is the value that determines the record to use with a correlated subquery (I think this is an example of that?):

UPDATE person SET younger_sibling_dob=(
SELECT MAX(dob)
FROM person AS sibling
WHERE sibling.dob < person.dob);

I just can't see any way to get the name.
The real query will run over about 1M records, in groups of 100-500 rows for each MAX selection, so performance is a concern.

EDIT:

After trying many different approaches, I've decided on this one. I think it strikes a good balance: the intermediate result lets me verify the data, the query shows the intent of the logic, and it performs adequately:

WITH sibling AS (
  SELECT person.pid, sibling.dob, sibling.name,
         row_number() OVER (PARTITION BY person.pid
                            ORDER BY sibling.dob DESC) AS age_closeness
  FROM person
  JOIN person AS sibling ON sibling.dob < person.dob
)
UPDATE person
  SET younger_sibling_name = sibling.name
     ,younger_sibling_dob  = sibling.dob
FROM sibling
WHERE person.pid = sibling.pid
   AND sibling.age_closeness = 1;

SELECT * FROM person ORDER BY dob;
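
Because the CTE is an ordinary query, the intermediate result mentioned above can be inspected on its own before anything is updated. A minimal read-only sketch, assuming the sample person table from the question; the aliases candidates, sibling_name and sibling_dob are only illustrative:

```sql
-- Preview which row would be picked for each person, without updating anything.
SELECT pid, name, dob, sibling_name, sibling_dob
FROM  (
   SELECT person.pid, person.name, person.dob
        , sibling.name AS sibling_name
        , sibling.dob  AS sibling_dob
        , row_number() OVER (PARTITION BY person.pid
                             ORDER BY sibling.dob DESC) AS age_closeness
   FROM   person
   JOIN   person AS sibling ON sibling.dob < person.dob
   ) candidates
WHERE  age_closeness = 1   -- keep only the closest match per person
ORDER  BY dob;
```

With the four sample rows this lists Jimmy, John and Sarah. Frank has no row satisfying the join condition, so he does not appear and his columns would stay NULL after the UPDATE.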

Answer 1

Correlated subqueries are infamous for abysmal performance. That doesn't matter much for small tables, but it matters a lot for big ones. Use one of these instead, preferably the second:

Query 1

WITH cte AS (
   SELECT *, dense_rank() OVER (ORDER BY dob) AS drk
   FROM   person
    )
UPDATE person p
SET    younger_sibling_name = y.name
      ,younger_sibling_dob  = y.dob
FROM   cte x
JOIN   (SELECT DISTINCT ON (drk) * FROM cte) y ON y.drk = x.drk + 1
WHERE  x.pid = p.pid;

-> SQLfiddle (with extended test case)

In the CTE cte, use the window function dense_rank() to get a rank without gaps according to the dob for every person.
Join cte to itself, but remove duplicates on dob from the second instance. That way everybody gets exactly one UPDATE. If more than one person shares the same dob, the same one is selected as younger sibling for all persons on the next dob. I do this with:
```
 (SELECT DISTINCT ON (drk) * FROM cte) 
```
Add ORDER BY drk, ... if you want to pick a particular person for every dob.
If no younger person exists, no UPDATE happens and the columns stay NULL.
Indices on dob and pid make this fast.
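
To make the last two points concrete, here is a hedged sketch of the deduplicating subquery with an explicit tie-breaker, plus the two indexes. Picking the lowest pid among people who share a dob is an assumption for illustration, as are the alias ranked and the index names:

```sql
-- Deterministic pick per dob: among people sharing a dob, take the lowest pid.
-- DISTINCT ON keeps the first row of each drk group in ORDER BY order.
SELECT DISTINCT ON (drk) *
FROM  (
   SELECT *, dense_rank() OVER (ORDER BY dob) AS drk
   FROM   person
   ) ranked
ORDER  BY drk, pid;

-- Indexes on the columns used for ranking and for the final join (names are illustrative).
CREATE INDEX person_dob_idx ON person (dob);
CREATE INDEX person_pid_idx ON person (pid);
```

The ORDER BY list has to start with the DISTINCT ON expression; any further columns decide which row survives in each group.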

Query 2

WITH cte AS (
   SELECT dob, min(name) AS name
         ,row_number() OVER (ORDER BY dob) rn
   FROM   person p
   GROUP  BY dob
   )
UPDATE person p
SET    younger_sibling_name = y.name
      ,younger_sibling_dob  = y.dob
FROM   cte x
JOIN   cte y ON y.rn = x.rn + 1
WHERE  x.dob = p.dob;

-> SQLfiddle

This works, because aggregate functions are applied before window functions. And it should be very fast, since both operations agree on the sort order.
Obviates the need for a later DISTINCT like in query 1.
Result is the same as query 1, exactly.
Again, you can add more columns to ORDER BY to pick a particular person for every dob.
Only needs an index on dob to be fast.

Answer 2

1) Finding the MAX() can always be rewritten in terms of NOT EXISTS (...)

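UPDATE person dst
SET younger_sibling_name = src.name
        ,younger_sibling_dob = src.dob
FROM person src
-- src sorts before dst in (dob, pid) order; the parentheses are needed because AND binds tighter than OR
WHERE (src.dob < dst.dob
   OR (src.dob = dst.dob AND src.pid < dst.pid))
-- and no other row nx lies strictly between src and dst
AND NOT EXISTS (
        SELECT * FROM person nx
        WHERE (nx.dob < dst.dob
           OR (nx.dob = dst.dob AND nx.pid < dst.pid))
        AND (nx.dob > src.dob
           OR (nx.dob = src.dob AND nx.pid > src.pid))
        );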
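
The rewrite in 1) rests on a general equivalence: a row holds the maximum exactly when no other qualifying row has a larger value. A minimal read-only sketch of that idea against the sample table, leaving out the pid tie-breaker for brevity (so it assumes dob values are unique, as in the sample data); the column alias prev_dob is illustrative:

```sql
-- Using max(): the latest dob strictly before each person's dob.
SELECT p.pid
     , (SELECT max(s.dob) FROM person s WHERE s.dob < p.dob) AS prev_dob
FROM   person p;

-- Same result via NOT EXISTS: s is the predecessor of p
-- when no row nx lies strictly between s and p.
SELECT p.pid, s.dob AS prev_dob
FROM   person p
LEFT   JOIN person s
       ON  s.dob < p.dob
       AND NOT EXISTS (
              SELECT 1
              FROM   person nx
              WHERE  nx.dob < p.dob
              AND    nx.dob > s.dob
           );
```

If several rows shared a dob, the second query would return one row per tied s, which is why the UPDATE above adds pid to the comparison.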

2) Instead of rank() / row_number(), you could also use a LAG() function over the WINDOW:

UPDATE person dst
SET younger_sibling_name = src.name
        ,younger_sibling_dob = src.dob
FROM    (
        SELECT pid
        , LAG(name) OVER win AS name
        , LAG(dob) OVER win AS dob 
        FROM person
        WINDOW win AS (ORDER BY dob, pid)
        ) src
WHERE src.pid = dst.pid
        ;

Both versions require a self-joined subquery (or CTE) because UPDATE does not allow window functions.

Answer 3

To get the dob and name, you can do:

update person
    set younger_sibling_dob = (select p2.dob
                               from person p2
                               where p2.dob < person.dob
                               order by p2.dob desc
                               limit 1),
       younger_sibling_name = (select p2.name
                               from person p2
                               where p2.dob < person.dob
                               order by p2.dob desc
                               limit 1);

If you have an index on dob, then the query will run faster.
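
The two subqueries can also be folded into one, so that both columns are guaranteed to come from the same row. This is only a sketch: it assumes PostgreSQL 9.5 or later, where SET accepts a parenthesized column list with a sub-SELECT as its source, and the pid tie-breaker is an added assumption that is not part of the answer above:

```sql
-- One correlated subquery fills both columns from the same source row.
UPDATE person
SET   (younger_sibling_dob, younger_sibling_name) = (
         SELECT p2.dob, p2.name
         FROM   person p2
         WHERE  p2.dob < person.dob
         ORDER  BY p2.dob DESC, p2.pid   -- pid breaks ties between equal dobs (assumption)
         LIMIT  1
      );
```

When the subquery finds no row, both columns are set to NULL, which matches the behaviour of the two separate subqueries.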


solution1
4 ACCEPTED 2013-03-20 01:26:21

solution2
2 2013-03-20 17:37:08

solution3
1 2013-03-19 22:56:12
