I am attempting to update multiple columns on a table with values from another row in the same table:
CREATE TEMP TABLE person (
pid INT
,name VARCHAR(40)
,dob DATE
,younger_sibling_name VARCHAR(40)
,younger_sibling_dob DATE
);
INSERT INTO person VALUES (pid, name, dob)
(1, 'John', '1980-01-05'),
(2, 'Jimmy', '1975-04-25'),
(3, 'Sarah', '2004-02-10'),
(4, 'Frank', '1934-12-12');
The task is to populate younger_sibling_name
and younger_sibling_dob
with the name and birth day of the person that is closest to them in age, but not older or the same age.
I can set the younger sibling dob
easily because this is the value that determines the record to use with a correlated subquery (I think this is an example of that?):
UPDATE person SET younger_sibling_dob=(
SELECT MAX(dob)
FROM person AS sibling
WHERE sibling.dob < person.dob);
I just can't see any way to get the name
?
The real query of this will run over about 1M records in groups of 100-500 for each MAX selection so performance is a concern.
EDIT:
After trying many different approaches, I've decided on this one which I think is a good balance of being able to verify the data with the intermediate result, shows the intention of what the logic is, and performs adequately:
WITH sibling AS (
SELECT person.pid, sibling.dob, sibling.name,
row_number() OVER (PARTITION BY person.pid
ORDER BY sibling.dob DESC) AS age_closeness
FROM person
JOIN person AS sibling ON sibling.dob < person.dob
)
UPDATE person
SET younger_sibling_name = sibling.name
,younger_sibling_dob = sibling.dob
FROM sibling
WHERE person.pid = sibling.pid
AND sibling.age_closeness = 1;
SELECT * FROM person ORDER BY dob;
Correlated subqueries are infamous for abysmal performance. Doesn't matter much for small tables, matters a lot for big tables. Use one of these instead, preferably the second :
WITH cte AS (
SELECT *, dense_rank() OVER (ORDER BY dob) AS drk
FROM person
)
UPDATE person p
SET younger_sibling_name = y.name
,younger_sibling_dob = y.dob
FROM cte x
JOIN (SELECT DISTINCT ON (drk) * FROM cte) y ON y.drk = x.drk + 1
WHERE x.pid = p.pid;
-> SQLfiddle (with extended test case)
In the CTE cte
use the window function dense_rank()
to get a rank without gaps according to the dop
for every person.
Join cte
to itself, but remove duplicates on dob
from the second instance. Thereby everybody gets exactly one UPDATE
. If more than one person share the same dop
, the same one is selected as younger sibling for all persons on the next dob
. I do this with:
(SELECT DISTINCT ON (rnk) * FROM cte)
Add ORDER BY rnk, ...
if you want to pick a particular person for every dob
.
If no younger person exists, no UPDATE
happens and the columns stay NULL
.
Indices on dob
and pid
make this fast.
WITH cte AS (
SELECT dob, min(name) AS name
,row_number() OVER (ORDER BY dob) rn
FROM person p
GROUP BY dob
)
UPDATE person p
SET younger_sibling_name = y.name
,younger_sibling_dob = y.dob
FROM cte x
JOIN cte y ON y.rn = x.rn + 1
WHERE x.dob = p.dob;
This works, because aggregate functions are applied before window functions. And it should be very fast, since both operations agree on the sort order.
Obviates the need for a later DISTINCT
like in query 1.
Result is the same as query 1, exactly.
Again, you can add more columns to ORDER BY
to pick a particular person for every dob
.
Only needs an index on dob
to be fast.
1) Finding the MAX() can alway be rewritten in terms of NOT EXISTS (...)
UPDATE person dst
SET younger_sibling_name = src.name
,younger_sibling_dob = src.dob
FROM person src
WHERE src.dob < dst.dob
OR src.dob = dst.dob AND src.pid < dst.pid
AND NOT EXISTS (
SELECT * FROM person nx
WHERE nx.dob < dst.dob
OR nx.dob = dst.dob AND nx.pid < dst.pid
AND nx.dob > src.dob
OR nx.dob = src.dob AND nx.pid > src.pid
);
2) Instead of rank() / row_number(), you could also use a LAG() function over the WINDOW:
UPDATE person dst
SET younger_sibling_name = src.name
,younger_sibling_dob = src.dob
FROM (
SELECT pid
, LAG(name) OVER win AS name
, LAG(dob) OVER win AS dob
FROM person
WINDOW win AS (ORDER BY dob, pid)
) src
WHERE src.pid = dst.pid
;
Both versions require a self-joined subquery (or CTE) because UPDATE does not allow window functions.
To get the dob and name, you can do:
update person
set younger_sibling_dob = (select dob
from person p2
where s.dob < person.dob
order by dob desc
limit 1),
younger_sibling_name = (select name
from person p2
where s.dob < person.dob
order by dob desc
limit 1)
If you have an index on dob
, then the query will run faster.
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.