How can I speed up this PostgreSQL UPDATE FROM sql query? It currently takes days to finish running

How can I speed up the PostgreSQL UPDATE FROM sql query below? It currently takes days to finish running.

UPDATE import_parts ip
SET part_part_id = pp.id
FROM parts.part_parts pp
WHERE pp.upc = ip.upc
AND (ip.status is null or ip.status != '6'); 

And why does it take days to run in the first place?

Most of the time I kill the query manually because it runs for more than 24 hours. The last time it finished successfully, it took almost 38 hours.

import_parts table has 971971 rows

parts.part_parts table has 2196357 rows

parts.part_parts table has an index on upc and id is the primary key of the table.

I already tried running VACUUM ANALYZE on import_parts table and parts.part_parts table before the update query above runs but the query still takes too long to run, so I manually killed it after 30 minutes. I'm hoping to be able to run the query in under 30 minutes.
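For reference, the preparation steps described above can be sketched as follows. This is only an illustration of the commands mentioned in the question. Plain EXPLAIN (without ANALYZE) only plans the statement and does not execute the update.

```sql
-- Refresh statistics and reclaim dead rows on both tables
VACUUM ANALYZE import_parts;
VACUUM ANALYZE parts.part_parts;

-- Show the plan without running the update
EXPLAIN
UPDATE import_parts ip
SET part_part_id = pp.id
FROM parts.part_parts pp
WHERE pp.upc = ip.upc
AND (ip.status IS NULL OR ip.status != '6');
```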

Here's the result of EXPLAIN when I run the query after running VACUUM ANALYZE on import_parts table and parts.part_parts table:

[Image: EXPLAIN result]

UPDATE 1:

I also tried setting enable_nestloop to off: SET enable_nestloop TO off

But the query still takes too long to run so I manually killed it. Here's the result of EXPLAIN when enable_nestloop is turned off:

[Image: EXPLAIN result with enable_nestloop turned off]

UPDATE 2:

Here's the result of EXPLAIN when using the query suggested by Abelisto in his answer to this post:

[Image: EXPLAIN result using Abelisto's suggested query]

When I actually run the query though, I'm encountering this error:

ERROR: more than one row returned by a subquery used as an expression

I'm still figuring out how to fix the error.

First of all, try rewriting your query like this:

UPDATE import_parts ip
SET part_part_id = (
  SELECT pp.id
  FROM parts.part_parts pp
  WHERE pp.upc = ip.upc)
WHERE status is null or status != '6'; 

If upc is not unique in parts.part_parts, this raises an error like:

ERROR:  more than one row returned by a subquery used as an expression

Fix it by adding conditions: the subquery must return at most one row for each row in the target table.
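One possible way to satisfy that constraint, assuming any matching id is acceptable when a upc has several, is to pick the smallest id with ORDER BY and LIMIT 1. This is a sketch, not necessarily the condition the answer had in mind. Note that, unlike UPDATE ... FROM, the subquery form sets part_part_id to NULL for rows that have no matching upc.

```sql
UPDATE import_parts ip
SET part_part_id = (
  SELECT pp.id
  FROM parts.part_parts pp
  WHERE pp.upc = ip.upc
  ORDER BY pp.id   -- assumption: the lowest id is an acceptable choice
  LIMIT 1)         -- guarantees at most one row per target row
WHERE ip.status IS NULL OR ip.status != '6';
```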

From what you say, it seems that upc is not unique in parts.part_parts. Try running this:

select upc, count(*)
from parts.part_parts pp
group by upc
having count(*) > 1;

These duplicates are probably causing the performance problems. You could get around this by arbitrarily choosing a value, such as:

UPDATE import_parts ip
  SET part_part_id = pp.id
  FROM (SELECT pp.upc, MIN(pp.id) as id
        FROM parts.part_parts pp
        GROUP BY pp.upc
       ) pp
  WHERE pp.upc = ip.upc AND (ip.status is null or ip.status <> '6'); 

  1. Create an index on import_parts with the columns upc, status.

  2. I would recommend splitting the update into two statements:

I don't know your status values, but I suppose you have statuses such as null, 1, 2, 3, 4, 5, 6, 7.

UPDATE import_parts ip
SET part_part_id = pp.id
FROM parts.part_parts pp
WHERE pp.upc = ip.upc
AND ip.status is null
;

UPDATE import_parts ip
SET part_part_id = pp.id
FROM parts.part_parts pp
WHERE pp.upc = ip.upc
AND ip.status IN(1, 2, 3, 4, 5, 7)
;
  3. Of course, you need to replace 1, 2, 3, 4, 5, 7 with your own values (everything other than 6).

  4. I also like the answer by @Gordon Linoff, but it depends on how many rows you have per upc.
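
The index from step 1 could be created along these lines. The index name is hypothetical; only the table and column names come from the question.

```sql
-- Composite index on the join column and the filter column
-- (import_parts_upc_status_idx is an illustrative name)
CREATE INDEX import_parts_upc_status_idx
  ON import_parts (upc, status);
```

Running EXPLAIN on the update again after creating the index shows whether the planner actually uses it.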
