How to speed up a slow UPDATE query

I have the following UPDATE query:

UPDATE Indexer.Pages SET LastError=NULL where LastError is not null;

Right now, this query takes about 93 minutes to complete. I'd like to find ways to make this a bit faster.

The Indexer.Pages table has around 506,000 rows, and about 490,000 of them contain a value for LastError, so I doubt I can take advantage of any indexes here.

The table (when uncompressed) has about 46 gigs of data in it, however the majority of that data is in a text field called html . I believe simply loading and unloading that many pages is causing the slowdown. One idea would be to make a new table with just the Id and the html field, and keep Indexer.Pages as small as possible. However, testing this theory would be a decent amount of work since I actually don't have the hard disk space to create a copy of the table. I'd have to copy it over to another machine, drop the table, then copy the data back which would probably take all evening.
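As a rough sketch of that idea (the side-table name pages_html is hypothetical, and copying the column would still need temporary disk space), the split could look like:

```sql
-- Sketch only: move the large html column into a side table.
CREATE TABLE indexer.pages_html (
  id   uuid PRIMARY KEY REFERENCES indexer.pages (id),
  html text NOT NULL
);

INSERT INTO indexer.pages_html (id, html)
SELECT id, html FROM indexer.pages;

-- After verifying the copy, drop the wide column from the main table:
ALTER TABLE indexer.pages DROP COLUMN html;
```

Note that DROP COLUMN does not immediately reclaim disk space; the old data is only rewritten by a subsequent VACUUM FULL or table rewrite.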

Ideas? I'm using Postgres 9.0.0.

UPDATE:

Here's the schema:

CREATE TABLE indexer.pages
(
  id uuid NOT NULL,
  url character varying(1024) NOT NULL,
  firstcrawled timestamp with time zone NOT NULL,
  lastcrawled timestamp with time zone NOT NULL,
  recipeid uuid,
  html text NOT NULL,
  lasterror character varying(1024),
  missingings smallint,
  CONSTRAINT pages_pkey PRIMARY KEY (id ),
  CONSTRAINT indexer_pages_uniqueurl UNIQUE (url )
);

I also have two indexes:

CREATE INDEX idx_indexer_pages_missingings
  ON indexer.pages
  USING btree
  (missingings )
  WHERE missingings > 0;

and

CREATE INDEX idx_indexer_pages_null
  ON indexer.pages
  USING btree
  (recipeid )
  WHERE NULL::boolean;

There are no triggers on this table, and there is one other table that has a FK constraint on Pages.PageId.

What @kgrittn posted as comment is the best answer so far. I am merely filling in details.

Before you do anything else, you should upgrade PostgreSQL to a current version, at least to the latest point release of your major version. See the versioning policy on the project website.

I also want to stress what Kevin mentioned about indexes involving the column LastError. Normally, HOT updates can recycle dead rows on a data page and make UPDATEs a lot faster, effectively removing (most of) the need for vacuuming.

If your column is used in any index in any way, HOT updates are disabled, because they would break the index(es). If that is the case, you should be able to speed up the query a lot by dropping all of these indexes before the UPDATE and recreating them afterwards.
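Applied to the schema from the question, the drop/update/recreate dance might look like the following sketch (index definitions copied from the question; note that neither index actually references lasterror, so HOT updates may not be blocked here in the first place):

```sql
-- Drop the partial indexes before the bulk UPDATE:
DROP INDEX indexer.idx_indexer_pages_missingings;
DROP INDEX indexer.idx_indexer_pages_null;

UPDATE indexer.pages SET lasterror = NULL WHERE lasterror IS NOT NULL;

-- Recreate them afterwards (building from scratch is much cheaper
-- than maintaining them row by row during the UPDATE):
CREATE INDEX idx_indexer_pages_missingings ON indexer.pages (missingings)
  WHERE missingings > 0;
CREATE INDEX idx_indexer_pages_null ON indexer.pages (recipeid)
  WHERE NULL::boolean;
```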

In this context it would help to run multiple smaller UPDATEs, if:

- the updated column is not involved in any indexes (enabling HOT updates).
- the UPDATE is easily divided into multiple patches in multiple transactions.
- the rows in those patches are spread out over the table (physically, not logically).
- there are no other concurrent transactions keeping dead tuples from being reused.

Then you would not need to VACUUM in between multiple patches, because HOT updates can reuse dead tuples directly - only dead tuples from previous transactions, not from the same or concurrent ones. You may want to schedule a VACUUM at the end of the operation, or just let auto-vacuuming do its job.
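One hypothetical way to form such patches is to slice on the uuid primary key; whether the slices are spread out physically depends on how the uuids were generated, so treat this purely as a sketch:

```sql
-- Run each patch in its own transaction (i.e. as separate statements
-- with autocommit, not inside one BEGIN ... COMMIT block).
UPDATE indexer.pages SET lasterror = NULL
WHERE  lasterror IS NOT NULL
AND    id < '40000000-0000-0000-0000-000000000000';

UPDATE indexer.pages SET lasterror = NULL
WHERE  lasterror IS NOT NULL
AND    id >= '40000000-0000-0000-0000-000000000000'
AND    id <  '80000000-0000-0000-0000-000000000000';

-- ... and so on for the remaining uuid ranges.
```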

The same could be done with any other index that is not needed for the UPDATE - and judging from your numbers the UPDATE is not going to use an index anyway. If you update large parts of your table, building new indexes from scratch is much faster than incrementally updating indexes with every changed row.

Also, your update is not likely to break any foreign key constraints. You could try to drop & recreate those, too. This does open a time slot where referential integrity is not enforced. If the integrity is violated during the UPDATE, you get an error when trying to recreate the FK. If you do it all within one transaction, concurrent transactions never get to see the dropped FK, but you take a write lock on the table - same as with dropping / recreating indexes or triggers.
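The question only says that one other table references Pages, so child_table and fk_name below are placeholders; a sketch of the FK variant:

```sql
BEGIN;
-- child_table / fk_name are hypothetical names for the referencing
-- table and its constraint from the question.
ALTER TABLE child_table DROP CONSTRAINT fk_name;

UPDATE indexer.pages SET lasterror = NULL WHERE lasterror IS NOT NULL;

-- Recreating the FK re-validates referential integrity in one pass:
ALTER TABLE child_table
  ADD CONSTRAINT fk_name FOREIGN KEY (pageid)
  REFERENCES indexer.pages (id);
COMMIT;
```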

Lastly, disable & enable triggers that are not needed for the update.

Be sure to do all of this in one transaction. Maybe do it in a number of smaller patches, so it does not block concurrent operations for too long.

So:

BEGIN;
ALTER TABLE tbl DISABLE TRIGGER user; -- disable all self-made triggers
-- DROP indexes (& fk constraints ?)
-- UPDATE ...
-- RECREATE indexes (& fk constraints ?)
ALTER TABLE tbl ENABLE TRIGGER user;
COMMIT;

You cannot run VACUUM inside a transaction block. Per documentation:

VACUUM cannot be executed inside a transaction block.

You could split your operation into a few big chunks and run in between:

VACUUM ANALYZE tbl;

If you don't have to deal with concurrent transactions you could (even more effectively):

ALTER TABLE tbl DISABLE TRIGGER user; -- disable all self-made triggers
-- DROP indexes (& fk constraints ?)

-- Multiple UPDATEs with logical slices of the table
-- each slice in its own transaction.
-- VACUUM ANALYZE tbl;  -- optionally in between, or autovacuum kicks in

-- RECREATE indexes (& fk constraints ?)
ALTER TABLE tbl ENABLE TRIGGER user;
UPDATE Indexer.Pages SET LastError = NULL;

The WHERE clause is not needed since the NULL fields are already NULL, so it does no harm to set them to NULL again (I don't think this would affect performance significantly).

Given your number_of_rows = 500K and your table size = 46G, I conclude that your average row size is 90KB. That is huge. Maybe you could move rarely used or sparse columns of your table to other tables?

Your theory is probably correct. Reading the full table (and then doing anything) is probably causing the slow-down.

Why don't you just create another table that has PageId and LastError? Initialize this with the data in the table you have now (which should take less than 93 minutes). Then, use the LastError from the new table.
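A sketch of that approach (the table name page_errors is hypothetical):

```sql
-- Side table holding just the error state per page:
CREATE TABLE indexer.page_errors (
  pageid    uuid PRIMARY KEY REFERENCES indexer.pages (id),
  lasterror character varying(1024) NOT NULL
);

-- Initialize from the existing table; only ~490K narrow rows are
-- copied, not the 46 GB of html data:
INSERT INTO indexer.page_errors (pageid, lasterror)
SELECT id, lasterror FROM indexer.pages
WHERE  lasterror IS NOT NULL;
```

With this layout, clearing all errors becomes a cheap TRUNCATE or DELETE on the small side table instead of rewriting the large one.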

At your leisure, you can remove LastError from your existing table.

By the way, I don't normally recommend keeping two copies of a column in two separate tables. In this case, though, you sound like you are stuck and need a way to proceed.
