PostgreSQL: What's an efficient way to update 3m records?

A dumbveloper at my work (years ago) moved the body column from our comments table to a secondary comment_extensions table as some sort of sketchy guesswork optimization. It seems ill-advised to do a join every time we want to display a comment, so I'm going to try moving that column back into our comments table and run some benchmarks.

My problem is that this update crawls. I let it run for an hour before shutting it off, fearing that it would take all night.

UPDATE comments SET body = comment_extensions.body 
                FROM comment_extensions 
                WHERE comments.id = comment_extensions.comment_id;

It's a PostgreSQL 8.1 database, and comment_extensions.comment_id is indexed.

Any suggestions for making this run faster?

Well, as an academic question: why is this ill-advised? What percentage of lookups actually involves needing to know the comment info?

My suggestion: update in small batches (10,000 rows at a time?). It may still take all night. Depending on the nature of your system, you may also have to implement cut-over logic that prevents the system from updating or pulling from your extensions table during this migration.
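A minimal sketch of that batching idea, assuming comments.id is an integer key you can slice into ranges (the bounds and batch size below are placeholders you'd drive from psql or a shell loop):

-- Update one slice of roughly 10,000 ids per transaction so each
-- commit stays small; rerun with the next range until done.
BEGIN;
UPDATE comments
   SET body = ce.body
  FROM comment_extensions ce
 WHERE comments.id = ce.comment_id
   AND comments.id >= 1          -- lower bound of this batch (placeholder)
   AND comments.id <  10001;     -- upper bound of this batch (placeholder)
COMMIT;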

Large databases hurt like that ;)

How about this?

http://www.postgresql.org/docs/8.1/interactive/sql-createtableas.html

CREATE TABLE joined_comments
    AS SELECT c.id, c.author, c.blablabla, ce.body
    FROM comments c LEFT JOIN comment_extensions ce
    ON c.id = ce.comment_id;

That would create a new joined_comments table. That could be almost enough (you'd still need to recreate indexes and so on), but I remember Postgres 8.1 had a bug with the way serial columns get created (sorry, can't find a link).

So my suggestion: once you have this new joined table, COPY TO a BINARY file from joined_comments, create a new comments table declaring id as SERIAL right from the start, then COPY FROM that BINARY file into the new comments table. Then recreate the indexes.
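Roughly, that sequence would look like the sketch below. The file path, the extra columns, and the final index are placeholders carried over from the CREATE TABLE AS example above, and binary COPY only works if the column types of the new table line up exactly with joined_comments:

-- 1. Dump the merged rows to a binary file (server-side path; the
--    postgres server process must be able to write there).
COPY joined_comments TO '/tmp/joined_comments.bin' WITH BINARY;

-- 2. Rebuild comments with id declared SERIAL from the start.
DROP TABLE comments;
CREATE TABLE comments (
    id        SERIAL PRIMARY KEY,
    author    text,
    blablabla text,   -- stand-in for whatever other columns you have
    body      text
);

-- 3. Reload, bump the id sequence past the reloaded rows, and
--    recreate any secondary indexes you had.
COPY comments FROM '/tmp/joined_comments.bin' WITH BINARY;
SELECT setval('comments_id_seq', (SELECT max(id) FROM comments));
CREATE INDEX comments_author_idx ON comments (author);   -- example only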

You might get some benefit from disabling logging while doing this. If this is a test on a non-production table, you probably don't need the protection the log gives you.

If there is an index or key on comments.body then drop it before the update and recreate it afterward.
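If such an index does exist, the sequence would be something like this; the index name and definition here are hypothetical:

DROP INDEX comments_body_idx;    -- hypothetical index name

UPDATE comments SET body = comment_extensions.body
  FROM comment_extensions
 WHERE comments.id = comment_extensions.comment_id;

CREATE INDEX comments_body_idx ON comments (body);   -- recreate afterward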

Is the comments.body field a fixed-width char(N) or a varchar? Varchar used to be slower than char(), and I suspect it still is, so use char rather than varchar.

If you do a select that merges the data into a data file (say, quoted CSV), write a script to turn that into INSERTs, then empty the comments table and load it with those INSERTs, that might be faster than the query you have, though the index on comments.id is already helping the speed.
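A rough SQL sketch of that idea, with COPY standing in for the generated INSERTs since it does the same bulk load with less overhead. On 8.1 COPY reads and writes whole tables rather than arbitrary queries, so the merged rows go through a staging table first; the file path and column list are placeholders:

-- Stage the merged rows (same join as the CREATE TABLE AS answer).
CREATE TABLE comments_staging AS
    SELECT c.id, c.author, c.blablabla, ce.body
      FROM comments c
 LEFT JOIN comment_extensions ce ON c.id = ce.comment_id;

-- Dump to quoted CSV, empty the target table, and reload it.
COPY comments_staging TO '/tmp/comments.csv' WITH CSV;
TRUNCATE comments;
COPY comments (id, author, blablabla, body) FROM '/tmp/comments.csv' WITH CSV;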

3e6 records are going to take some time regardless.
