Redshift UPDATE uses Seq Scan and is very slow

I have to update about 300 rows in a large table (600m rows) and I'm trying to make it faster.

The query I am using is a bit tricky:

UPDATE my_table
SET name = CASE WHEN (event_name in ('event_1', 'event_2', 'event_3')) 
THEN 'deleted' ELSE name END
WHERE uid IN ('id_1', 'id_2')

I ran EXPLAIN on this query and got:

XN Seq Scan on my_table  (cost=0.00..103935.76 rows=4326 width=9838)
   Filter: (((uid)::text = 'id_1'::text) OR ((uid)::text = 'id_2'::text))

I have an interleaved sort key, and uid is one of the columns in it. The query looks like this because, in the real context, the number of columns in SET (in addition to name) may vary, though it probably won't exceed 10. The basic idea is that each column has its own update rule and I don't want to mix them together (no cross join). For example, in the future there will be a query like:

UPDATE my_table
SET name = CASE WHEN (event_name in ('event_1', 'event_2', 'event_3')) THEN 'deleted' ELSE name END,
address = CASE WHEN (event_name in ('event_1', 'event_4')) THEN 'deleted' ELSE address END
WHERE uid IN ('id_1', 'id_2')

Anyway, back to the first query: it runs for a very long time (about 45 minutes) and uses 100% CPU.

I also tried an even simpler query:

explain UPDATE my_table SET name = 'deleted' WHERE uid IN ('id_1', 'id_2')
XN Seq Scan on my_table  (cost=0.00..103816.80 rows=4326 width=9821)
   Filter: (((uid)::text = 'id_1'::text) OR ((uid)::text = 'id_2'::text))

I don't know what else I can add to make the question clearer; I would be happy to hear any advice.

Have you tried removing the interleaved sort key and replacing it with a simple sort key on uid or a compound sort key with uid as the first column?
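
A minimal sketch of that experiment, assuming a deep copy via CREATE TABLE AS is acceptable for a table of this size. The name my_table_by_uid is hypothetical, and the distribution style is left to Redshift's defaults here, so any DISTKEY or column encodings you rely on would need to be stated explicitly:

```sql
-- Hypothetical copy of my_table with a compound sort key led by uid.
-- The sort key is stated explicitly; distribution is left to defaults in this sketch.
CREATE TABLE my_table_by_uid
COMPOUND SORTKEY (uid)
AS
SELECT * FROM my_table;

-- Refresh planner statistics on the copy.
ANALYZE my_table_by_uid;

-- Re-check the plan and, more importantly, the run time against the copy.
EXPLAIN
UPDATE my_table_by_uid
SET name = 'deleted'
WHERE uid IN ('id_1', 'id_2');
```

The plan may still show XN Seq Scan, since Redshift has no indexes; any gain would come from skipping blocks whose uid range excludes the listed values, so compare run time rather than the shape of the plan.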

Also, the name uid makes me think that you may be using a GUID/UUID as the value. I would suggest that this is an anti-pattern for an id value in Redshift, and especially for a sort key.

Problems with GUID/UUID ids:

  • They do not occur in a predictable sequence
    • Often trigger a full sequential scan
    • New rows always disrupt the sort
  • They compress very poorly
    • Require more disk space for storage
    • Require more data to be read when queried
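
To check whether these points apply to your table, the column definitions can be inspected in pg_table_def. This is only a sketch, and it assumes my_table lives in a schema on the current search_path, since pg_table_def lists only those tables:

```sql
-- Show each column's type, compression encoding, and sort key membership.
-- A non-zero sortkey value means the column is part of the sort key.
SELECT "column", type, encoding, sortkey
FROM pg_table_def
WHERE tablename = 'my_table';
```

A wide character type on uid would be consistent with GUID-like values, and the encoding column shows which compression, if any, is applied to it.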

An UPDATE in Redshift is a delete followed by an insert. By design, Redshift only marks the old rows as deleted and does not remove them physically (ghost rows). An explicit VACUUM DELETE ONLY <table_name> is required to reclaim the space.

The sequential scan is affected by these ghost rows. I would suggest running the command above and then checking the query's performance again.
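
As a rough sketch of that check, assuming the table is named my_table as in the question and that your user is allowed to vacuum it:

```sql
-- Compare total rows (which include rows marked for deletion)
-- with the estimated number of visible rows.
SELECT "table", tbl_rows, estimated_visible_rows, unsorted, stats_off
FROM svv_table_info
WHERE "table" = 'my_table';

-- Reclaim space from rows marked for deletion, without re-sorting.
VACUUM DELETE ONLY my_table;

-- Refresh statistics, then re-run the UPDATE and compare timing.
ANALYZE my_table;
```

A large gap between tbl_rows and estimated_visible_rows before the vacuum would suggest that ghost rows are contributing to the scan cost.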
