Redshift UPDATE is very slow and uses a Seq Scan

Question

I have to update about 300 rows in a large table (600 million rows), and I'm trying to make it faster.

The query I am using is a bit tricky:

UPDATE my_table
SET name = CASE WHEN (event_name in ('event_1', 'event_2', 'event_3')) 
THEN 'deleted' ELSE name END
WHERE uid IN ('id_1', 'id_2')

I ran EXPLAIN on this query and got:

XN Seq Scan on my_table  (cost=0.00..103935.76 rows=4326 width=9838)
   Filter: (((uid)::text = 'id_1'::text) OR ((uid)::text = 'id_2'::text))

I have an interleaved sort key, and uid is one of the columns included in it.

The query looks like this because, in the real context, the number of columns in SET (along with name) may vary, though it probably won't be more than 10. The basic idea is that I don't want a cross join: the update rules are specific to each column, and I don't want to mix them together. For example, in the future there will be a query like:

UPDATE my_table
SET name = CASE WHEN (event_name in ('event_1', 'event_2', 'event_3')) THEN 'deleted' ELSE name END,
address = CASE WHEN (event_name in ('event_1', 'event_4')) THEN 'deleted' ELSE address END
WHERE uid IN ('id_1', 'id_2')

Anyway, back to the first query: it runs for a very long time (about 45 minutes) and uses 100% CPU.

I also tried an even simpler query:

explain UPDATE my_table SET name = 'deleted' WHERE uid IN ('id_1', 'id_2')
XN Seq Scan on my_table  (cost=0.00..103816.80 rows=4326 width=9821)
   Filter: (((uid)::text = 'id_1'::text) OR ((uid)::text = 'id_2'::text))

I don't know what else I can add to make the question clearer; I would be happy to hear any advice.

Answer 1

Have you tried removing the interleaved sort key and replacing it with a simple sort key on uid, or a compound sort key with uid as the first column?
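One way to try this suggestion is a deep copy into a new table with the different sort key. The following is only a sketch: my_table_by_uid and my_table_old are hypothetical names, and it assumes there is enough free disk space for a second copy of the table. Note that EXPLAIN may still report a Seq Scan afterwards, since Redshift has no indexes; any benefit from the sort key comes from skipping blocks, so compare run times rather than plans.

```sql
-- Assumption: my_table_by_uid and my_table_old are hypothetical names.
-- CREATE TABLE AS does not carry over defaults or constraints and picks
-- its own column encodings; add DISTSTYLE/DISTKEY explicitly if the
-- original table relies on a particular distribution.
CREATE TABLE my_table_by_uid
COMPOUND SORTKEY (uid)   -- uid is the leading (here: only) sort column
AS
SELECT * FROM my_table;

-- Time the simple UPDATE from the question against the copy first.
UPDATE my_table_by_uid
SET name = 'deleted'
WHERE uid IN ('id_1', 'id_2');

-- Swap the tables only once the copy behaves as expected.
ALTER TABLE my_table RENAME TO my_table_old;
ALTER TABLE my_table_by_uid RENAME TO my_table;
```

Keeping my_table_old around until the new table has been verified makes it possible to rename back if something is off.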

Also, the name uid makes me think that you may be using a GUID/UUID as the value. I would suggest that this is an anti-pattern for an id value in Redshift, and especially for a sort key.

Problems with GUID/UUID ids:

They do not occur in a predictable sequence
- Often triggers a full sequential scan
- New rows always disrupt the sort
They compress very poorly
- Requires more disk space for storage
- Requires more data to be read when queried

Answer 2

An UPDATE in Redshift is a delete followed by an insert. By design, Redshift only marks the rows as deleted rather than removing them physically (ghost rows). An explicit VACUUM DELETE ONLY <table_name> is required to reclaim the space.

The Seq Scan is impacted by these ghost rows. I would suggest running the command above and checking the query's performance afterwards.
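Applied to the table from the question, the vacuum could look roughly like this. The ANALYZE and the svv_table_info check are additions that this answer does not mention; they are included only as one way to inspect the table before and after.

```sql
-- Reclaim space held by rows that earlier UPDATE/DELETE statements
-- marked as deleted.
VACUUM DELETE ONLY my_table;

-- Addition (not from the answer): refresh planner statistics.
ANALYZE my_table;

-- Addition (not from the answer): inspect row count, unsorted percentage,
-- and statistics staleness via the svv_table_info system view.
SELECT "table", tbl_rows, unsorted, stats_off
FROM svv_table_info
WHERE "table" = 'my_table';
```

Running the SELECT before and after the vacuum gives a rough picture of whether the table state actually changed.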

2 answers

Answer 1: 1 (accepted), 2017-02-21 19:38:02
Answer 2: 0, 2017-04-08 23:27:55