Update or Insert based on key columns in Redshift

Question

I am loading CSV files into Redshift daily. To handle duplicates, I load the files into a staging table and then use Update or Insert scripts based on key columns to load the target table. Recently I unexpectedly found duplicate data in the target table.

I double-checked my scripts and don't see any reason for the duplicates. Below are the Update and Insert script formats that I am using.

For inserting:

      Insert into target (key1, key2, col3, col4)
      Select key1, key2, col3, col4 
      From stage s where not exists (select 1 from target t
                        where s.key1 = t.key1 and
                        s.key2 = t.key2);

And for update:

      Update target Set
          key1=s.key1, key2=s.key2, col3=s.col3, col4=s.col4
      From stage s where target.key1=s.key1 and target.key2=s.key2;

Any help is appreciated.

Answer 1

I ran into this too. The problem was in the insert...select..., where the select itself produced duplicates. One solution for us was to use a cursor (outside of Redshift) to run the select and insert one record at a time, but this proved to have performance issues. Instead, we now check for duplicates with an initial select

select key1,key2 from stage group by key1,key2 having count(*) > 1;

and stop the process if records are returned.
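
If stopping the load is not an option, one possible alternative is to collapse duplicate keys inside the insert itself. The not exists test only compares each staged row against the target, so two staged rows sharing the same key1 and key2 both pass it and both get inserted. The sketch below is not from the original scripts: it assumes that when several staged rows share a key, keeping any single one of them is acceptable, and the alias rn and the ordering by col3, col4 are illustrative choices only.

      -- Sketch only: keep one staged row per (key1, key2) before inserting.
      -- Assumption: any one of the duplicate staged rows may be kept;
      -- change the ORDER BY if a specific row (e.g. the latest) must win.
      Insert into target (key1, key2, col3, col4)
      Select s.key1, s.key2, s.col3, s.col4
      From (select key1, key2, col3, col4,
                   row_number() over (partition by key1, key2
                                      order by col3, col4) as rn
            from stage) s
      where s.rn = 1
        and not exists (select 1 from target t
                        where s.key1 = t.key1 and
                        s.key2 = t.key2);

Because Redshift treats primary key and unique constraints as informational and does not enforce them, a filter like this (or the check above) is what actually keeps duplicates out of the target.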

Solution 1
2 2014-08-20 22:39:07