Remove rows where the ID exists more than once in Redshift

Question

I'm playing around with Redshift for practice. I'm loading data into a Redshift table on a daily basis and trying to remove duplicates after each ingestion. I initially tried the following, which creates a new table with distinct records and then drops the old one.

CREATE TABLE reddit_new AS SELECT DISTINCT * FROM reddit;
ALTER TABLE reddit RENAME TO reddit_old;
ALTER TABLE reddit_new RENAME TO reddit;
DROP TABLE reddit_old;

However, I then realised that although some rows have the same ID, certain columns always differ between them.

So rather than removing duplicate rows, I need to remove rows where the ID is a duplicate. Ideally, I want to keep the record that has the most recent date. If two records have the same date, either one can be removed. So in the following example, only row 2 would be removed.

ID      Date
34      2022-02-01
23      2022-03-05
12      2022-03-06
23      2022-03-18

I also thought about updating my COPY command to only add records where the ID doesn't already exist in the table, but I'm not sure if that's possible. This is my current COPY command, which runs daily, copying from a new file in S3:

f"COPY public.Reddit FROM '{s3_file}' iam_role '{role_string}' IGNOREHEADER 1 DELIMITER ',' CSV"

Answer 1

A common pattern to address this is not to copy into your table directly, but rather to copy first into a (possibly temporary) staging table, then use the data in that table to delete from the primary table.

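-- Stage the new file in a table with the same columns as the target
CREATE TABLE staging (LIKE public."Reddit");
COPY staging FROM '<s3_file>' iam_role '<role>' IGNOREHEADER 1 DELIMITER ',' CSV;
-- Remove existing rows that the new file supersedes (same ID, same or newer date)
DELETE FROM public."Reddit"
USING staging
WHERE
  public."Reddit"."ID" = staging."ID"
  AND public."Reddit"."Date" <= staging."Date";
-- Move the staged rows into the target, then clean up
ALTER TABLE public."Reddit" APPEND FROM staging;
DROP TABLE IF EXISTS staging;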

Here I have not used a temporary table, just so that ALTER TABLE APPEND can work, but you can use INSERT INTO from a temporary table instead.
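The temporary-table variant mentioned above could look roughly like the following. This is only a sketch, written in Python to match the f-string COPY command in the question. The function name `build_staged_load` is made up for illustration, the table and column names are taken from the answer, and how the statements get executed against Redshift is left to whatever client you already use.

```python
from typing import List


def build_staged_load(s3_file: str, role_string: str) -> List[str]:
    """Return the SQL statements for one daily load, in execution order.

    Hypothetical helper: it only builds strings and does not connect to Redshift.
    """
    return [
        # Temporary staging table with the same columns as the target table.
        'CREATE TEMP TABLE staging (LIKE public."Reddit");',
        # Same COPY options as in the question, pointed at the staging table.
        f"COPY staging FROM '{s3_file}' iam_role '{role_string}' "
        "IGNOREHEADER 1 DELIMITER ',' CSV;",
        # Delete target rows that share an ID with a staged row of the same or newer date.
        'DELETE FROM public."Reddit" USING staging '
        'WHERE public."Reddit"."ID" = staging."ID" '
        'AND public."Reddit"."Date" <= staging."Date";',
        # INSERT INTO ... SELECT takes the place of ALTER TABLE APPEND here.
        'INSERT INTO public."Reddit" SELECT * FROM staging;',
        # Optional cleanup; a temporary table is dropped at session end anyway.
        "DROP TABLE IF EXISTS staging;",
    ]
```

Run the returned statements in order within a single session, since the temporary table is only visible to the session that created it. As in the version above, a staged row that is older than the matching row in the target is not deleted against and is still inserted, so this assumes each new file carries the newest data for its IDs.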
