Delete rows where ID exists more than once in Redshift
I'm playing around with Redshift for practice. I'm loading data into a Redshift table on a daily basis and trying to remove duplicates after each ingestion. I initially tried the following, creating a new table with distinct records and then dropping the old one:
CREATE TABLE reddit_new AS SELECT DISTINCT * FROM reddit;
ALTER TABLE reddit RENAME TO reddit_old;
ALTER TABLE reddit_new RENAME TO reddit;
DROP TABLE reddit_old;
However, I then realised that although some rows have the same ID, certain columns are always different. So rather than removing duplicate rows, I need to remove rows where the ID is a duplicate. Ideally, I want to keep the record that has the most recent date. If two records have the same date, then either one can be removed. So in the following example, only row 2 would be removed.
ID Date
34 2022-02-01
23 2022-03-05
12 2022-03-06
23 2022-03-18
I also thought about updating my COPY command to only add records whose ID doesn't already exist in the table, but I'm not sure if that's possible. This is my current COPY command, which runs daily, copying from a new file in S3:
f"COPY public.Reddit FROM '{s3_file}' iam_role '{role_string}' IGNOREHEADER 1 DELIMITER ',' CSV"
A common pattern to address this is not to copy into your table directly, but first into a (possibly temporary) staging table, and then to use the data in that staging table to delete from the primary table.
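-- Staging table with the same columns as the target (Redshift needs parentheses around LIKE).
CREATE TABLE staging (LIKE public."Reddit");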
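-- Load the day's file into staging rather than into the target table.
COPY staging FROM '<s3_file>' iam_role '<role>' ignoreheader 1 delimiter ',' csv;
-- Remove target rows that are superseded by a staged row with the same ID.
DELETE FROM public."Reddit"
USING staging
WHERE
public."Reddit"."ID" = staging."ID"
AND public."Reddit"."Date" <= staging."Date";
-- Move the staged rows into the target, then clean up.
ALTER TABLE public."Reddit" APPEND FROM staging;
DROP TABLE IF EXISTS staging;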
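To make the effect of the delete-then-append steps concrete, here is a small pure-Python model of them, run on the example rows from the question. It is only a sketch and not Redshift code. It assumes the first three rows are already in the table and that the fourth arrives in the daily file. The function name `merge_staged` is made up for illustration.

```python
from datetime import date


def merge_staged(primary, staged):
    """Model DELETE ... USING staging followed by appending the staged rows.

    Rows are (id, date) tuples. Hypothetical helper for illustration only.
    """
    # DELETE step: drop a primary row when some staged row has the same ID
    # and a date that is the same or later.
    kept = [
        (row_id, row_date)
        for row_id, row_date in primary
        if not any(s_id == row_id and row_date <= s_date for s_id, s_date in staged)
    ]
    # APPEND step: every staged row is then added to the primary table.
    return kept + list(staged)


# Assumed split of the question's example: three existing rows, one new row.
primary = [
    (34, date(2022, 2, 1)),
    (23, date(2022, 3, 5)),
    (12, date(2022, 3, 6)),
]
staged = [(23, date(2022, 3, 18))]

result = merge_staged(primary, staged)
for row_id, row_date in result:
    print(row_id, row_date)
# 34 2022-02-01
# 12 2022-03-06
# 23 2022-03-18
```

The model also makes one limitation visible. The append step adds every staged row, so a staged row that is older than the existing row for the same ID would still be added. The pattern therefore relies on each daily file carrying the newest version of each ID.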
In the SQL above I have not used a temporary table, just so that ALTER TABLE APPEND can work, but you can use INSERT INTO from a temporary table instead.
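As a rough sketch of the temporary-table variant, the daily job could build its statements in Python, in the same f-string style as the COPY command in the question. The function name `build_load_statements`, the bucket path, and the role ARN below are placeholders invented for illustration. The statements are only generated here, not executed.

```python
def build_load_statements(s3_file, role_string,
                          target='public."Reddit"', staging="staging"):
    """Return the SQL statements for one daily load, in execution order."""
    return [
        # Temporary table with the same columns as the target table.
        f"CREATE TEMP TABLE {staging} (LIKE {target});",
        # Same COPY options as in the question, pointed at the staging table.
        f"COPY {staging} FROM '{s3_file}' iam_role '{role_string}' "
        "IGNOREHEADER 1 DELIMITER ',' CSV;",
        # Remove target rows superseded by a staged row with the same ID.
        f"DELETE FROM {target} USING {staging} "
        f'WHERE {target}."ID" = {staging}."ID" '
        f'AND {target}."Date" <= {staging}."Date";',
        # INSERT INTO is used because the staging table is temporary.
        f"INSERT INTO {target} SELECT * FROM {staging};",
        f"DROP TABLE IF EXISTS {staging};",
    ]


# Placeholder values; substitute the real S3 path and IAM role.
statements = build_load_statements(
    "s3://example-bucket/reddit/2022-03-18.csv",
    "arn:aws:iam::123456789012:role/example-role",
)
for sql in statements:
    print(sql)
```

Each statement can then be passed, in order, to whatever cursor or client the existing COPY command is already run through.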