
Delete rows where ID exists more than once in Redshift

I'm playing around with Redshift for practice. I load data into a Redshift table on a daily basis and try to remove duplicates after each ingestion. I initially tried the following, which creates a new table with distinct records and then drops the old one.

CREATE TABLE reddit_new AS SELECT DISTINCT * FROM reddit;
ALTER TABLE reddit RENAME TO reddit_old;
ALTER TABLE reddit_new RENAME TO reddit;
DROP TABLE reddit_old;

However, I then realised that although some rows have the same ID, certain columns are always different, so SELECT DISTINCT * treats them as distinct rows.

So rather than removing duplicate rows, I need to remove rows where the ID is duplicated. Ideally, I want to keep the record with the most recent date. If two records have the same date, either one can be removed. In the following example, only row 2 would be removed.

ID      Date
34      2022-02-01
23      2022-03-05
12      2022-03-06
23      2022-03-18
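The rule described above (one row per ID, keeping the most recent date) can be sketched with a window function. This is only an illustration under assumptions: it is not a technique from the original post, it uses an in-memory SQLite database as a stand-in for Redshift, and the table has been reduced to the two columns shown in the example. The SELECT with ROW_NUMBER() should carry over to Redshift, but it has not been run there.

```python
import sqlite3

# SQLite stands in for Redshift here; table and column names are simplified.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE reddit (id INTEGER, date TEXT)")
conn.executemany(
    "INSERT INTO reddit VALUES (?, ?)",
    [(34, "2022-02-01"), (23, "2022-03-05"), (12, "2022-03-06"), (23, "2022-03-18")],
)

# Rank rows within each id, newest first, and keep only rank 1.
# Rows that tie on date get an arbitrary order, so either one may be kept.
conn.execute(
    """
    CREATE TABLE reddit_new AS
    SELECT id, date
    FROM (
        SELECT id, date,
               ROW_NUMBER() OVER (PARTITION BY id ORDER BY date DESC) AS rn
        FROM reddit
    ) AS ranked
    WHERE rn = 1
    """
)

deduped = conn.execute("SELECT id, date FROM reddit_new ORDER BY id").fetchall()
print(deduped)
# [(12, '2022-03-06'), (23, '2022-03-18'), (34, '2022-02-01')]
```

The older row for ID 23 (row 2 in the example) is the only one dropped. With a real table, the inner and outer SELECT would need to list every column that should be kept.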

I also thought about changing my COPY command to only add records whose ID doesn't already exist in the table, but I'm not sure that's possible. This is my current COPY command, which runs daily and copies from a new file in S3:

f"COPY public.Reddit FROM '{s3_file}' iam_role '{role_string}' IGNOREHEADER 1 DELIMITER ',' CSV"

A common pattern for this is not to copy into your table directly, but to copy first into a (possibly temporary) staging table, then use the data in that table to delete the superseded rows from the primary table.

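-- Staging table with the same columns as the target
CREATE TABLE staging (LIKE public."Reddit");
COPY staging FROM '<s3_file>' IAM_ROLE '<role>' IGNOREHEADER 1 DELIMITER ',' CSV;
-- Delete existing rows that a staged row with the same ID supersedes
DELETE FROM public."Reddit"
USING staging
WHERE
  public."Reddit"."ID" = staging."ID"
  AND public."Reddit"."Date" <= staging."Date";
-- Move the staged rows into the target table
ALTER TABLE public."Reddit" APPEND FROM staging;
DROP TABLE IF EXISTS staging;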

I have not used a temporary table here so that ALTER TABLE APPEND can work, but you can use INSERT INTO ... SELECT from a temporary table instead.
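As a rough sketch of the temporary-table variant, the daily load could be built as a list of statements in the same f-string style as the COPY command in the question. The function name staging_merge_statements, the target parameter, and the example bucket and role ARN are hypothetical; the statements are only assembled here, not run against a cluster.

```python
def staging_merge_statements(s3_file, role_string, target='public."Reddit"'):
    """Build the SQL for one daily load via a temporary staging table.

    Hypothetical helper: the values are interpolated directly, so they are
    assumed to come from trusted configuration.
    """
    return [
        # Temporary table with the same columns as the target.
        f"CREATE TEMP TABLE staging (LIKE {target});",
        f"COPY staging FROM '{s3_file}' IAM_ROLE '{role_string}' "
        "IGNOREHEADER 1 DELIMITER ',' CSV;",
        # Remove existing rows that a staged row with the same ID supersedes.
        f'DELETE FROM {target} USING staging '
        f'WHERE {target}."ID" = staging."ID" '
        f'AND {target}."Date" <= staging."Date";',
        # INSERT INTO ... SELECT replaces ALTER TABLE APPEND for a temp table.
        f"INSERT INTO {target} SELECT * FROM staging;",
        "DROP TABLE staging;",
    ]


# Placeholder values for illustration only.
statements = staging_merge_statements(
    "s3://example-bucket/reddit.csv",
    "arn:aws:iam::123456789012:role/example-role",
)
for statement in statements:
    print(statement)
```

Each statement would then be passed, in order, to whatever database client the daily job already uses for its COPY.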

