PostgreSQL Table Comparison
I have a table:
CREATE TABLE my_schema.my_data
(
id character varying COLLATE pg_catalog."default" NOT NULL,
name character varying COLLATE pg_catalog."default" NOT NULL,
length numeric(6,4),
width numeric(6,4),
rp numeric(4,2),
CONSTRAINT id_pkey PRIMARY KEY (id)
);
and a temp table:
CREATE TEMPORARY TABLE new_data (LIKE my_schema.my_data);
The temporary table is then filled with a more recent version of the data set that exists in the my_data table.
I am attempting to identify the records in the temporary table that have the same primary key as an existing record in the my_data table but have at least one other value that is different.
My current method is to run a query similar to this example:
SELECT temp.id
FROM (SELECT * FROM my_schema.my_data WHERE my_data.id IN ('X2025','X8716','X4091','X2443','X8922','X5929','X3016','X3036','X4829','X9578')) AS orig
RIGHT JOIN (SELECT * FROM pg_temp.new_data WHERE new_data.id IN ('X2025','X8716','X4091','X2443','X8922','X5929','X3016','X3036','X4829','X9578')) AS temp
ON (orig.id = temp.id OR (orig.id IS NULL AND temp.id IS NULL))
AND (orig.name = temp.name OR (orig.name IS NULL AND temp.name IS NULL))
AND (orig.length = temp.length OR (orig.length IS NULL and temp.length IS NULL))
AND (orig.width = temp.width OR (orig.width IS NULL and temp.width IS NULL))
AND (orig.rp = temp.rp OR (orig.rp IS NULL and temp.rp IS NULL))
WHERE orig.id IS NULL;
This seems pretty inefficient, and I am not seeing very good response times on larger tables where there are more columns and I am iterating through batches of about 10,000 records.
Any suggestions for identifying the records that are different in a more efficient manner?
UPDATE:
I have a dataset that is pulled fresh on a regular basis. Unfortunately, I get the full data set each time instead of only new or updated records. (I am working on fixing this process in the future.) For the time being I just want to update my table to match the latest data pull each day. I worked through a process to handle these comparisons and updates but it was super slow. My database table contains import_date and modified_date columns that are currently being filled using triggers. Via the triggers, every INSERT statement uses the current_date as both the import_date and modified_date for those records. Additionally, the modified_date is set to current_date via a BEFORE UPDATE trigger. As such, I only want to update records that actually experienced a data change with the most recent data pull. Otherwise, the modified_date column becomes pretty useless, as I won't be able to determine when the values for that record most recently changed.
Current Table: ORIG
(actual table contains about 1 million records)
| import_date | modified_date | id | name | length  | width  | rp   |
| 2018-08-17  | 2018-08-17    | 87 | Blue | 12.0200 | 8.0503 | 1.82 |
| 2018-08-17  | 2018-08-17    | 88 | Red  | 11.0870 | 2.0923 | 1.72 |
| 2018-08-17  | 2018-08-17    | 89 | Pink | 15.0870 | 7.9963 | 0.95 |
Temporary Table: TEMP
(Also contains about 1 million records. Will contain all of the primary keys (id column) that exist in the current table but may also contain new primary keys.)
| import_date | modified_date | id | name | length  | width  | rp   |
| NULL        | NULL          | 87 | Teal | 12.0200 | 8.0503 | 1.82 |
| NULL        | NULL          | 88 | Red  | 11.0870 | 2.0923 | 1.72 |
| NULL        | NULL          | 89 | Pink | 15.0870 | 7.9963 | 0.95 |
Using the example data above, I would expect only the first record, id 87, to be updated. After which my table would look like:
| import_date | modified_date | id | name | length  | width  | rp   |
| 2018-08-17  | 2018-09-12    | 87 | Teal | 12.0200 | 8.0503 | 1.82 |
| 2018-08-17  | 2018-08-17    | 88 | Red  | 11.0870 | 2.0923 | 1.72 |
| 2018-08-17  | 2018-08-17    | 89 | Pink | 15.0870 | 7.9963 | 0.95 |
WHAT WORKED FOR ME: I updated my modified_date trigger function to identify when a new modified date is needed:
CREATE FUNCTION my_schema.update_mod_date()
RETURNS trigger
LANGUAGE 'plpgsql'
COST 100
VOLATILE NOT LEAKPROOF
AS $BODY$
DECLARE
BEGIN
IF tg_op = 'INSERT' THEN
NEW.modified_date := current_date;
ELSIF tg_op = 'UPDATE' THEN
IF NEW.name IS DISTINCT FROM OLD.name
OR NEW.length IS DISTINCT FROM OLD.length
OR NEW.width IS DISTINCT FROM OLD.width
OR NEW.rp IS DISTINCT FROM OLD.rp THEN
NEW.modified_date := current_date;
ELSE
NEW.modified_date := OLD.modified_date;
END IF;
END IF;
RETURN NEW;
END;
$BODY$;
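The post does not show how this function is attached to the table. A minimal sketch of the missing statement, assuming a row-level trigger that fires before both INSERT and UPDATE (the function branches on tg_op for exactly those two cases); the trigger name my_data_update_mod_date is hypothetical:

```sql
-- Hypothetical trigger name; the function is the one defined above.
-- BEFORE ... FOR EACH ROW is required so that assignments to NEW take effect.
CREATE TRIGGER my_data_update_mod_date
    BEFORE INSERT OR UPDATE ON my_schema.my_data
    FOR EACH ROW
    EXECUTE PROCEDURE my_schema.update_mod_date();
```

EXECUTE PROCEDURE is the older spelling and is accepted by current releases as well; PostgreSQL 11 and later also accept EXECUTE FUNCTION.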
Then I was able to use the original solution proposed by @EvanCarroll:
BEGIN;
INSERT INTO my_schema.my_data (SELECT * FROM pg_temp.new_data)
ON CONFLICT (id) DO UPDATE SET modified_date=NULL, id=EXCLUDED.id,
name=EXCLUDED.name, length=EXCLUDED.length, width=EXCLUDED.width,
rp=EXCLUDED.rp;
COMMIT;
This ensured that the modified_date only gets changed if one of the other values in the row changed.
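A possible variant, offered only as a sketch and not as what was actually run here: ON CONFLICT DO UPDATE accepts a WHERE clause, so unchanged rows can be skipped instead of being rewritten with identical values. The alias t and the explicit column list are assumptions for illustration, and the sketch assumes import_date and modified_date are still filled in by the triggers described above.

```sql
-- Sketch: upsert that only touches rows whose data columns actually differ.
-- "t" is a hypothetical alias for the target table.
INSERT INTO my_schema.my_data AS t (id, name, length, width, rp)
SELECT id, name, length, width, rp
FROM pg_temp.new_data
ON CONFLICT (id) DO UPDATE
SET name   = EXCLUDED.name,
    length = EXCLUDED.length,
    width  = EXCLUDED.width,
    rp     = EXCLUDED.rp
-- NULL-safe comparison of the existing row against the incoming row.
WHERE ROW(t.name, t.length, t.width, t.rp)
      IS DISTINCT FROM
      ROW(EXCLUDED.name, EXCLUDED.length, EXCLUDED.width, EXCLUDED.rp);
```

Conflicting rows that fail the WHERE condition are locked but not updated, so a BEFORE UPDATE trigger does not fire for them and their modified_date is left alone.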
How about joining on the PK, but then only selecting records where the rest of the record is somehow different, like so:
SELECT
new_data.*
FROM
my_data
INNER JOIN
new_data
ON (my_data.id = new_data.id) -- Same PK
AND (ROW(my_data.*) IS DISTINCT FROM ROW(new_data.*)) -- Any difference in other fields
This will return records from the new_data table with an id that matches a record in my_data, but where at least one other field does not match.
Documentation: https://www.postgresql.org/docs/current/static/functions-comparisons.html#ROW-WISE-COMPARISON
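One caveat for the tables described in the question's update: new_data was created with LIKE, so it also carries import_date and modified_date, and those are NULL in the temporary table. A whole-row comparison would then report every row as different. A sketch that compares only the data columns, with the aliases o and n chosen purely for illustration:

```sql
-- Sketch: compare only the data columns, ignoring the audit columns
-- (import_date, modified_date) that are NULL in the temporary table.
SELECT n.*
FROM my_schema.my_data AS o
JOIN pg_temp.new_data AS n USING (id)   -- same PK
WHERE ROW(o.name, o.length, o.width, o.rp)
      IS DISTINCT FROM
      ROW(n.name, n.length, n.width, n.rp);  -- NULL-safe difference test
```

With the sample data from the question, this should return only the row with id 87, whose name changed from Blue to Teal.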
@EvanCarroll Yes, the end goal is to update the original table using the new dataset. – Nathan Scheiderer 41 mins ago
Then you don't want to do this. You want to instead use INSERT ... ON CONFLICT DO UPDATE. That's how you upsert in PostgreSQL.
If you have a column like modified_time that you only want updated when the row is updated, have it handled with a trigger. Like this. Then you would just write the below like this,
INSERT INTO foo
SELECT *
FROM bar
WHERE NOT EXISTS (
SELECT 1
FROM foo
WHERE foo.x = bar.x
AND NOT foo.whatever = bar.whatever
);
Now it won't accept updates on the row unless whatever is different for each x. Ideally, though, you wouldn't do that. If the rows must be unique by whatever, I would add that to the index.
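As a rough illustration of that last suggestion, assuming foo and bar are the placeholder tables from the query above and that x and whatever are their only relevant columns (the index name is hypothetical):

```sql
-- Sketch: enforce uniqueness on (x, whatever) and let the index reject duplicates.
CREATE UNIQUE INDEX foo_x_whatever_idx ON foo (x, whatever);

-- Rows from bar that already exist in foo with the same (x, whatever) are skipped.
INSERT INTO foo (x, whatever)
SELECT x, whatever
FROM bar
ON CONFLICT (x, whatever) DO NOTHING;
```

ON CONFLICT needs a unique index or constraint matching its conflict target, which is why the index is created first.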