PostgreSQL Table Comparison

I have a table:

CREATE TABLE my_schema.my_data
(
    id character varying COLLATE pg_catalog."default" NOT NULL,
    name character varying COLLATE pg_catalog."default" NOT NULL,
    length numeric(6,4),
    width numeric(6,4),
    rp numeric(4,2),
    CONSTRAINT id_pkey PRIMARY KEY (id)
);

and a temp table:

CREATE TEMPORARY TABLE new_data (LIKE my_schema.my_data);

The temporary table is then being filled with a more recent version of the data set that exists in the my_data table.

I am attempting to identify the records in the temporary table that have the same primary key as an existing record in the my_data table but have at least one other value that is different.

My current method is to run a query similar to this example:

SELECT temp.id 
FROM (SELECT * FROM my_schema.my_data WHERE my_data.id IN ('X2025','X8716','X4091','X2443','X8922','X5929','X3016','X3036','X4829','X9578')) AS orig 
RIGHT JOIN (SELECT * FROM pg_temp.new_data WHERE new_data.id IN ('X2025','X8716','X4091','X2443','X8922','X5929','X3016','X3036','X4829','X9578')) AS temp 
ON (orig.id = temp.id OR (orig.id IS NULL AND temp.id IS NULL))
AND (orig.name = temp.name OR (orig.name IS NULL AND temp.name IS NULL))
AND (orig.length = temp.length OR (orig.length IS NULL and temp.length IS NULL))
AND (orig.width = temp.width OR (orig.width IS NULL and temp.width IS NULL))
AND (orig.rp = temp.rp OR (orig.rp IS NULL and temp.rp IS NULL)) 
WHERE orig.id IS NULL;

This seems pretty inefficient, and response times are poor on larger tables where there are more columns and I am iterating through batches of about 10,000 records.

Any suggestions for identifying the records that are different in a more efficient manner?

UPDATE:

I have a dataset that is pulled fresh on a regular basis. Unfortunately, I get the full data set each time instead of only new or updated records. (I am working on fixing this process in the future.) For the time being I just want to update my table to match the latest data pull each day. I worked through a process to handle these comparisons and updates, but it was very slow.

My database table contains import_date and modified_date columns that are currently filled by triggers. Via the triggers, every INSERT statement uses current_date as both the import_date and modified_date for those records. Additionally, modified_date is set to current_date by a BEFORE UPDATE trigger. As such, I only want to update records whose data actually changed in the most recent data pull. Otherwise the modified_date column becomes pretty useless, as I won't be able to determine when the values for a record most recently changed.

Current Table: ORIG

(actual table contains about 1 million records)

| import_date | modified_date | id | name | length  | width  | rp   |
|-------------|---------------|----|------|---------|--------|------|
| 2018-08-17  | 2018-08-17    | 87 | Blue | 12.0200 | 8.0503 | 1.82 |
| 2018-08-17  | 2018-08-17    | 88 | Red  | 11.0870 | 2.0923 | 1.72 |
| 2018-08-17  | 2018-08-17    | 89 | Pink | 15.0870 | 7.9963 | 0.95 |

Temporary Table: TEMP

(Also contains about 1 million records. Will contain all of the primary keys (id column) that exist in the current table but may also contain new primary keys.)

| import_date | modified_date | id | name | length  | width  | rp   |
|-------------|---------------|----|------|---------|--------|------|
| NULL        | NULL          | 87 | Teal | 12.0200 | 8.0503 | 1.82 |
| NULL        | NULL          | 88 | Red  | 11.0870 | 2.0923 | 1.72 |
| NULL        | NULL          | 89 | Pink | 15.0870 | 7.9963 | 0.95 |

Using the example data above, I would expect only the first record, id 87, to be updated. After which my table would look like:

| import_date | modified_date | id | name | length  | width  | rp   |
|-------------|---------------|----|------|---------|--------|------|
| 2018-08-17  | 2018-09-12    | 87 | Teal | 12.0200 | 8.0503 | 1.82 |
| 2018-08-17  | 2018-08-17    | 88 | Red  | 11.0870 | 2.0923 | 1.72 |
| 2018-08-17  | 2018-08-17    | 89 | Pink | 15.0870 | 7.9963 | 0.95 |

WHAT WORKED FOR ME: I updated my modified_date trigger function to identify when a new modified date is needed:

CREATE FUNCTION my_schema.update_mod_date()
    RETURNS trigger
    LANGUAGE 'plpgsql'
    COST 100
    VOLATILE NOT LEAKPROOF 
AS $BODY$
DECLARE
BEGIN
    IF tg_op = 'INSERT' THEN
        NEW.modified_date := current_date;
    ELSIF tg_op = 'UPDATE' THEN 
        IF NEW.name IS DISTINCT FROM OLD.name
        OR NEW.length IS DISTINCT FROM OLD.length
        OR NEW.width IS DISTINCT FROM OLD.width
        OR NEW.rp IS DISTINCT FROM OLD.rp THEN
            NEW.modified_date := current_date;
        ELSE
            NEW.modified_date := OLD.modified_date;
        END IF;
    END IF;
    RETURN NEW;
END;
$BODY$;
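
The function above only takes effect once it is attached to the table as a row-level BEFORE trigger. The original post does not show that statement, so the following is a minimal sketch: the trigger name set_modified_date is hypothetical, and it assumes the function should fire on both INSERT and UPDATE, as its tg_op branches suggest.

```sql
-- Hypothetical trigger name; the original post does not show this statement.
-- BEFORE ... FOR EACH ROW is required so that changes to NEW are stored.
CREATE TRIGGER set_modified_date
    BEFORE INSERT OR UPDATE ON my_schema.my_data
    FOR EACH ROW
    EXECUTE PROCEDURE my_schema.update_mod_date();
```

EXECUTE PROCEDURE is used here because it is accepted by older and newer PostgreSQL versions alike; from version 11 onward, EXECUTE FUNCTION is the preferred spelling.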

Then I was able to use the original solution proposed by @EvanCarroll:

BEGIN;
INSERT INTO my_schema.my_data (SELECT * FROM pg_temp.new_data) 
ON CONFLICT (id) DO UPDATE SET modified_date=NULL, id=EXCLUDED.id,
name=EXCLUDED.name, length=EXCLUDED.length, width=EXCLUDED.width,
rp=EXCLUDED.rp;
COMMIT;

This ensured that the modified_date only gets changed if one of the other values in the row changed.

How about joining on the PK, but then only selecting records where the rest of the record is different in some way, like so:

SELECT
    new_data.*
FROM
    my_data
INNER JOIN
    new_data
    ON  (my_data.id = new_data.id) -- Same PK
    AND (ROW(my_data.*) IS DISTINCT FROM ROW(new_data.*)) -- Any difference in other fields

This will return the records from the new_data table whose id matches a record in my_data, but where at least one other field does not match.

Documentation: https://www.postgresql.org/docs/current/static/functions-comparisons.html#ROW-WISE-COMPARISON
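
One caveat, given the import_date and modified_date columns described in the question's update: comparing whole rows also compares those two columns, and since they are NULL in the temporary table, every matched row would probably be reported as different. A possible variant, sketched here under the assumption that only name, length, width and rp count as data columns, compares an explicit column list instead:

```sql
-- Sketch: compare only the data columns, ignoring import_date / modified_date.
-- The aliases o (original) and n (new) are illustrative.
SELECT n.*
FROM my_schema.my_data AS o
INNER JOIN pg_temp.new_data AS n
    ON o.id = n.id                      -- same PK
WHERE (o.name, o.length, o.width, o.rp)
      IS DISTINCT FROM                  -- NULL-safe comparison
      (n.name, n.length, n.width, n.rp);
```

IS DISTINCT FROM treats two NULLs as equal, so it replaces the longer "a = b OR (a IS NULL AND b IS NULL)" pattern from the question.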

@EvanCarroll Yes, the end goal is to update the original table using the new dataset. – Nathan Scheiderer 41 mins ago

Then you don't want to do this. You want to use INSERT ... ON CONFLICT DO UPDATE instead. That's how you upsert in PostgreSQL.
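
As a rough illustration of what such an upsert could look like for the tables in the question: the sketch below assumes the data columns are id, name, length, width and rp, and uses the optional WHERE clause of DO UPDATE so that rows whose values are unchanged should be left alone. The alias t is illustrative.

```sql
-- Sketch: insert new ids, update existing ids only when a value differs.
INSERT INTO my_schema.my_data AS t (id, name, length, width, rp)
SELECT id, name, length, width, rp
FROM pg_temp.new_data
ON CONFLICT (id) DO UPDATE
SET name   = EXCLUDED.name,
    length = EXCLUDED.length,
    width  = EXCLUDED.width,
    rp     = EXCLUDED.rp
-- Skip the update entirely when nothing changed (NULL-safe comparison).
WHERE (t.name, t.length, t.width, t.rp)
      IS DISTINCT FROM
      (EXCLUDED.name, EXCLUDED.length, EXCLUDED.width, EXCLUDED.rp);
```

Listing the columns explicitly, rather than using SELECT *, keeps the statement independent of column order and leaves any date columns to the triggers.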

Update

If you have a column like modified_time that you only want updated when the row is updated, have it handled with a trigger. Then you would just write the statement like this:

INSERT INTO foo
SELECT *
FROM bar
WHERE NOT EXISTS (
  SELECT 1
  FROM foo
  WHERE foo.x = bar.x
    AND NOT foo.whatever = bar.whatever
);

Now it won't accept updates on the row unless whatever is different for each x. Ideally, though, you wouldn't do that. If the rows must be unique by whatever, I would add that column to the index.
