PostgreSQL Table Comparison

Question

I have a table:

CREATE TABLE my_schema.my_data
(
    id character varying COLLATE pg_catalog."default" NOT NULL,
    name character varying COLLATE pg_catalog."default" NOT NULL,
    length numeric(6,4),
    width numeric(6,4),
    rp numeric(4,2),
    CONSTRAINT id_pkey PRIMARY KEY (id)
);

and a temp table:

CREATE TEMPORARY TABLE new_data (LIKE my_schema.my_data);

The temporary table is then being filled with a more recent version of the data set that exists in the my_data table.

I am attempting to identify the records in the temporary table that have the same primary key as an existing record in the my_data table but have at least one other value that is different.

My current method is to run a query similar to this example:

SELECT temp.id 
FROM (SELECT * FROM my_schema.my_data WHERE my_data.id IN ('X2025','X8716','X4091','X2443','X8922','X5929','X3016','X3036','X4829','X9578')) AS orig 
RIGHT JOIN (SELECT * FROM pg_temp.new_data WHERE new_data.id IN ('X2025','X8716','X4091','X2443','X8922','X5929','X3016','X3036','X4829','X9578')) AS temp 
ON (orig.id = temp.id OR (orig.id IS NULL AND temp.id IS NULL))
AND (orig.name = temp.name OR (orig.name IS NULL AND temp.name IS NULL))
AND (orig.length = temp.length OR (orig.length IS NULL and temp.length IS NULL))
AND (orig.width = temp.width OR (orig.width IS NULL and temp.width IS NULL))
AND (orig.rp = temp.rp OR (orig.rp IS NULL and temp.rp IS NULL)) 
WHERE orig.id IS NULL;

This seems pretty inefficient, and I am not seeing very good response times on larger tables where there are more columns and I am iterating through batches of about 10,000 records.

Any suggestions for identifying the records that are different in a more efficient manner?

UPDATE:

I have a dataset that is pulled fresh on a regular basis. Unfortunately, I get the full data set each time instead of only new or updated records. (I am working on fixing this process in the future.) For the time being, I just want to update my table to match the latest data pull each day. I worked through a process to handle these comparisons and updates, but it was super slow. My database table contains import_date and modified_date columns that are currently being filled using triggers. Via the triggers, every INSERT statement uses current_date as both the import_date and the modified_date for those records. Additionally, the modified_date is set to current_date via a BEFORE UPDATE trigger. As such, I only want to update records that actually experienced a data change in the most recent data pull. Otherwise, the modified_date column becomes pretty useless, as I won't be able to determine when the values for that record most recently changed.

Current Table: ORIG

(actual table contains about 1 million records)

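| import_date | modified_date | id | name | length | width | rp |
|---|---|---|---|---|---|---|
| 2018-08-17 | 2018-08-17 | 87 | Blue | 12.0200 | 8.0503 | 1.82 |
| 2018-08-17 | 2018-08-17 | 88 | Red | 11.0870 | 2.0923 | 1.72 |
| 2018-08-17 | 2018-08-17 | 89 | Pink | 15.0870 | 7.9963 | 0.95 |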

Temporary Table: TEMP

(Also contains about 1 million records. Will contain all of the primary keys (id column) that exist in the current table but may also contain new primary keys.)

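| import_date | modified_date | id | name | length | width | rp |
|---|---|---|---|---|---|---|
| NULL | NULL | 87 | Teal | 12.0200 | 8.0503 | 1.82 |
| NULL | NULL | 88 | Red | 11.0870 | 2.0923 | 1.72 |
| NULL | NULL | 89 | Pink | 15.0870 | 7.9963 | 0.95 |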

Using the example data above, I would expect only the first record, id 87, to be updated. After which my table would look like:

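| import_date | modified_date | id | name | length | width | rp |
|---|---|---|---|---|---|---|
| 2018-08-17 | 2018-09-12 | 87 | Teal | 12.0200 | 8.0503 | 1.82 |
| 2018-08-17 | 2018-08-17 | 88 | Red | 11.0870 | 2.0923 | 1.72 |
| 2018-08-17 | 2018-08-17 | 89 | Pink | 15.0870 | 7.9963 | 0.95 |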

WHAT WORKED FOR ME: I updated my modified_date trigger function to identify when a new modified date is needed:

CREATE FUNCTION my_schema.update_mod_date()
    RETURNS trigger
    LANGUAGE 'plpgsql'
    COST 100
    VOLATILE NOT LEAKPROOF 
AS $BODY$
DECLARE
BEGIN
    IF tg_op = 'INSERT' THEN
        NEW.modified_date := current_date;
    ELSIF tg_op = 'UPDATE' THEN 
        IF NEW.name IS DISTINCT FROM OLD.name
        OR NEW.length IS DISTINCT FROM OLD.length
        OR NEW.width IS DISTINCT FROM OLD.width
        OR NEW.rp IS DISTINCT FROM OLD.rp THEN
            NEW.modified_date := current_date;
        ELSE
            NEW.modified_date := OLD.modified_date;
        END IF;
    END IF;
    RETURN NEW;
END;
$BODY$;
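
The function only takes effect once it is attached to the table. A minimal sketch of that step, assuming a single row-level trigger covers both INSERT and UPDATE; the trigger name my_data_mod_date is hypothetical and not taken from the original setup:

```sql
-- Hypothetical trigger name; attaches the function above to the target table.
-- BEFORE ... FOR EACH ROW lets the function change NEW before the row is stored.
CREATE TRIGGER my_data_mod_date
    BEFORE INSERT OR UPDATE ON my_schema.my_data
    FOR EACH ROW
    EXECUTE PROCEDURE my_schema.update_mod_date();
```

On PostgreSQL 11 and later, EXECUTE FUNCTION is the preferred spelling; EXECUTE PROCEDURE is still accepted.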

Then I was able to use the original solution proposed by @EvanCarroll:

BEGIN;
INSERT INTO my_schema.my_data (SELECT * FROM pg_temp.new_data) 
ON CONFLICT (id) DO UPDATE SET modified_date=NULL, id=EXCLUDED.id,
name=EXCLUDED.name, length=EXCLUDED.length, width=EXCLUDED.width,
rp=EXCLUDED.rp;
COMMIT;

This ensured that the modified_date only gets changed if one of the other values in the row changed.

Answer 1

How about joining on the PK, but then only selecting records where the rest of the record differs in some way, like so:

SELECT
    new_data.*
FROM
    my_data
INNER JOIN
    new_data
    ON  (my_data.id = new_data.id) -- Same PK
    AND (ROW(my_data.*) IS DISTINCT FROM ROW(new_data.*)); -- Any difference in other fields

This will return records from the new_data table whose id matches a record in my_data, but where one or more of the other fields do not match.

Documentation: https://www.postgresql.org/docs/current/static/functions-comparisons.html#ROW-WISE-COMPARISON
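
To illustrate why IS DISTINCT FROM is used here rather than <>, the following standalone query (a sketch with made-up literal values, not data from the question) compares how the two treat NULLs:

```sql
-- IS DISTINCT FROM treats NULL as an ordinary comparable value,
-- whereas <> yields NULL as soon as either side is NULL.
SELECT
    (1 <> NULL::int)                         AS plain_inequality,  -- NULL
    (1 IS DISTINCT FROM NULL::int)           AS one_vs_null,       -- true
    (NULL::int IS DISTINCT FROM NULL::int)   AS null_vs_null,      -- false
    (ROW(1, NULL::int) IS DISTINCT FROM ROW(1, NULL::int))
                                             AS identical_rows;    -- false
```

This matters for the join above because columns such as length, width, and rp are nullable: a row that is NULL in both tables should count as unchanged, not as a difference.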

Answer 2

@EvanCarroll Yes, the end goal is to update the original table using the new dataset. – Nathan Scheiderer 41 mins ago

Then you don't want to do this. You want to instead use INSERT ... ON CONFLICT DO UPDATE. That's how you upsert in PostgreSQL.
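
As a rough sketch of what such an upsert could look like for the tables in the question: the column list is taken from the CREATE TABLE shown there, the alias t is only illustrative, and any date columns are assumed to be filled by triggers or defaults. The optional WHERE clause on DO UPDATE skips conflicting rows whose values have not changed, so no update (and no BEFORE UPDATE trigger) happens for them:

```sql
-- Insert new ids; for existing ids, update only when some value differs.
INSERT INTO my_schema.my_data AS t (id, name, length, width, rp)
SELECT id, name, length, width, rp
FROM pg_temp.new_data
ON CONFLICT (id) DO UPDATE
SET name   = EXCLUDED.name,
    length = EXCLUDED.length,
    width  = EXCLUDED.width,
    rp     = EXCLUDED.rp
-- NULL-safe comparison of the existing row (t) with the incoming row (EXCLUDED).
WHERE (t.name, t.length, t.width, t.rp)
      IS DISTINCT FROM
      (EXCLUDED.name, EXCLUDED.length, EXCLUDED.width, EXCLUDED.rp);
```

Whether this is faster than updating every row and letting a trigger sort it out depends on the data; it would need to be measured on the real tables.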

Update

If you have a column such as modified_time that should change only when the row is actually updated, have a trigger handle it. Then you would write the statement like this:

INSERT INTO foo
SELECT *
FROM bar
WHERE NOT EXISTS (
  SELECT 1
  FROM foo
  WHERE foo.x = bar.x
    AND NOT foo.whatever = bar.whatever
);

Now it won't accept updates on the row unless whatever is different for each x. Ideally, though, you wouldn't do that. If the rows must be unique by whatever, I would add that to the index.

2 answers

solution1
0 2018-09-11 23:01:51

solution2
0 ACCEPTED 2018-09-12 17:42:02
