
Insert record into PostgreSQL table and expire old record

I have a set of data that gets updated periodically by a client. Once a month or so we will download a new set of this data. The dataset is about 50k records with a couple hundred columns of data.

I am trying to create a database that houses all of this data so we can run our own analysis on it. I'm using PostgreSQL and Python (psycopg2).
Occasionally, the client will add columns to the dataset, so there are a number of steps I want to take:

  1. Add new records to the database table
  2. Compare the old set of data with the new set of data and update the table where necessary
  3. Keep the old records, and either add an "expired" flag or a "db_expire_date" column to keep track of whether a record is active or expired
  4. Add any new columns of data to the database for all records

I know how to add new records to the database (1) using INSERT INTO, and how to add new columns of data to the database (4) using ALTER TABLE. But I'm having issues with (2) and (3). I figured out how to update a record using the following code:

# update_records maps column names to lists of values; zip(*...)
# turns those columns into row tuples for executemany().
rows = zip(*[update_records[col] for col in update_records])
cursor = conn.cursor()
# Stage the incoming rows in a temp table that is dropped on commit.
# schema_list, var, and perc_s (built elsewhere) hold the column
# definitions, the column names, and the "%s, %s, ..." placeholders.
cursor.execute("CREATE TEMP TABLE temptable (" + schema_list + ") ON COMMIT DROP")
cursor.executemany("INSERT INTO temptable (" + var + ") VALUES (" + perc_s + ")", rows)
# Update matching rows in the target table from the staged data.
cursor.execute("""
    UPDATE tracking.test_table
    SET mfg = temptable.mfg, db_updt_dt = CURRENT_TIMESTAMP
    FROM temptable
    WHERE temptable.app_id = tracking.test_table.app_id;
    """)
print(cursor.rowcount)  # number of rows updated
conn.commit()
cursor.close()
conn.close()

However, this just updates the existing record in place, matching on app_id as the primary key.
What I'd like to figure out is how to keep the original record, set it as "expired", and then create a new, updated record. It seems that "app_id" shouldn't be my primary key, so I've created a new primary key as '"primary_key" INT GENERATED ALWAYS AS IDENTITY not null,'.

I'm just not sure where to go from here. I think I could probably just use INSERT INTO to send the new records to the database, but I'm not sure how to "expire" the old records that way. Possibly I could use UPDATE to set the older records to "expired". But I am wondering if there is a more straightforward way to do this.

I hope my question is clear. I'm hoping someone can point me in the right direction. Thanks

You're looking for the concept of a "natural key", which is how you would identify a unique row, regardless of what the explicit logical constraints on the table are.

This means you're spot on that you need to change your primary key to be more inclusive. Your new identity-based primary key doesn't actually help you find the row you are looking for once both the old and new versions are in the table, unless you already happen to know the generated value for the row you want.

I can think of two likely candidates to add to your natural key: date, or batch.

Either way, you would look for "App = X, [Date|batch] = Y" in the data to find that one. Batch would be upload 1, upload 2, etc. You just make it up, or derive it from the date, or something along those lines.

If you aren't sure which to add, and you aren't ever going to upload multiple times in one day, I would go with Date. That will give you more visibility over time, as you can see when and how often things change.

Once you have a natural key, you want to make it explicit in your data. You can either keep your identity column (see: Surrogate Key) or you can have a compound primary key. With no other input or constraints, I would go with a compound primary key for your situation.

I'm a MySQL DBA, so I'm cribbing a bit from the docs here: https://www.postgresqltutorial.com/postgresql-primary-key/

You do NOT want this:

CREATE TABLE test_table (
    app_id INTEGER PRIMARY KEY,
    date DATE,
    active BOOLEAN
);

Instead, you want this:

CREATE TABLE test_table (
    app_id INTEGER,
    date DATE,
    active BOOLEAN,
    PRIMARY KEY (app_id, date)
);

I've added an active column here as well, since you wanted to deactivate rows. This isn't strictly necessary from what you've described, though - you can always assume the most recent upload is active. Or you can expand the columns to an "active_start" date and an "active_end" date, which enables another set of queries. But for what you've stated here so far, just the date column should suffice. :)
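If you do go the date-range route, here is a minimal sketch of that variant in psycopg2 (the active_start/active_end names follow the suggestion above; the connection string and sample values are made up):

import psycopg2
from datetime import date

conn = psycopg2.connect("dbname=tracking")  # illustrative connection
cur = conn.cursor()

# Date-range variant: a row is "active" between active_start and active_end.
cur.execute("""
    CREATE TABLE test_table (
        app_id INTEGER,
        active_start DATE,
        active_end DATE,
        PRIMARY KEY (app_id, active_start)
    );
""")

# "Which version of app 123 was active on 2020-06-15?"
cur.execute("""
    SELECT * FROM test_table
    WHERE app_id = %s AND %s BETWEEN active_start AND active_end;
""", (123, date(2020, 6, 15)))
print(cur.fetchall())

conn.commit()
cur.close()
conn.close()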

A pretty standard data warehousing technique is to define two additional date fields, a from-effective-date and a to-effective-date. You only append rows, never update. You add the candidate record if the source primary key does not exist in your table OR if any column value is different from the most recently added prior record in your table with the same primary key. (Each record supersedes the last).

As you add your record to the table, you do three things (see the sketch after this list):

  1. The New record's from-effective-date gets the transaction file's date
  2. The New record's to-effective-date gets a date WAY in the future, like 9999-12-31. The important thing here is that it will not expire until you say so.
  3. The most recent prior record (the one whose values you compared for changes) has its to-effective-date updated to the transaction file's date minus one day. This has the effect of expiring the old record.
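Concretely, here is a rough psycopg2 sketch of that per-record flow. The app_id key, the from_eff_date/to_eff_date columns, and the single mfg data column are all illustrative names, not your actual schema, and only one column is compared for brevity; a real version would compare every data column (see the note below about new columns).

from datetime import date, timedelta

FAR_FUTURE = date(9999, 12, 31)  # "will not expire until you say so"

def add_versioned_record(cur, rec, file_date):
    """Append rec as a new version, expiring the prior one if it changed.
    rec is a dict like {'app_id': 123, 'mfg': 'Acme'}."""
    # Find the current (unexpired) version of this key, if any.
    cur.execute("""
        SELECT mfg FROM tracking.test_table
        WHERE app_id = %s AND to_eff_date = %s
    """, (rec['app_id'], FAR_FUTURE))
    current = cur.fetchone()

    # Skip the record if nothing changed since the last version.
    if current is not None and current[0] == rec['mfg']:
        return

    if current is not None:
        # Step 3: expire the prior version the day before the new file's date.
        cur.execute("""
            UPDATE tracking.test_table
            SET to_eff_date = %s
            WHERE app_id = %s AND to_eff_date = %s
        """, (file_date - timedelta(days=1), rec['app_id'], FAR_FUTURE))

    # Steps 1 and 2: insert the new version, open-ended into the future.
    cur.execute("""
        INSERT INTO tracking.test_table (app_id, mfg, from_eff_date, to_eff_date)
        VALUES (%s, %s, %s, %s)
    """, (rec['app_id'], rec['mfg'], file_date, FAR_FUTURE))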

This creates a chain of records with the same source primary key, each covering a non-overlapping time period. This format is surprisingly easy to select from (query sketches follow this list):

  • If you want to reproduce the most current transaction file, you select Where to-effective-date > Current Date
  • If you want to reproduce the transaction file at any date for a report, you select Where myreportdate Between from-effective-date And to-effective-date
  • If you want the entire update history for a key, you select * Where the key = mykeyvalue Order By from-effective-date
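In psycopg2 terms, reusing the illustrative column names from the sketch above and an open cursor cur, those three selects might look like:

from datetime import date

report_date = date(2020, 6, 15)  # hypothetical report date
my_key = 123                     # hypothetical key value

# 1. Reproduce the most current transaction file:
cur.execute("SELECT * FROM tracking.test_table WHERE to_eff_date > CURRENT_DATE")

# 2. Reproduce the transaction file as it stood on a report date:
cur.execute("""
    SELECT * FROM tracking.test_table
    WHERE %s BETWEEN from_eff_date AND to_eff_date
""", (report_date,))

# 3. Entire update history for one key, oldest first:
cur.execute("""
    SELECT * FROM tracking.test_table
    WHERE app_id = %s
    ORDER BY from_eff_date
""", (my_key,))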

The only ugly thing about this scheme is that when columns are added, the comparison test must also be altered to include the new columns in case something changes. If you want that to be dynamic, you're going to have to loop through the reflection metadata for each column in the table, and Python will need to know how comparing a text field might differ from comparing a BLOB, for example.
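As a hedged sketch of that dynamic comparison (the schema/table names and the bookkeeping columns to skip are assumptions): PostgreSQL's IS DISTINCT FROM operator compares most column types uniformly and treats NULLs sanely, which sidesteps much of the per-type logic:

def build_change_test(cur, schema, table, skip=('from_eff_date', 'to_eff_date')):
    """Build a SQL predicate comparing every data column of the table
    between an existing row alias t and a staged/incoming row alias s."""
    cur.execute("""
        SELECT column_name FROM information_schema.columns
        WHERE table_schema = %s AND table_name = %s
        ORDER BY ordinal_position
    """, (schema, table))
    # Safe enough for a sketch: the names come from information_schema,
    # not from user input.
    cols = [row[0] for row in cur.fetchall() if row[0] not in skip]
    return " OR ".join(f"t.{c} IS DISTINCT FROM s.{c}" for c in cols)

The resulting predicate could then be spliced into an UPDATE or INSERT ... SELECT that joins the target table t against a staging table s, like the temp table in the question.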

If you actually care about having a primary key (many data warehouses do not have primary keys), you can define a compound key on the source primary key + one of those effective dates; it doesn't really matter which one.

For step (2): First, you have to identify the records that contain the same data. For this, you can run a SELECT query with a WHERE clause before inserting any record, and count the number of records you receive as output. If the count is more than 0, don't insert the record; otherwise, insert it.
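A rough psycopg2 sketch of that check-then-insert, with the column list standing in for whatever columns define "same data" in your table:

def insert_if_new(cur, rec):
    """Insert rec only if no row with the same data already exists."""
    cur.execute("""
        SELECT count(*) FROM tracking.test_table
        WHERE app_id = %s AND mfg = %s
    """, (rec['app_id'], rec['mfg']))
    (count,) = cur.fetchone()
    if count == 0:
        cur.execute("""
            INSERT INTO tracking.test_table (app_id, mfg)
            VALUES (%s, %s)
        """, (rec['app_id'], rec['mfg']))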

For step (3): For this, you can add a column, as you mention above, with the name 'db_expire_date', and insert the expiration value at the time of record insertion.

You can also use a column like 'is_expire', but for that you need to add a cron job that periodically updates the value of this column in the DB.
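The job that cron runs could be as simple as the following (again with illustrative connection details and column names):

import psycopg2

# Run periodically, e.g. nightly from cron: flag rows whose
# expiration date has passed.
conn = psycopg2.connect("dbname=tracking")
cur = conn.cursor()
cur.execute("""
    UPDATE tracking.test_table
    SET is_expire = TRUE
    WHERE db_expire_date < CURRENT_DATE AND NOT is_expire
""")
conn.commit()
cur.close()
conn.close()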
