
Update/Insert records into different snapshot tables based on snapshot_date and id from a staging table

I have a huge snapshot table (for example, user_snapshot_all) that is split into smaller tables on Redshift (which is based on Postgres) to improve performance.

The smaller tables are named with a year_month suffix:

user_snapshot_1995_1
user_snapshot_1995_2
user_snapshot_1995_3
user_snapshot_1995_4
....
user_snapshot_2016_11

Each table holds the snapshot records for the year and month in its suffix.

I use a staging table, user_snapshot_staging, to load/update data into these tables incrementally. In 99% of cases, only the latest year_month table is affected.

But there are edge cases. Around midnight at a month boundary, the staging table can hold data that spans two tables (for example, user_snapshot_2016_10 and user_snapshot_2016_11 on 2016-11-1). In another edge case, we may need to update a few 2-year-old snapshots, so the staging table will hold some 2-year-old records along with a lot of today's snapshots.

The question is: how should I design my query or code so that it updates or inserts data into the right year_month snapshot table?

All the snapshot tables and the staging table have at least these two columns:

id
snapshot_date

To clarify further: if there were a single user_snapshot_all table, I could easily update the records by joining the staging table with the master table on snapshot_date and id. But with the smaller tables segmented by year_month, there is no guarantee that all records from the staging table can be found in one snapshot table.

Here are the use cases. Note: the queries below are part of an ETL process, not one-off manual queries, which is why I need an automated solution.

Scenario 1) Suppose the user_snapshot_staging table has:

id  snapshot_date user_detail
100  2016-11-3     jskesljd234
101  2016-11-4     jskesljdfg23
102  2016-11-5     jskesljdbd23
103  2016-11-6     jskesljdw23ds

Since all the snapshots belong to November 2016, all of this data will be inserted/updated into user_snapshot_2016_11 with the following two queries:

Insert new:

INSERT INTO user_info_snapshot_2016_11 (id, snapshot_date, user_detail)
SELECT source.id, source.snapshot_date, source.user_detail
FROM user_info_snapshot_staging source
LEFT OUTER JOIN user_info_snapshot_2016_11 target ON source.id = target.id
WHERE target.id IS NULL;

Update existing:

UPDATE user_info_snapshot_2016_11
SET snapshot_date = source.snapshot_date, user_detail = source.user_detail
FROM user_info_snapshot_staging source
WHERE user_info_snapshot_2016_11.id = source.id;

Scenario 2) Now suppose the user_snapshot_staging table has:

id  snapshot_date user_detail
1300  2015-01-3     jskesljd234
1301  2015-10-4     jskesljdfg23
1302  2016-11-1     jskesljdbd23
1303  2016-11-2     jskesljdw23ds

Now the staging table has snapshots that require updates and inserts in different snapshot tables. We cannot just insert/update into user_snapshot_2016_11; we also need to insert/update into user_snapshot_2015_01 and user_snapshot_2015_10.

How should I design my query, or the code that generates a dynamic query, to handle these cases so that only the appropriate tables are joined with the user_snapshot_staging table, based on the data in the staging table?

Let me know if you need further clarification. Sorry, it is a little tricky to explain.

You can generate the queries with the following approach. I'll give examples in pseudo-code based on Python syntax.

  1. Get the year/month combinations you have in your staging table:

SELECT DISTINCT to_char(snapshot_date, 'YYYY-MM') FROM user_info_snapshot_staging;

  2. These are your query templates:
-- insert_template.sql
INSERT INTO user_info_snapshot_{{ year }}_{{ month }} (id, snapshot_date, user_detail)
SELECT source.id, source.snapshot_date, source.user_detail
FROM user_info_snapshot_staging source
LEFT OUTER JOIN user_info_snapshot_{{ year }}_{{ month }} target ON source.id = target.id
WHERE target.id IS NULL
  AND to_char(source.snapshot_date, 'YYYY-MM') = '{{ year }}-{{ month }}';

-- update_template.sql
UPDATE user_info_snapshot_{{ year }}_{{ month }}
SET snapshot_date = source.snapshot_date, user_detail = source.user_detail
FROM user_info_snapshot_staging source
WHERE user_info_snapshot_{{ year }}_{{ month }}.id = source.id
  AND to_char(source.snapshot_date, 'YYYY-MM') = '{{ year }}-{{ month }}';
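
The loop that follows calls a `template(name, context=...)` helper that the answer leaves undefined. A minimal stand-in might look like the sketch below. It assumes the templates are kept in a dictionary (the `TEMPLATES` name and its contents are hypothetical; loading the .sql files from disk would work equally well) and that year and month should be restricted to digits, since they end up inside a table name and cannot be passed as bound parameters.

```python
import re

# Hypothetical in-memory update template, written for a staging table whose
# date column is snapshot_date; the insert template would be stored the same way.
TEMPLATES = {
    "update_template": """
UPDATE user_info_snapshot_{{ year }}_{{ month }}
SET snapshot_date = source.snapshot_date, user_detail = source.user_detail
FROM user_info_snapshot_staging source
WHERE user_info_snapshot_{{ year }}_{{ month }}.id = source.id
  AND to_char(source.snapshot_date, 'YYYY-MM') = '{{ year }}-{{ month }}';
""".strip(),
}

PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def template(name, context):
    """Render TEMPLATES[name], replacing each {{ key }} with context[key]."""
    for key, value in context.items():
        # Values are spliced into identifiers, so allow plain digits only.
        if not re.fullmatch(r"[0-9]+", str(value)):
            raise ValueError(f"{key} must be numeric, got {value!r}")
    return PLACEHOLDER.sub(lambda m: str(context[m.group(1)]), TEMPLATES[name])


sql = template("update_template", context={"year": "2016", "month": "11"})
print(sql.splitlines()[0])  # prints: UPDATE user_info_snapshot_2016_11
```

A real templating library such as Jinja2 would do the same job; the digit check is the part worth keeping either way.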

Now loop through the year/month pairs and execute those queries:

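# `cursor` is an open DB-API cursor; `template` stands for whatever renders
# the templates above (the answer leaves it unspecified).
cursor.execute("SELECT DISTINCT to_char(snapshot_date, 'YYYY-MM') FROM user_info_snapshot_staging")
# fetch all pairs first, because the same cursor is reused below
for (year_month,) in cursor.fetchall():
    year, month = year_month.split('-')
    # update existing rows first, then insert the new ones
    for name in ('update_template', 'insert_template'):
        # this is where you generate sql
        sql = template(name, context={
            'year': year,
            'month': month,
        })
        # here you execute it
        cursor.execute(sql)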
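
To see the whole loop run end to end without a Redshift cluster, here is a rough sketch that uses SQLite as a stand-in. Several things are assumptions made for the demo, not part of the answer: the function names `sync_month` and `sync_all`, `strftime('%Y-%m', ...)` in place of `to_char`, correlated subqueries in place of `UPDATE ... FROM`, ISO-formatted dates, and monthly tables that already exist.

```python
import sqlite3

STAGING = "user_info_snapshot_staging"
COLUMNS = "(id INTEGER PRIMARY KEY, snapshot_date TEXT, user_detail TEXT)"


def sync_month(conn, year_month):
    """Update, then insert, the staging rows that belong to one month."""
    year, month = year_month.split("-")
    if not (year.isdigit() and month.isdigit()):
        raise ValueError(f"unexpected year-month: {year_month!r}")
    table = f"user_info_snapshot_{year}_{month}"  # digits only, safe to splice
    in_month = "strftime('%Y-%m', s.snapshot_date) = ?"
    # Update rows whose id already exists in the monthly table.
    conn.execute(
        f"""UPDATE {table} SET
              snapshot_date = (SELECT s.snapshot_date FROM {STAGING} s
                               WHERE s.id = {table}.id),
              user_detail = (SELECT s.user_detail FROM {STAGING} s
                             WHERE s.id = {table}.id)
            WHERE id IN (SELECT s.id FROM {STAGING} s WHERE {in_month})""",
        (year_month,),
    )
    # Insert staging rows of this month that the table does not have yet.
    conn.execute(
        f"""INSERT INTO {table} (id, snapshot_date, user_detail)
            SELECT s.id, s.snapshot_date, s.user_detail
            FROM {STAGING} s LEFT JOIN {table} t ON s.id = t.id
            WHERE t.id IS NULL AND {in_month}""",
        (year_month,),
    )
    return table


def sync_all(conn):
    """Run sync_month once per distinct year-month found in staging."""
    months = [row[0] for row in conn.execute(
        f"SELECT DISTINCT strftime('%Y-%m', snapshot_date) FROM {STAGING}")]
    with conn:  # commit everything together, roll back on error
        return [sync_month(conn, ym) for ym in sorted(months)]


# Demo with the rows from Scenario 2 (dates rewritten in ISO format).
conn = sqlite3.connect(":memory:")
conn.execute(f"CREATE TABLE {STAGING} {COLUMNS}")
for suffix in ("2015_01", "2015_10", "2016_11"):
    conn.execute(f"CREATE TABLE user_info_snapshot_{suffix} {COLUMNS}")
# One pre-existing row, so that the update path is exercised too.
conn.execute(
    "INSERT INTO user_info_snapshot_2016_11 VALUES (1302, '2016-10-31', 'stale')")
conn.executemany(f"INSERT INTO {STAGING} VALUES (?, ?, ?)", [
    (1300, "2015-01-03", "jskesljd234"),
    (1301, "2015-10-04", "jskesljdfg23"),
    (1302, "2016-11-01", "jskesljdbd23"),
    (1303, "2016-11-02", "jskesljdw23ds"),
])
touched = sync_all(conn)
rows_2016_11 = conn.execute(
    "SELECT id, snapshot_date, user_detail "
    "FROM user_info_snapshot_2016_11 ORDER BY id").fetchall()
print(touched)
print(rows_2016_11)
```

Only the three tables that actually appear in the staging data are touched, which is the behaviour the question asks for.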

I would advise against using UPDATE if you need to update a lot of records. Further info in this question.
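
The answer does not spell out what to use instead of UPDATE. One pattern commonly used on Redshift is to delete the target rows that have a match in staging and then re-insert them from staging, all inside one transaction. The sketch below only generates those statements for one month; the function name and its parameters are hypothetical, and it assumes, as the queries above do, that id is the match key.

```python
import re


def replace_statements(year, month, staging="user_info_snapshot_staging"):
    """Return SQL that replaces one month's rows instead of updating them."""
    if not (re.fullmatch(r"[0-9]{4}", year) and re.fullmatch(r"[0-9]{1,2}", month)):
        raise ValueError(f"unexpected year/month: {year!r}, {month!r}")
    target = f"user_info_snapshot_{year}_{month}"
    # Restrict both statements to the staging rows of this month.
    in_month = f"to_char(s.snapshot_date, 'YYYY-MM') = '{year}-{int(month):02d}'"
    return [
        "BEGIN;",
        # Drop every target row that staging is about to replace.
        f"DELETE FROM {target} USING {staging} s "
        f"WHERE {target}.id = s.id AND {in_month};",
        # Re-insert the replaced rows together with the brand-new ones.
        f"INSERT INTO {target} (id, snapshot_date, user_detail) "
        f"SELECT s.id, s.snapshot_date, s.user_detail FROM {staging} s "
        f"WHERE {in_month};",
        "COMMIT;",
    ]


for statement in replace_statements("2016", "11"):
    print(statement)
```

Because the insert no longer needs an anti-join against the target, the per-month work shrinks to one DELETE and one INSERT.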
