Update/insert records into different snapshot tables based on snapshot_date and id in the staging table

Question

I have a huge snapshot table (for example user_snapshot_all) that is split into smaller tables on Redshift (which is Postgres-based) for performance.

The smaller tables are named with a year_month suffix:

user_snapshot_1995_1
user_snapshot_1995_2
user_snapshot_1995_3
user_snapshot_1995_4
....
user_snapshot_2016_11

Each one holds the snapshot records for the year and month in its suffix.

I use a staging table, user_snapshot_staging, to load/update data into these tables incrementally. In 99% of cases only the latest year_month table is affected.

But there are edge cases. For example, around midnight at a month boundary the staging table will hold data spanning two tables (user_snapshot_2016_10 and user_snapshot_2016_11 on 2016-11-1). Or we may need to update a few 2-year-old snapshots, so the staging table will contain some 2-year-old records alongside a lot of today's snapshots.

The question is: how should I design my query or code so that it updates or inserts data into the right year_month snapshot table?

All the snapshot tables and the staging table have at least these two columns:

id
snapshot_date

To clarify further: if there were a single user_snapshot_all table, I could easily update the records by joining the staging table with the master table on snapshot_date and id. But with the smaller tables segmented by year_month, there is no guarantee that all records from the staging table can be found in one snapshot table.

Here are the use cases. Note: the queries below are part of an ETL process, not one-off manual queries, which is why I need an automated solution.

Scenario 1) Suppose the user_snapshot_staging table has:

id  snapshot_date user_detail
100  2016-11-3     jskesljd234
101  2016-11-4     jskesljdfg23
102  2016-11-5     jskesljdbd23
103  2016-11-6     jskesljdw23ds

Since all the snapshots belong to November 2016, all of this data will be inserted/updated into user_snapshot_2016_11 with the following two queries:

Insert new:

INSERT INTO user_info_snapshot_2016_11 (id, snapshot_date, user_detail)
SELECT source.id, source.snapshot_date, source.user_detail
FROM user_info_snapshot_staging source
LEFT OUTER JOIN user_info_snapshot_2016_11 target ON source.id = target.id
WHERE target.id IS NULL;

Update existing:

UPDATE user_info_snapshot_2016_11
SET snapshot_date = source.snapshot_date, user_detail = source.user_detail
FROM user_info_snapshot_staging source
WHERE user_info_snapshot_2016_11.id = source.id;

Scenario 2) Now suppose the user_snapshot_staging table has:

id  snapshot_date user_detail
1300  2015-01-3     jskesljd234
1301  2015-10-4     jskesljdfg23
1302  2016-11-1     jskesljdbd23
1303  2016-11-2     jskesljdw23ds

Now the staging table has snapshots that require updates and inserts in different snapshot tables. We cannot just insert/update into user_snapshot_2016_11; we also need to insert/update into user_snapshot_2015_01 and user_snapshot_2015_10.

How should I design my query, or the code that generates a dynamic query, to handle these cases so that only the appropriate tables are joined with user_snapshot_staging, based on the data in the staging table?

Let me know if you need further clarification. Sorry, it is a little tricky to explain.

Answer 1

You can generate the queries with the following approach. I'll give examples in pseudo-code based on Python syntax.

Get the year/month combinations present in your staging table:

SELECT DISTINCT to_char(snapshot_date, 'YYYY-MM') FROM user_info_snapshot_staging;

These are your query templates:

-- insert_template.sql
INSERT INTO user_info_snapshot_{{ year }}_{{ month }} (id, snapshot_date, user_detail)
SELECT source.id, source.snapshot_date, source.user_detail
FROM user_info_snapshot_staging source
LEFT OUTER JOIN user_info_snapshot_{{ year }}_{{ month }} target ON source.id = target.id
WHERE target.id IS NULL
  AND to_char(source.snapshot_date, 'YYYY-MM') = '{{ year }}-{{ month }}';

-- update_template.sql
UPDATE user_info_snapshot_{{ year }}_{{ month }}
SET snapshot_date = source.snapshot_date, user_detail = source.user_detail
FROM user_info_snapshot_staging source
WHERE user_info_snapshot_{{ year }}_{{ month }}.id = source.id
  AND to_char(source.snapshot_date, 'YYYY-MM') = '{{ year }}-{{ month }}';
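
As a rough illustration of the templating step, here is a minimal sketch that fills an insert template with plain str.format (so the placeholders are written {year} and {month} rather than {{ year }}). The helper name render and the strict 'YYYY-MM' check are assumptions added for this example; table names cannot be passed as bound query parameters, so the value is validated before it is interpolated.

```python
import re

# Insert template, using str.format placeholders instead of {{ ... }}
INSERT_TEMPLATE = """
INSERT INTO user_info_snapshot_{year}_{month} (id, snapshot_date, user_detail)
SELECT source.id, source.snapshot_date, source.user_detail
FROM user_info_snapshot_staging source
LEFT OUTER JOIN user_info_snapshot_{year}_{month} target ON source.id = target.id
WHERE target.id IS NULL
  AND to_char(source.snapshot_date, 'YYYY-MM') = '{year}-{month}';
"""


def render(template: str, year_month: str) -> str:
    """Fill a query template for one 'YYYY-MM' value (hypothetical helper)."""
    # Reject anything that is not exactly YYYY-MM, since the value ends up
    # inside a table name and cannot be bound as a parameter.
    if not re.fullmatch(r"\d{4}-(0[1-9]|1[0-2])", year_month):
        raise ValueError(f"unexpected year-month value: {year_month!r}")
    year, month = year_month.split("-")
    return template.format(year=year, month=month)


sql = render(INSERT_TEMPLATE, "2015-01")
print(sql)
```

The same function can render an update template written with the same placeholders.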

Now loop through the year/month pairs and execute those queries:

# Fetch the distinct 'YYYY-MM' values first so the cursor can be reused inside the loop
cursor.execute("SELECT DISTINCT to_char(snapshot_date, 'YYYY-MM') FROM user_info_snapshot_staging")
for (year_month,) in cursor.fetchall():
    year, month = year_month.split('-')
    # update existing rows first, then insert the ones that are still missing
    for name in ('update_template', 'insert_template'):
        # this is where you generate sql (template() stands for your templating helper)
        sql = template(name, context={
            'year': year,
            'month': month,
        })
        # here you execute it
        cursor.execute(sql)

I would advise against using UPDATE if you need to update a lot of records. Further info in this question.
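
To see which snapshot tables a given staging batch would touch, the grouping can also be sketched in plain Python. This is only an illustration: the function name group_by_target_table is made up for this example, and it assumes zero-padded month suffixes (as in user_snapshot_2015_01), which is what to_char(..., 'YYYY-MM') produces.

```python
from collections import defaultdict
from datetime import date


def group_by_target_table(rows, prefix="user_info_snapshot"):
    """Group staging rows by the snapshot table their snapshot_date maps to."""
    groups = defaultdict(list)
    for row in rows:
        d = row["snapshot_date"]
        # Assumption: table suffix is <year>_<zero-padded month>
        groups[f"{prefix}_{d.year}_{d.month:02d}"].append(row)
    return dict(groups)


# The staging rows from Scenario 2 in the question
staging = [
    {"id": 1300, "snapshot_date": date(2015, 1, 3), "user_detail": "jskesljd234"},
    {"id": 1301, "snapshot_date": date(2015, 10, 4), "user_detail": "jskesljdfg23"},
    {"id": 1302, "snapshot_date": date(2016, 11, 1), "user_detail": "jskesljdbd23"},
    {"id": 1303, "snapshot_date": date(2016, 11, 2), "user_detail": "jskesljdw23ds"},
]

groups = group_by_target_table(staging)
for table in sorted(groups):
    print(table, [row["id"] for row in groups[table]])
```

For the Scenario 2 data this yields three target tables, with the two November 2016 rows grouped together.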
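
One common alternative to a large UPDATE is to delete the matching rows from the target table and re-insert them from staging inside a single transaction. The linked question is not reproduced here, so this may not be exactly what it recommends. The sketch below uses Python's built-in sqlite3 module as a stand-in for Redshift so that it can be run locally; merge_month is a hypothetical helper, dates are assumed to be stored as zero-padded ISO strings, and on Redshift the month filter would use to_char(snapshot_date, 'YYYY-MM') instead of SQLite's strftime.

```python
import sqlite3


def merge_month(conn, year_month):
    """Replace the staged rows for one 'YYYY-MM' month in its snapshot table."""
    year, month = year_month.split("-")
    if not (year.isdigit() and month.isdigit()):
        raise ValueError(f"unexpected year-month value: {year_month!r}")
    target = f"user_info_snapshot_{year}_{month}"
    with conn:  # commits on success, rolls back if either statement fails
        # Remove target rows that are about to be replaced from staging
        conn.execute(
            f"DELETE FROM {target} WHERE id IN ("
            "SELECT id FROM user_info_snapshot_staging "
            "WHERE strftime('%Y-%m', snapshot_date) = ?)",
            (year_month,),
        )
        # Insert every staged row for that month (both new and replaced ids)
        conn.execute(
            f"INSERT INTO {target} (id, snapshot_date, user_detail) "
            "SELECT id, snapshot_date, user_detail FROM user_info_snapshot_staging "
            "WHERE strftime('%Y-%m', snapshot_date) = ?",
            (year_month,),
        )


conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE user_info_snapshot_staging (id INTEGER, snapshot_date TEXT, user_detail TEXT);
CREATE TABLE user_info_snapshot_2016_11 (id INTEGER, snapshot_date TEXT, user_detail TEXT);
INSERT INTO user_info_snapshot_2016_11 VALUES (1302, '2016-11-01', 'old value');
INSERT INTO user_info_snapshot_staging VALUES
    (1300, '2015-01-03', 'jskesljd234'),
    (1302, '2016-11-01', 'jskesljdbd23'),
    (1303, '2016-11-02', 'jskesljdw23ds');
""")

merge_month(conn, "2016-11")
rows = conn.execute(
    "SELECT id, user_detail FROM user_info_snapshot_2016_11 ORDER BY id"
).fetchall()
print(rows)
```

After the call, the November table holds the staged version of id 1302 plus the new id 1303, and the 2015 row is left for its own month's run.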

Answer 1 (accepted) 2016-11-16 16:53:04
