Update/Insert records into different snapshot tables based on snapshot_date and id from a staging table
I have a huge snapshot table (for example, user_snapshot_all) that is split into smaller tables on Redshift (which is Postgres-based) for better performance.
The smaller tables carry a year_month suffix:
user_snapshot_1995_1
user_snapshot_1995_2
user_snapshot_1995_3
user_snapshot_1995_4
....
user_snapshot_2016_11
Each table holds the snapshot records for the year and month in its suffix.
I use a staging table, user_snapshot_staging, to load/update data into these tables incrementally. In 99% of cases only the latest year_month table is affected.
But there are edge cases. Around midnight at a month boundary, the staging table can hold data that spans two tables (for example, user_snapshot_2016_10 and user_snapshot_2016_11 on 2016-11-1). Or we may need to update a few 2-year-old snapshots, so the staging table holds some 2-year-old records along with a lot of today's snapshots.
The question is: how should I design my query or code so that it updates or inserts data into the right year_month snapshot table?
All the snapshot tables and the staging table have at least these two columns:
id
snapshot_date
To clarify further: if there were a single user_snapshot_all table, I could easily update the records by joining the staging table with the master table on snapshot_date and id. But with the smaller tables segmented by year_month, there is no guarantee that all records from the staging table can be found in one snapshot table.
Here are the use cases. Note: the queries below are part of an ETL process, not one-off manual queries, which is why I need an automated solution.
Scenario 1) Suppose the user_snapshot_staging table has:
id snapshot_date user_detail
100 2016-11-3 jskesljd234
101 2016-11-4 jskesljdfg23
102 2016-11-5 jskesljdbd23
103 2016-11-6 jskesljdw23ds
Since all the snapshots belong to November 2016, all of this data will be inserted/updated into user_snapshot_2016_11 with the following two queries:
Insert new:
INSERT INTO user_info_snapshot_2016_11 (id, snapshot_date, user_detail)
SELECT source.id, source.snapshot_date, source.user_detail
FROM user_info_snapshot_staging source LEFT OUTER JOIN user_info_snapshot_2016_11 target ON source.id = target.id WHERE target.id IS NULL
;
Update existing:
UPDATE user_info_snapshot_2016_11 SET snapshot_date = source.snapshot_date, user_detail = source.user_detail
FROM user_info_snapshot_staging source WHERE user_info_snapshot_2016_11.id = source.id;
Scenario 2) Now suppose the user_snapshot_staging table has:
id snapshot_date user_detail
1300 2015-01-3 jskesljd234
1301 2015-10-4 jskesljdfg23
1302 2016-11-1 jskesljdbd23
1303 2016-11-2 jskesljdw23ds
Now the staging table has snapshots that require updates and inserts into different snapshot tables. We cannot just insert/update into user_snapshot_2016_11; we also need to insert/update into user_snapshot_2015_01 and user_snapshot_2015_10.
How should I design my query, or the code that generates the dynamic query, so that only the appropriate tables are joined with user_snapshot_staging based on the data in the staging table?
Let me know if you need further clarification. Sorry, it is a little tricky to explain.
You can generate the queries with the following approach. I'll give examples in pseudo-code based on Python syntax. First, find the year/month pairs present in the staging table:
SELECT DISTINCT to_char(snapshot_date, 'YYYY-MM') FROM user_info_snapshot_staging;
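To make the result of that query concrete, here is a small Python sketch (not part of the original answer) that derives the same distinct year/month pairs from the Scenario 2 dates and maps them to table names. The helper name `target_tables` and the zero-padded month suffix are assumptions: the question shows both padded (2015_01) and unpadded (1995_1) names, so adjust the format to your actual naming.

```python
from datetime import date

# Snapshot dates taken from Scenario 2 in the question.
staging_dates = [
    date(2015, 1, 3),
    date(2015, 10, 4),
    date(2016, 11, 1),
    date(2016, 11, 2),
]


def target_tables(dates, prefix="user_info_snapshot"):
    """Return the sorted, de-duplicated monthly table names the dates map to.

    Hypothetical helper; assumes a zero-padded month in the table suffix.
    """
    return sorted({f"{prefix}_{d.year}_{d.month:02d}" for d in dates})


print(target_tables(staging_dates))
```

Four staging rows collapse to three target tables here, which is exactly the set of tables the generated queries need to touch.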
-- insert_template.sql
INSERT INTO user_info_snapshot_{{ year }}_{{ month }} (id, snapshot_date, user_detail)
SELECT source.id, source.snapshot_date, source.user_detail
FROM user_info_snapshot_staging source LEFT OUTER JOIN user_info_snapshot_{{ year }}_{{ month }} target ON source.id = target.id
WHERE target.id IS NULL
AND to_char(source.snapshot_date, 'YYYY-MM') = '{{ year }}-{{ month }}';
-- update_template.sql
UPDATE user_info_snapshot_{{ year }}_{{ month }} SET snapshot_date = source.snapshot_date, user_detail = source.user_detail
FROM user_info_snapshot_staging source
WHERE user_info_snapshot_{{ year }}_{{ month }}.id = source.id
AND to_char(source.snapshot_date, 'YYYY-MM') = '{{ year }}-{{ month }}';
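Now loop through the year/month pairs and execute those queries:
# pseudo-code: `cursor` is a DB-API cursor, `template` is a placeholder for your template renderer
cursor.execute("SELECT DISTINCT to_char(snapshot_date, 'YYYY-MM') FROM user_info_snapshot_staging")
# fetch all pairs first, because the cursor is reused inside the loop
for (year_month,) in cursor.fetchall():
    year, month = year_month.split('-')
    for name in ('update_template', 'insert_template'):
        # this is where you generate sql
        sql = template(name, context={
            'year': year,
            'month': month,
        })
        # here you execute it
        cursor.execute(sql)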
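As a runnable variant of the loop above, the sketch below uses Python's standard `string.Template` in place of the unspecified `template` helper. It only renders the SQL; executing the statements is left to whatever DB-API cursor you use. The function name `render_statements`, the `to_char` month filter, and the digit check are assumptions added for illustration. The digit check matters because table names cannot be passed as bind parameters, so the values are interpolated into the SQL text directly.

```python
from string import Template

UPDATE_SQL = Template("""\
UPDATE user_info_snapshot_${year}_${month}
SET snapshot_date = source.snapshot_date, user_detail = source.user_detail
FROM user_info_snapshot_staging source
WHERE user_info_snapshot_${year}_${month}.id = source.id
  AND to_char(source.snapshot_date, 'YYYY-MM') = '${year}-${month}';""")

INSERT_SQL = Template("""\
INSERT INTO user_info_snapshot_${year}_${month} (id, snapshot_date, user_detail)
SELECT source.id, source.snapshot_date, source.user_detail
FROM user_info_snapshot_staging source
LEFT OUTER JOIN user_info_snapshot_${year}_${month} target ON source.id = target.id
WHERE target.id IS NULL
  AND to_char(source.snapshot_date, 'YYYY-MM') = '${year}-${month}';""")


def render_statements(year_months):
    """Return UPDATE then INSERT statements for each 'YYYY-MM' string."""
    statements = []
    for year_month in year_months:
        year, month = year_month.split("-")
        # Values end up inside identifiers, so accept digits only.
        if not (year.isdigit() and month.isdigit()):
            raise ValueError(f"unexpected year-month value: {year_month!r}")
        for tpl in (UPDATE_SQL, INSERT_SQL):
            statements.append(tpl.substitute(year=year, month=month))
    return statements


# Year/month pairs as the DISTINCT query would return them for Scenario 2.
for statement in render_statements(["2015-01", "2015-10", "2016-11"]):
    print(statement, end="\n\n")
```

Running the UPDATE before the INSERT for each month keeps freshly inserted rows from being matched again by the UPDATE.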
I would advise against using UPDATE if you need to update a lot of records. Further info in this question.
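One commonly used alternative to a large UPDATE is to delete the matching rows and re-insert them from staging inside a single transaction. Whether that is what the linked question recommends is not confirmed here, so treat the following as a sketch under assumptions. It uses an in-memory SQLite database purely as a stand-in so the example runs anywhere; on Redshift the month filter would use `to_char` instead of `strftime`. The function name `replace_month` and the sample "old" row are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE user_info_snapshot_staging (id INTEGER, snapshot_date TEXT, user_detail TEXT);
CREATE TABLE user_info_snapshot_2016_11 (id INTEGER, snapshot_date TEXT, user_detail TEXT);
INSERT INTO user_info_snapshot_2016_11 VALUES (100, '2016-11-01', 'old');
INSERT INTO user_info_snapshot_staging VALUES
    (100, '2016-11-03', 'jskesljd234'),
    (101, '2016-11-04', 'jskesljdfg23');
""")


def replace_month(conn, year, month):
    """Delete rows that staging will replace, then insert all staging rows for the month."""
    # int() formatting guarantees only digits reach the table name.
    table = f"user_info_snapshot_{int(year):04d}_{int(month):02d}"
    year_month = f"{int(year):04d}-{int(month):02d}"
    with conn:  # commits on success, rolls back if either statement fails
        conn.execute(
            f"DELETE FROM {table} WHERE id IN ("
            "SELECT id FROM user_info_snapshot_staging "
            "WHERE strftime('%Y-%m', snapshot_date) = ?)",
            (year_month,),
        )
        conn.execute(
            f"INSERT INTO {table} (id, snapshot_date, user_detail) "
            "SELECT id, snapshot_date, user_detail FROM user_info_snapshot_staging "
            "WHERE strftime('%Y-%m', snapshot_date) = ?",
            (year_month,),
        )


replace_month(conn, 2016, 11)
rows = conn.execute(
    "SELECT id, snapshot_date, user_detail FROM user_info_snapshot_2016_11 ORDER BY id"
).fetchall()
print(rows)
```

After the call, id 100 carries the staging values and id 101 is newly present, so one delete plus one insert covers both the update and the insert case for that month.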