Logic to find deltas/CDC records for capturing just inserts or updates in BigQuery

I have some raw data in BigQuery tables that are truncate-and-load. My daily ETL feed, which runs on these raw BQ tables, is a daily snapshot of agents; the daily extracts are shown below.

To give more background, I am trying to implement an insert-only table to support the concept of virtualized SCD type 2 logic. This article focuses on implementing SCD type 2 with delta extracts directly.

My requirement is to design logic/code that works out these "insert only" daily extracts so that I can build the virtualized SCD 2 table. I am thinking of putting every daily extract in its own daily partition of a BigQuery table, so that all the daily changes sit in one final table on which to build the view.

What is an efficient logic/code/design to find the delta extract every day and save it in a table of inserts (possibly partitioned, see the final table) in BigQuery?

Daily extract on 2022-03-01

Agent_Key   Agent_name  MD5_CD  row_eff_ts
12345       Josh        abcde   2022-03-01 04:14:06

Delta extract on 2022-03-01 should look like

Agent_Key   Agent_name  MD5_CD  row_eff_ts
12345       Josh        abcde   2022-03-01 04:14:06

Daily extract on 2022-03-02 (new record for 12346)

Agent_Key   Agent_name  MD5_CD  row_eff_ts
12345       Josh        abcde   2022-03-02 04:14:06
12346       Mark        fghij   2022-03-02 04:14:06

Delta extract on 2022-03-02 should look like (should only capture changes)

Agent_Key   Agent_name  MD5_CD  row_eff_ts
12346       Mark        fghij   2022-03-02 04:14:06

Daily extract on 2022-03-03 (updated record for 12345)

Agent_Key   Agent_name  MD5_CD  row_eff_ts
12345       Josher      mnopi   2022-03-03 04:14:06
12346       Mark        fghij   2022-03-02 04:14:06

Delta extract on 2022-03-03 should look like (should only capture changes)

Agent_Key   Agent_name  MD5_CD  row_eff_ts
12345       Josher      mnopi   2022-03-03 04:14:06

I have to build this final table of inserts (new + changed records) by appending each day's delta extract, so that I can build a view to calculate my row_end_eff_ts.

Final table of inserts should look like

Agent_Key   Agent_name  MD5_CD  row_eff_ts
12345       Josh        abcde   2022-03-01 04:14:06
12346       Mark        fghij   2022-03-02 04:14:06
12345       Josher      mnopi   2022-03-03 04:14:06

A couple of comments on your question first:

  • The daily extract on 2022-03-03 shows row_eff_ts as "2022-03-02" for Mark. I'm assuming this is a typo and should be "2022-03-03".
  • I think you're misunderstanding MD5_CD. According to your link, the hash is meant to be a consistent hash of the business keys, which shouldn't change based on user updates. I think Agent 12345 should still have MD5_CD="abcde".
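
To make that distinction concrete, here is a small sketch of the two possible readings of the hash. The column names key_hash and attr_hash are hypothetical and are not part of your tables; the inline rows only stand in for two versions of agent 12345.

```sql
-- key_hash and attr_hash are hypothetical names used only for illustration.
select
    Agent_Key,
    Agent_name,
    -- Hash of the business key: identical for both versions of agent 12345.
    to_hex(md5(cast(Agent_Key as string))) as key_hash,
    -- Hash of the tracked attribute: differs once the name changes.
    to_hex(md5(Agent_name)) as attr_hash
from unnest([
    struct(12345 as Agent_Key, "Josh" as Agent_name),
    struct(12345, "Josher")
]);
```

If MD5_CD in your data is really an attribute hash like attr_hash, comparing it between versions would work as a change detector; if it is a key hash, it cannot, which is why the query further down compares Agent_name instead.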

The query below runs over the full set of daily extracts to produce all of the deltas. If you want to process exactly one day at a time, you can add a WHERE clause over row_eff_ts to restrict the output to a given day. Apply that filter in the final SELECT rather than inside the CTE, so that LAG can still see the previous day's row.

-- create sample data
create temp table daily_extracts (
    Agent_Key INT64,
    Agent_name STRING,
    MD5_CD STRING,
    row_eff_ts TIMESTAMP
);

insert daily_extracts (Agent_Key, Agent_name, MD5_CD, row_eff_ts)
VALUES
(12345, "Josh", "abcde", "2022-03-01 04:14:06"),
(12345, "Josh", "abcde", "2022-03-02 04:14:06"),
(12346, "Mark", "fghij", "2022-03-02 04:14:06"),
(12345, "Josher", "mnopi", "2022-03-03 04:14:06"),
(12346, "Mark", "fghij", "2022-03-03 04:14:06");

-- Calculate previous_Agent name. This could be combined with the 
-- query below, but this makes it easier to build the query 
-- incrementally IMO, without copying/pasting SQL as much.
with daily_data_and_prev_name as (
    select 
        *, 
        lag(Agent_name) over 
            (partition by Agent_Key order by row_eff_ts) previous_Agent_name,
    from daily_extracts
)

select * except (previous_Agent_name), 
from daily_data_and_prev_name
where 
  -- This finds name changes.
  Agent_name != previous_Agent_name 
  -- This finds new records.
  or previous_Agent_name is null
order by row_eff_ts
;

Output of the last query:

Row     Agent_Key   Agent_name  MD5_CD  row_eff_ts
1       12345       Josh        abcde   2022-03-01 04:14:06 UTC
2       12346       Mark        fghij   2022-03-02 04:14:06 UTC
3       12345       Josher      mnopi   2022-03-03 04:14:06 UTC
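
If you would rather process one day at a time against the final table of inserts, one option is to compare each new snapshot with the latest stored version of every agent and append only the rows that differ. This is a sketch under assumptions, not a tested pipeline: agent_inserts and todays_extract are hypothetical table names, temp tables stand in for your real ones, and Agent_name is assumed to be non-null. The last statement shows one way a view could derive row_end_eff_ts.

```sql
-- Hypothetical stand-in for the final table of inserts, seeded with the
-- deltas from 2022-03-01 and 2022-03-02.
create temp table agent_inserts (
    Agent_Key INT64,
    Agent_name STRING,
    MD5_CD STRING,
    row_eff_ts TIMESTAMP
);

insert agent_inserts (Agent_Key, Agent_name, MD5_CD, row_eff_ts)
VALUES
(12345, "Josh", "abcde", TIMESTAMP "2022-03-01 04:14:06"),
(12346, "Mark", "fghij", TIMESTAMP "2022-03-02 04:14:06");

-- Hypothetical stand-in for the truncate-and-load snapshot of 2022-03-03.
create temp table todays_extract (
    Agent_Key INT64,
    Agent_name STRING,
    MD5_CD STRING,
    row_eff_ts TIMESTAMP
);

insert todays_extract (Agent_Key, Agent_name, MD5_CD, row_eff_ts)
VALUES
(12345, "Josher", "mnopi", TIMESTAMP "2022-03-03 04:14:06"),
(12346, "Mark", "fghij", TIMESTAMP "2022-03-03 04:14:06");

-- Append rows whose key is new, or whose name differs from the latest
-- stored version of that key.
insert agent_inserts (Agent_Key, Agent_name, MD5_CD, row_eff_ts)
select t.Agent_Key, t.Agent_name, t.MD5_CD, t.row_eff_ts
from todays_extract t
left join (
    select Agent_Key, Agent_name
    from (
        select
            Agent_Key,
            Agent_name,
            row_number() over
                (partition by Agent_Key order by row_eff_ts desc) rn
        from agent_inserts
    )
    where rn = 1
) latest
on t.Agent_Key = latest.Agent_Key
where
  -- This finds new records.
  latest.Agent_Key is null
  -- This finds name changes.
  or t.Agent_name != latest.Agent_name;

-- Derive the end of each version from the start of the next version of
-- the same agent; the current version gets NULL.
select
    *,
    lead(row_eff_ts) over
        (partition by Agent_Key order by row_eff_ts) row_end_eff_ts
from agent_inserts
order by row_eff_ts;
```

With the sample rows above, only the "Josher" row is appended, and the "Josh" row ends at 2022-03-03 04:14:06. With a permanent table instead of a temp table, partitioning it by DATE(row_eff_ts) would give each day's delta its own partition, as you describe.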
