SCD-2 in data modelling: how do I detect changes?

I know the concept of SCD-2 and I'm trying to improve my skills with it through some practice.

I have the following scenario/experiment:

  1. I call a REST API daily to extract information about companies. In my initial load to the DB everything is new, so everything is very easy.
  2. The next day I call the same REST API, which might return the same companies, but some of them may or may not have changes (e.g., they changed their size, profits, location, ...).

I know SCD-2 can be really simple if the REST API returns only records with changes, but in this case it may also return records without changes.

In this scenario, how do people detect whether a company's data has changed in order to apply SCD-2? Do they compare all the fields?

Is there any example out there that I can look at?

There is no standard SCD-2, nor even a single agreed definition of it. It is a general term for a large number of possible approaches. The only way forward is to practice and see what suits your use case.

In any case you must identify the natural key of the dimension and the set of attributes whose history you want to keep.
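As one possible way to answer "do they compare all the fields?", here is a minimal sketch that compares only the tracked attributes by hashing them. The column names (`company_id`, `size`, `profit`, `location`, `fetched_at`) and the helpers `row_hash` and `has_changed` are assumptions made up for this illustration, not part of any particular tool.

```python
import hashlib
import json

# Hypothetical choices for this sketch: the natural key and the attributes
# whose history we want to keep. Anything outside TRACKED is ignored.
NATURAL_KEY = "company_id"
TRACKED = ("size", "profit", "location")


def row_hash(record: dict) -> str:
    """Return a stable fingerprint of the tracked attributes only."""
    tracked_values = {name: record.get(name) for name in TRACKED}
    # sort_keys gives a deterministic serialization regardless of dict order.
    payload = json.dumps(tracked_values, sort_keys=True, default=str)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def has_changed(current: dict, incoming: dict) -> bool:
    """True when any tracked attribute differs between the two versions."""
    return row_hash(current) != row_hash(incoming)


yesterday = {"company_id": 1, "size": 50, "profit": 1.2,
             "location": "Madrid", "fetched_at": "2024-01-01"}
today_same = {"company_id": 1, "size": 50, "profit": 1.2,
              "location": "Madrid", "fetched_at": "2024-01-02"}
today_moved = {"company_id": 1, "size": 50, "profit": 1.2,
               "location": "Lisbon", "fetched_at": "2024-01-02"}

print(has_changed(yesterday, today_same))   # fetched_at is not tracked
print(has_changed(yesterday, today_moved))  # location is tracked
```

Storing the hash next to the current row means the daily load only has to compare one value per natural key. Comparing the tracked fields one by one works just as well; the hash is only a convenience.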

You may of course make it more complex by deciding to use your own surrogate key.

You mentioned that there are two main types of interface for the process:

• You periodically get a full set of the dimension data

• You get the “changes only” (aka delta interface)

Paradoxically, the former is much simpler to handle than the latter.

First of all, in the full dimension snapshot the natural key holds (each entity appears at most once), contrary to the delta interface, where you may get several changes for one entity.
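For the full-snapshot case, the comparison can be sketched as a lookup by natural key. This is only an illustration under assumptions: the function `classify_snapshot`, the key `company_id` and the tracked columns are hypothetical, and `current_rows` stands for the currently valid rows of the dimension.

```python
def classify_snapshot(current_rows, snapshot_rows,
                      key="company_id",
                      tracked=("size", "profit", "location")):
    """Compare a full snapshot with the currently valid dimension rows."""
    # A full snapshot carries at most one row per natural key, so a dict works.
    current = {row[key]: row for row in current_rows}
    incoming = {row[key]: row for row in snapshot_rows}

    result = {"new": [], "changed": [], "unchanged": [], "missing": []}
    for k, row in incoming.items():
        if k not in current:
            result["new"].append(k)
        elif any(row.get(a) != current[k].get(a) for a in tracked):
            result["changed"].append(k)
        else:
            result["unchanged"].append(k)
    # Keys we hold but the snapshot no longer delivers: candidates for a delete.
    result["missing"] = [k for k in current if k not in incoming]
    return result


current_rows = [
    {"company_id": 1, "size": 50, "profit": 1.2, "location": "Madrid"},
    {"company_id": 2, "size": 10, "profit": 0.3, "location": "Porto"},
    {"company_id": 3, "size": 75, "profit": 4.0, "location": "Lyon"},
]
snapshot_rows = [
    {"company_id": 1, "size": 50, "profit": 1.2, "location": "Madrid"},  # same
    {"company_id": 2, "size": 12, "profit": 0.3, "location": "Porto"},   # size changed
    {"company_id": 4, "size": 5, "profit": 0.1, "location": "Turin"},    # new
]
result = classify_snapshot(current_rows, snapshot_rows)
print(result)
```

Only the "new" and "changed" keys need a new SCD-2 version. The "unchanged" keys are skipped, and what happens to "missing" keys depends on whether you decide to support deletes.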

Additionally, with the delta interface you have to handle late delivery of changes, or even changes delivered in the wrong order.
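A rough sketch of what that means for a delta feed: group the events per entity, sort them by their change time, and separate the ones that arrive after a newer change was already applied. The event layout, the `changed_at` field, the `last_applied` watermark and the function `order_delta` are all assumptions for this example.

```python
from collections import defaultdict
from datetime import datetime


def order_delta(events, last_applied):
    """Sort delta events per entity and split out late arrivals.

    last_applied maps a natural key to the change time of the newest
    version already stored in the dimension (hypothetical bookkeeping).
    """
    per_key = defaultdict(list)
    for event in events:
        per_key[event["company_id"]].append(event)

    in_order, late = [], []
    for key in sorted(per_key):
        watermark = last_applied.get(key)
        for event in sorted(per_key[key], key=lambda e: e["changed_at"]):
            if watermark is not None and event["changed_at"] <= watermark:
                late.append(event)      # older than what we already hold
            else:
                in_order.append(event)  # safe to apply in this order
    return in_order, late


events = [
    {"company_id": 1, "changed_at": datetime(2024, 3, 2, 9), "size": 60},
    {"company_id": 1, "changed_at": datetime(2024, 3, 1, 9), "size": 55},
    {"company_id": 2, "changed_at": datetime(2024, 2, 1, 9), "size": 11},
]
last_applied = {2: datetime(2024, 2, 15, 9)}
in_order, late = order_delta(events, last_applied)
print([e["size"] for e in in_order], [e["company_id"] for e in late])
```

What to do with the late events (reject them, or rewrite the history around them) is a design decision the sketch deliberately leaves open.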

The next important decision is whether you expect deletes to occur. This is again trivial in the full interface; in the delta interface you must define some convention for how this information is passed. A related question is whether a previously deleted entity can be reused (i.e. reappear in the data).

If you support delete/reuse you'll have to think about how to show them in your dimension table.
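One possible representation, sketched here under assumptions, is to close the validity interval when an entity is deleted and to open a fresh row when it reappears, which leaves a gap in the history. The columns `valid_from` and `valid_to`, the far-future `OPEN_END` marker and the helpers `mark_deleted` and `is_active` are hypothetical; a deleted flag column would be another option.

```python
from datetime import date

OPEN_END = date(9999, 12, 31)  # assumed convention for "still valid"


def mark_deleted(dimension, key, deleted_on):
    """Close the open version of `key`; no new row is written."""
    for row in dimension:
        if row["company_id"] == key and row["valid_to"] == OPEN_END:
            row["valid_to"] = deleted_on


def is_active(dimension, key, as_of):
    """True if some version of `key` covers `as_of` (interval is half-open)."""
    return any(
        row["company_id"] == key and row["valid_from"] <= as_of < row["valid_to"]
        for row in dimension
    )


dimension = [
    {"company_id": 1, "location": "Madrid",
     "valid_from": date(2024, 1, 1), "valid_to": OPEN_END},
]
mark_deleted(dimension, 1, date(2024, 2, 1))
# The entity reappears a month later: a new open row, leaving a gap.
dimension.append(
    {"company_id": 1, "location": "Madrid",
     "valid_from": date(2024, 3, 1), "valid_to": OPEN_END}
)
print(is_active(dimension, 1, date(2024, 2, 15)))
```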

In any case you will need some additional columns in the dimension to hold the historical information.

Some implementations use a change_timestamp; others use a validity interval valid_from and valid_to.
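To illustrate the validity-interval variant, here is a minimal in-memory sketch of applying one change: the current version is closed and a new open version is appended. The half-open interval convention, the `OPEN_END` marker and the function `apply_change` are assumptions of this example; in a real database this would typically be an UPDATE plus an INSERT.

```python
from datetime import date

OPEN_END = date(9999, 12, 31)  # assumed marker for the current version


def apply_change(dimension, key, new_attrs, effective):
    """Close the open version of `key` (if any) and append a new one."""
    for row in dimension:
        if row["company_id"] == key and row["valid_to"] == OPEN_END:
            row["valid_to"] = effective  # old version ends where the new starts
    dimension.append(
        {"company_id": key, **new_attrs,
         "valid_from": effective, "valid_to": OPEN_END}
    )
    return dimension


dimension = []
apply_change(dimension, 1, {"location": "Madrid"}, date(2024, 1, 1))
apply_change(dimension, 1, {"location": "Lisbon"}, date(2024, 2, 1))
for row in dimension:
    print(row)
```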

Still other implementations claim that an additional sequence number is required, so that you avoid the trap of several changes with an identical timestamp.
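A small sketch of why the sequence number helps: when two changes carry the same timestamp, the timestamp alone cannot tell which one is the latest, but the pair (timestamp, sequence) can. The field names `changed_at` and `seq` and the helper `latest_version` are hypothetical.

```python
def latest_version(events):
    """Pick the newest event, using seq to break timestamp ties."""
    # ISO-8601 strings in the same format sort chronologically as text.
    return max(events, key=lambda e: (e["changed_at"], e["seq"]))


events = [
    {"company_id": 7, "changed_at": "2024-03-01T10:00:00", "seq": 2, "size": 120},
    {"company_id": 7, "changed_at": "2024-03-01T10:00:00", "seq": 1, "size": 100},
    {"company_id": 7, "changed_at": "2024-02-20T08:30:00", "seq": 5, "size": 90},
]
print(latest_version(events)["size"])
```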

So you see that before you look for a particular implementation you need to decide carefully on the options above. For example, the full and delta interfaces lead to completely different implementations.
