简体   繁体   中英

SCD-2 in data modelling: how do I detect changes?

I know the concept of SCD-2 and I'm trying to improve my skills about it doing some practices.

I have the next scenario/experiment:

  1. I'm calling daily to a rest API to extract information about companies. In my initial load to the DB everything is new, so everything is very easy.
  2. Next day I call to the same rest API, which might returns the same companies, but some of them might have (or not) some changes (ie, they changed the size, the profits, the location, ...)

I know SCD-2 might be really simple if the rest API returns just records with changes, but in this case it might returns as well records without changes.

In this scenario, how people detect if the data of a company has changes or not in order to apply SCD-2?, do they compare all the fields?.

Is there any example out there that I can see?

There is no standard SCD-2 nor even a unique concept of it. It is a general term for large number of possible approaches. The only chance is to practice and see what is suitable for your use case.

In any case you must identify the natural key of the dimension and the set of the attributes you want to keep the history.

You may of course make it more complex by the decision to use your own surrogate key .

You mentioned that there are two main types of the interface for the process:

• You get periodically a full set of the dimension data

• You get the “changes only” (aka delta interface)

Paradoxically the former is much simple to handle than the latter.

First of all, in the full dimensional snapshot the natural key holds, contrary to the delta interface (where you may get more changes for one entity).

Additionally you have to handle the case of late change delivery or even the wrong order of changes delivery.

Next important decision is if you expect deletes to occur. This is again trivial in the full interface, you must define some convention, how this information would be passed in the delta interface. Connected is the question whether a previously deleted entity can be reused (ie reappear in the data).

If you support delete/reuse you'll have to thing about how to show them in your dimension table.

In any case you will need some additional columns in the dimension to cover the historical information.

Some implementation use a change_timestamp , some other use validity interval valid_from and valid_to .

Even other implementation claim that additional sequence number is required – so you avoid the trap of more changes with the identical timestamp.

So you see that before you look for some particular implementation you need carefully decide the options above. For example the full and delta interface leads to a completely different implementations.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM