
Updating data in a data warehouse

I am currently designing a star-schema-based warehouse and have some questions on techniques for processing future and past data.

Some events in the source system can relate to the future. For example, an employee may apply for leave at a future date. The business wants to see this future data for planning, but by its nature it can change.

  • Q1: Do you bring in the future data in the warehouse?
  • Q2: How do you manage the updates when it changes?

Similarly, if past data changes, for example a sale is amended a few days later because of a mistake, how do you handle that in the warehouse?

Looking at it as "past" and "future" data is a bit misleading, because, as you have said, there are good reasons either type of data may need to be updated after its initial load into the data warehouse.

I suggest thinking about this data as "planned" and "actual" leave taken, instead. Hopefully by doing so, it becomes clearer that both types may be relevant to load, and later update, in a data warehouse.

This is because reporting and analysis may be required for both planned and actual leave (so loading both types into the DW is relevant). Also, your planned leave may change, and your actual leave may need to be corrected in the source system after the initial upload (so updating both types in the DW is also relevant).
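To make the "planned vs actual" framing concrete, here is a minimal sketch of one way to hold both in a single star-schema fact table, using a status flag so reports can filter or compare the two. The table and column names (fact_leave, leave_status, and so on) are hypothetical, not something from the question or the linked article.

```python
# Minimal sketch: planned and actual leave in one fact table of a star schema.
# All object names are hypothetical; SQLite is used only to keep the example
# self-contained and runnable.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_employee (
    employee_key INTEGER PRIMARY KEY,
    employee_id  TEXT NOT NULL,
    full_name    TEXT NOT NULL
);

CREATE TABLE fact_leave (
    leave_key     INTEGER PRIMARY KEY,
    employee_key  INTEGER NOT NULL REFERENCES dim_employee(employee_key),
    leave_date    TEXT NOT NULL,          -- would be a date-dimension key in a fuller design
    hours_taken   REAL NOT NULL,
    leave_status  TEXT NOT NULL CHECK (leave_status IN ('planned', 'actual')),
    last_updated  TEXT NOT NULL           -- when the source row last changed
);
""")

# A planned leave row is loaded now, then updated to 'actual' once the leave is taken.
conn.execute("INSERT INTO dim_employee VALUES (1, 'E042', 'A. Sharma')")
conn.execute(
    "INSERT INTO fact_leave VALUES (1, 1, '2024-07-15', 8.0, 'planned', '2024-06-01')"
)
conn.execute(
    "UPDATE fact_leave SET leave_status = 'actual', last_updated = '2024-07-16' "
    "WHERE leave_key = 1"
)
conn.commit()
```

Keeping both statuses in one fact table is only one option; a separate "planned leave" fact table, or a snapshot-style design, may suit your reporting better.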

Should planned leave data go into the data warehouse?

This is subjective, and depends entirely on your use cases.

In broad terms, the purpose of a data warehouse is to efficiently store and query large amounts of data. In practice, this is often for the purpose of business reporting (eg month end, year end) and analytics.

So, whether planned leave data is relevant to the above depends on the context of your organisation and users, and an understanding of what business value there is (or not) in storing that data in the data warehouse.

How do you manage the updates when source data changes?

Have a read of this blog post by James Serra. Although it's a bit dated (posted in 2011), the concepts are still current and it explains the key ideas really well.

From the article, there are two approaches to loading data into the data warehouse:

  1. Full Extraction: All the data is extracted completely from the source system. Because this extraction reflects all the data currently available on the source system, there is no need to keep track of changes to the source data since the last successful extraction.
  2. Incremental Extraction: Only the data that has changed from a specific point in time in history will be extracted. This point in time may be the time of the last extraction, or a business event like the last day of a fiscal period. To identify this delta change, there must be the possibility to identify all the changed information since this specific point in time.

Full extraction is simple, but inefficient for large volumes of data.
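As a rough illustration of full extraction, the sketch below simply wipes the warehouse copy and reloads everything from the source on each run, so no change tracking is needed. The source and warehouse table names (src_sales, dw_sales) are hypothetical, and SQLite stands in for whatever databases you actually use.

```python
# Minimal sketch of full extraction: truncate and reload the warehouse table.
# Table and database names are hypothetical.
import sqlite3

def full_extract(source_db: str, warehouse_db: str) -> None:
    src = sqlite3.connect(source_db)
    dw = sqlite3.connect(warehouse_db)

    # Pull everything currently in the source.
    rows = src.execute("SELECT sale_id, amount, sale_date FROM src_sales").fetchall()

    # Replace the warehouse copy wholesale.
    dw.execute("DELETE FROM dw_sales")
    dw.executemany("INSERT INTO dw_sales (sale_id, amount, sale_date) VALUES (?, ?, ?)", rows)
    dw.commit()

    src.close()
    dw.close()
```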

Incremental extraction is more efficient, but requires a way to identify the delta, i.e. the entries in the source data that are new, changed, or deleted since the last load. James' article outlines some approaches to this. This article on change tracking in SQL Server may also be helpful.
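For illustration, here is a sketch of an incremental load that assumes the source table carries a last_updated timestamp (one common way of identifying the delta; change tracking or CDC are alternatives). Changed rows are upserted into the warehouse, which is also how amended past data or revised planned leave would overwrite the stale warehouse rows. Again, the table and column names are hypothetical, and dw_sales is assumed to have sale_id as its primary key so the upsert has something to conflict on.

```python
# Minimal sketch of incremental extraction using a "last_updated" watermark,
# followed by an upsert into the warehouse. All object names are hypothetical.
import sqlite3

def incremental_extract(source_db: str, warehouse_db: str, watermark: str) -> str:
    src = sqlite3.connect(source_db)
    dw = sqlite3.connect(warehouse_db)

    # Only rows changed since the last successful load.
    changed = src.execute(
        "SELECT sale_id, amount, sale_date, last_updated "
        "FROM src_sales WHERE last_updated > ?",
        (watermark,),
    ).fetchall()

    # Upsert: new rows are inserted, amended rows replace the stale version.
    # Requires sale_id to be the primary key (or unique) in dw_sales.
    dw.executemany(
        "INSERT INTO dw_sales (sale_id, amount, sale_date, last_updated) "
        "VALUES (?, ?, ?, ?) "
        "ON CONFLICT(sale_id) DO UPDATE SET "
        "  amount = excluded.amount, "
        "  sale_date = excluded.sale_date, "
        "  last_updated = excluded.last_updated",
        changed,
    )
    dw.commit()

    # The new watermark is the latest change we processed this run.
    new_watermark = max((row[3] for row in changed), default=watermark)
    src.close()
    dw.close()
    return new_watermark
```

Deletes in the source need separate handling (for example soft-delete flags or a change-tracking feed), since a timestamp column alone cannot tell you that a row has disappeared.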
