
Fact table partitioning: how to handle updates in ETL?

We are trying to implement table partitioning for a Data Warehouse Fact table which contains approximately 400M rows. Our ETL takes data from the source system going back 50 days from the previous load (new rows and modified rows, based on the source system timestamp). So in every ETL cycle there are new rows coming in, and also old rows which update the corresponding rows in the Fact table. The idea is to insert new rows into the Fact table and update the modified rows.

The partition column would be date (int, YYYYMMDD) and we are considering partitioning by month.

As far as I understand, table partitioning would ease our inserts via fast partition switch operations. We could split the most recent partition to create a new free partition, load new rows into a staging table (using a date constraint, e.g. for the most recent month) and then use a partition switch operation to "move" the new rows into the partitioned Fact table. But how can we handle the modified rows which should update the corresponding rows in the Fact table? Those rows can contain data from the previous month(s). Does partition switch help here? Usually INSERT and UPDATE rows are determined by an ETL tool (e.g. SSIS in our case) or by a MERGE statement. How does partitioning work in this kind of situation?
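For reference, a minimal T-SQL sketch of the split/stage/switch flow described above, assuming a monthly RANGE RIGHT function on the int YYYYMMDD key; all object names (FactSales, pfFactDate, psFactDate, stgFactSales) are illustrative:

```sql
-- Illustrative sketch only; object names are hypothetical.
-- Monthly partition function/scheme on the int YYYYMMDD date key.
CREATE PARTITION FUNCTION pfFactDate (int)
    AS RANGE RIGHT FOR VALUES (20240101, 20240201, 20240301);

CREATE PARTITION SCHEME psFactDate
    AS PARTITION pfFactDate ALL TO ([PRIMARY]);

CREATE TABLE FactSales (
    DateKey       int   NOT NULL,
    InvoiceAmount money NOT NULL
) ON psFactDate (DateKey);

-- Before each monthly load: split off a new empty partition at the end.
ALTER PARTITION SCHEME psFactDate NEXT USED [PRIMARY];
ALTER PARTITION FUNCTION pfFactDate() SPLIT RANGE (20240401);

-- Staging table: same structure, indexes and filegroup as the target, plus a
-- CHECK constraint matching the target partition's boundaries.
CREATE TABLE stgFactSales (
    DateKey       int   NOT NULL,
    InvoiceAmount money NOT NULL,
    CONSTRAINT ck_stgFactSales_DateKey
        CHECK (DateKey >= 20240301 AND DateKey < 20240401)
) ON [PRIMARY];

-- ... bulk load stgFactSales here ...

-- Metadata-only "move" of the staged rows into the (empty) March partition.
ALTER TABLE stgFactSales SWITCH TO FactSales PARTITION 4;
```

Note that SWITCH only works into an empty partition, which is exactly why it covers inserts for the newest month but not updates to rows already sitting in older partitions.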

I'd take another look at the design and try to figure out if there's a way around the updates. Here are a few implications of updating the fact table:

Performance: Updates are fully logged transactions. Big fact tables also have lots of data to read and write.

Cubes: Updating the fact table requires reprocessing the affected partitions. As your fact table continues to grow, the cube processing time will continue to grow as well.

Budget: Fast storage is expensive. Updating big fact tables will require lots of fast reads and writes.

Purist theory: You should not change the fact table unless the initial value was an error (i.e. the user entered $15,000 instead of $1,500). Any non-error scenario will be changing the originally recorded transaction.

What is changing? Are the changing pieces really attributes of a dimension? If so, can they be moved to a dimension and have changes handled with a Slowly Changing Dimension type task?
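If the changing pieces do turn out to be dimension attributes, a Type 2 slowly changing dimension can absorb the changes instead of fact updates. A minimal sketch, assuming a hypothetical DimCustomer/stgCustomer pair with a tracked City attribute:

```sql
-- Illustrative Type 2 SCD load; table and column names are hypothetical.
DECLARE @LoadDate date = CAST(GETDATE() AS date);

-- Expire the current dimension row when a tracked attribute changed.
UPDATE d
SET    d.IsCurrent = 0,
       d.ValidTo   = @LoadDate
FROM   DimCustomer d
JOIN   stgCustomer s ON s.CustomerId = d.CustomerId
WHERE  d.IsCurrent = 1
  AND  d.City <> s.City;

-- Insert a fresh current row for new customers and for just-expired ones.
INSERT INTO DimCustomer (CustomerId, City, IsCurrent, ValidFrom, ValidTo)
SELECT s.CustomerId, s.City, 1, @LoadDate, NULL
FROM   stgCustomer s
WHERE  NOT EXISTS (SELECT 1
                   FROM   DimCustomer d
                   WHERE  d.CustomerId = s.CustomerId
                     AND  d.IsCurrent  = 1);
```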

Another possibility: can this be accomplished via offsetting transactions? Example:

The initial InvoiceAmount was $10.00. Accounting later added $1.25 for tax, then billed the customer for $11.25. Rather than updating the value to $11.25, insert a record for $1.25. The sum amount for the invoice will still be $11.25, and you can do a minimally logged insert rather than a fully logged update to accomplish it.
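In T-SQL the offsetting insert might look like this (a sketch; FactInvoice and its columns are illustrative):

```sql
-- Append a delta row instead of updating the original $10.00 fact row.
INSERT INTO FactInvoice (InvoiceId, DateKey, InvoiceAmount)
VALUES (1001, 20240315, 1.25);   -- the $1.25 tax adjustment

-- The invoice still totals $11.25 at query time.
SELECT InvoiceId, SUM(InvoiceAmount) AS TotalAmount
FROM   FactInvoice
WHERE  InvoiceId = 1001
GROUP BY InvoiceId;
```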

Not only is updating the fact table a bad idea in theory, it gets very expensive and non-scalable as the fact table grows. You'll be reading and writing more data, requiring more IOPS from the storage subsystem. When you get ready to do analytics, cube processing will then throw in more problems.

You'll also have to constantly justify to management why you need so many IOPS for the data warehouse. Is there business value/justification in needing all of those IOPS for your constantly changing "fact" table?

If you can't find a way around updates on the fact table, at least establish a cut-off point where the data is deemed read-only. Otherwise, you'll never be able to scale.
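One way to enforce such a cut-off (a sketch, assuming closed periods have been placed on their own filegroup, here hypothetically named FG_2023 in a database named DW):

```sql
-- Mark the filegroup holding closed periods read-only once the data is final.
ALTER DATABASE DW MODIFY FILEGROUP FG_2023 READ_ONLY;
```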

Switching does not help here.

Maybe you can execute updates concurrently using multiple threads on distinct ranges of rows. That might speed it up. Be careful not to trigger lock escalation so you get good concurrency.

Also make sure that you update the rows mostly in ascending sort order of the clustered index. This helps with disk IO (this technique might not work well with multi-threading).
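A sketch of that idea, batching updates in ascending clustered-key ranges so that no single statement acquires enough row locks (roughly 5,000) to trigger escalation; FactSales, stgChangedRows and FactKey are illustrative names, and each parallel worker would take its own disjoint key range:

```sql
-- Walk the clustered index key in ascending ranges, one small batch per statement.
DECLARE @FromKey int = (SELECT MIN(FactKey) FROM stgChangedRows),
        @MaxKey  int = (SELECT MAX(FactKey) FROM stgChangedRows),
        @Step    int = 4000;   -- keep each batch under the ~5,000-lock escalation point

WHILE @FromKey <= @MaxKey
BEGIN
    UPDATE f
    SET    f.InvoiceAmount = s.InvoiceAmount
    FROM   FactSales f
    JOIN   stgChangedRows s ON s.FactKey = f.FactKey
    WHERE  f.FactKey >= @FromKey
      AND  f.FactKey <  @FromKey + @Step;

    SET @FromKey = @FromKey + @Step;
END;
```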

There are as many reasons to update a fact record as there are non-identifying attributes in the fact. Unless you plan on a "delete first, then insert" approach, you simply cannot avoid updates. You cannot simply say "record the metric deltas as new facts".
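For completeness, the "delete first, then insert" pattern mentioned above might look like this (a sketch with illustrative names):

```sql
-- Replace changed fact rows wholesale instead of updating them in place.
BEGIN TRANSACTION;

DELETE f
FROM   FactSales f
JOIN   stgChangedRows s ON s.FactKey = f.FactKey;

INSERT INTO FactSales (FactKey, DateKey, InvoiceAmount)
SELECT s.FactKey, s.DateKey, s.InvoiceAmount
FROM   stgChangedRows s;

COMMIT TRANSACTION;
```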
