
Should a data warehouse ever be a primary data repository?

Typically a data warehouse is a means to consolidate multiple source systems, usually for reporting purposes. But are there any situations where it is a good idea to use the data warehouse as a primary database in its own right (i.e. capturing and managing data, not loading it from another system)?

The short answer: no, I don't think so. You should first design a good, normalized database structure for the production database, and worry about the data warehouse part later.

Keeping the warehouse and the production database separate is considered "best business practice", but perhaps most importantly it's a design consideration. The data warehouse and the production database serve two different purposes. Perhaps the most important function of the production database is to capture transactions reliably, consistently and unambiguously. This matters both when designing the database and when choosing the software, i.e. the database engine.

The design part shouldn't be underestimated. I'd say that in most projects, a good database design is one of the very first things one should be doing. Getting the table structure right is more important than choosing technology.

If you're planning to grow, it may also make good sense to try to isolate the data a bit into different databases. One reason is that it makes it easier to split up the database later and run different parts on different hardware (when you hit the "bleeding edge", it's a lot cheaper to buy more servers than to buy more powerful servers). Another is that you may later want to split out software modules or reuse them independently of each other. You wouldn't want to split up the data warehouse the same way.

So to recap: start with the production database, and build a data warehouse later. That can typically be postponed until you feel the urge to combine data from different sources, or to add redundancy into the production database to achieve faster reports.


I worked for an online gambling company for 8 years. I participated in the design of the production database, and I was later responsible for building the data warehouse, so the comments above are based on some of the mistakes made and lessons learnt.

Good recap by @tobixen. One more data warehouse best practice: you don't modify the data, you generally append to it (so history is kept).

Most data warehouse technologies are geared toward this pattern. So running a typical OLTP system or master data system on top of a data warehouse won't work properly.
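To illustrate the append-only pattern, here is a minimal sketch using Python's built-in `sqlite3` module. The table names (`oltp_customer`, `dwh_customer_history`), the columns and the sample values are all hypothetical, and SQLite is only a stand-in for whatever engines are really in use.

```python
import sqlite3

# In-memory database standing in for both systems (an assumption for brevity).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE oltp_customer (
        customer_id INTEGER PRIMARY KEY,
        city        TEXT NOT NULL
    );
    CREATE TABLE dwh_customer_history (
        customer_id INTEGER NOT NULL,
        city        TEXT NOT NULL,
        loaded_at   TEXT NOT NULL
    );
""")


def record_city(customer_id, city, loaded_at):
    """Apply one change to both stores (hypothetical helper)."""
    # OLTP side: overwrite the current state; the old value is lost.
    conn.execute(
        "INSERT OR REPLACE INTO oltp_customer (customer_id, city) VALUES (?, ?)",
        (customer_id, city),
    )
    # Warehouse side: append a new row; earlier rows are never touched.
    conn.execute(
        "INSERT INTO dwh_customer_history (customer_id, city, loaded_at) "
        "VALUES (?, ?, ?)",
        (customer_id, city, loaded_at),
    )


record_city(1, "Oslo", "2024-01-01")
record_city(1, "Bergen", "2024-06-01")

# The OLTP table only knows the latest city.
current = conn.execute(
    "SELECT city FROM oltp_customer WHERE customer_id = 1"
).fetchall()

# The warehouse table still holds every version, in load order.
history = conn.execute(
    "SELECT city, loaded_at FROM dwh_customer_history "
    "WHERE customer_id = 1 ORDER BY loaded_at"
).fetchall()
```

After the two changes, `current` holds a single row while `history` holds both, which is the "history is kept" property described above.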

The general flow is

OLTP database -> Data warehouse -> Archiving

The OLTP database may retain data for, say, 1 to 13 months. The data warehouse will retain data for, say, up to 5 or 10 years (not all details, probably summarized), after which the data goes to archiving.

That said, Big Data technologies like Hadoop / Hive are making it easier to store large amounts of data (say 10 years' worth) and process them cost-effectively.
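The retention flow in this answer could be sketched roughly as follows. The thresholds are assumptions taken from the example figures above (13 months in OLTP, 10 years in the warehouse), and `storage_tier` is a hypothetical helper, not part of any product. It returns the earliest stage of the flow that would still hold a record of a given age.

```python
from datetime import date

OLTP_MONTHS = 13       # assumed upper bound from "1 to 13 months"
WAREHOUSE_YEARS = 10   # assumed upper bound from "5 or 10 years"


def months_between(older, newer):
    """Whole calendar months elapsed from `older` to `newer`."""
    months = (newer.year - older.year) * 12 + (newer.month - older.month)
    if newer.day < older.day:
        months -= 1  # the last month is not complete yet
    return months


def storage_tier(record_date, today):
    """Earliest stage of the flow still expected to hold the record."""
    age = months_between(record_date, today)
    if age < OLTP_MONTHS:
        return "oltp"
    if age < WAREHOUSE_YEARS * 12:
        return "warehouse"
    return "archive"


today = date(2024, 3, 1)
recent = storage_tier(date(2024, 1, 15), today)
older = storage_tier(date(2020, 1, 1), today)
oldest = storage_tier(date(2010, 1, 1), today)
```

In practice the cut-offs are business decisions and often differ per table, so they are better treated as configuration than as constants.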

I would have to say generally not, but ... we are being asked to build a DWH that has as one of its source systems an "OLTP" system that collects Twitter data. In this case I think the Twitter-collecting "OLTP" system only keeps one week's worth of data and passes it across to the DWH each day for storage; then each day we feed summary information back to the Twitter OLTP system. In this case we will become the primary data repository for the Twitter data ... though to be honest, this is the first time this has been the case for us.

I think I would disagree with @Sathish Senathi on the statement that a data warehouse retains "not all details, probably summarized". I would never summarise data going into a DWH. Always keep the lowest-grain transaction data, though obviously not all the attributes that are in the OLTP system.

I admit to not having worked in online gambling, which has very large data volumes, but I've worked in telecoms and insurance, which have multi-terabyte DWHs. We always store down to the details of the phone calls or premium/claim transactions themselves.
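A small sketch of why the lowest grain matters: a summary can always be derived from the detail rows, but not the other way round. The call records and function names below are made up for illustration.

```python
from collections import defaultdict

# Hypothetical call detail records: (caller, day, duration in seconds).
calls = [
    ("alice", "2024-03-01", 120),
    ("alice", "2024-03-01", 1800),
    ("bob", "2024-03-01", 60),
    ("alice", "2024-03-02", 300),
]


def daily_totals(records):
    """Summarise detail rows into total seconds per caller per day."""
    totals = defaultdict(int)
    for caller, day, seconds in records:
        totals[(caller, day)] += seconds
    return dict(totals)


def longest_call(records, caller):
    """A question that needs detail rows; daily totals cannot answer it."""
    return max(seconds for who, _, seconds in records if who == caller)


# The summary is derivable from the detail at any time...
summary = daily_totals(calls)
alice_first_day = summary[("alice", "2024-03-01")]

# ...but the individual calls cannot be recovered from the summary.
alice_longest = longest_call(calls, "alice")
```

If only `summary` had been loaded into the warehouse, a later question such as "what was the longest call?" could no longer be answered.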
