Best approach to implement inheritance in a data warehouse based on a Postgres database
I am developing a multi-step data pipeline that should optimize the following process:
1) Extract data from a NoSQL database (MongoDB).
2) Transform and load the data into a relational (PostgreSQL) database.
3) Build a data warehouse using the Postgres database.
I have hand-coded a script that handles steps 1) and 2), which together form an intermediate ETL pipeline. My goal now is to build the data warehouse on the Postgres database, but I have a few doubts about the DW design. Below is the dimensional model for the relational database:
There are 2 main tables, Occurrence and Canonical, from which a set of other tables inherit (drawn in red and blue, respectively). Note that 2 child data types, ObserverNodeOccurrence and CanonicalObserverNode, have an extra many-to-many relationship with another table.
I did some research on how inheritance should be implemented in a data warehouse and concluded that the best practice is to merge each family of data types (super and child tables) into a single table. Doing this implies adding extra attributes and a lot of null values. My new dimensional model would look like the following:
Question 1: Do you think this is the best approach to address this problem? If not, what would be?
Question 2: Any software recommendations for on-premise data warehouses? (On-premise is a must, since the warehouse contains sensitive data.)
Having fewer tables to join and denormalizing the data usually improves performance for data warehouse queries, so both are generally considered a good thing.
This suggests your second table design. NULL values take up almost no space in a PostgreSQL table (a NULL is recorded as a single bit in the row's null bitmap rather than as a stored value), so you need not worry about that.
As described here, there are three options for implementing inheritance in a relational database.
IMO the only practicable option for a data warehouse is Table-Per-Hierarchy, which merges all entities into one table.
The reason is not only the performance gained by saving the joins. In a data warehouse the historical view of the data is often important. Think: how would you model a change of subtype for some entity?
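One possible way to picture that history question is sketched below. It assumes validity-date columns on a single merged table, a technique the answer itself does not name, and it uses Python's built-in sqlite3 as a stand-in for PostgreSQL. The table `occurrence_history`, its columns, and the helpers `change_type` and `type_as_of` are all hypothetical names chosen for illustration.

```python
import sqlite3

# In-memory SQLite stands in for PostgreSQL; every name below is hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE occurrence_history (
        occurrence_id   INTEGER NOT NULL,
        occurrence_type TEXT    NOT NULL,  -- which subtype this version belongs to
        valid_from      TEXT    NOT NULL,  -- ISO date, inclusive
        valid_to        TEXT,              -- ISO date, exclusive; NULL = current version
        PRIMARY KEY (occurrence_id, valid_from)
    )
    """
)


def change_type(conn, occurrence_id, new_type, changed_on):
    """Close the current version and open a new one carrying the new subtype."""
    conn.execute(
        "UPDATE occurrence_history SET valid_to = ? "
        "WHERE occurrence_id = ? AND valid_to IS NULL",
        (changed_on, occurrence_id),
    )
    conn.execute(
        "INSERT INTO occurrence_history VALUES (?, ?, ?, NULL)",
        (occurrence_id, new_type, changed_on),
    )


def type_as_of(conn, occurrence_id, day):
    """Return the subtype that was valid on the given ISO date, or None."""
    row = conn.execute(
        "SELECT occurrence_type FROM occurrence_history "
        "WHERE occurrence_id = ? AND valid_from <= ? "
        "AND (valid_to IS NULL OR valid_to > ?)",
        (occurrence_id, day, day),
    ).fetchone()
    return row[0] if row else None


# An entity starts as a plain Occurrence and later becomes an ObserverNodeOccurrence.
conn.execute(
    "INSERT INTO occurrence_history VALUES (1, 'Occurrence', '2020-01-01', NULL)"
)
change_type(conn, 1, "ObserverNodeOccurrence", "2020-06-01")
```

With one table per subtype, the same change would mean moving the row from one table to another; in the merged table it is just one more version row with a different type value.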
An important point is to define a discriminator column that uniquely identifies the source entity.
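As a rough illustration, a merged Occurrence table with a discriminator column could look like the sketch below. It uses Python's built-in sqlite3 as a stand-in for PostgreSQL, and the column names (`occurrence_type`, `description`, `node_label`) are assumptions, since the real attributes are only visible in the question's diagram.

```python
import sqlite3

# In-memory SQLite stands in for PostgreSQL; all column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE occurrence (
        occurrence_id   INTEGER PRIMARY KEY,
        occurrence_type TEXT NOT NULL,  -- discriminator: the source entity of the row
        description     TEXT,           -- attribute shared by the whole family (assumed)
        node_label      TEXT            -- attribute only ObserverNodeOccurrence has (assumed)
    )
    """
)

rows = [
    (1, "Occurrence", "plain occurrence", None),
    (2, "ObserverNodeOccurrence", "seen by a node", "node-A"),
    (3, "ObserverNodeOccurrence", "seen by a node", "node-B"),
]
conn.executemany("INSERT INTO occurrence VALUES (?, ?, ?, ?)", rows)

# One scan of one table, no joins: count rows per source entity.
counts = dict(
    conn.execute(
        "SELECT occurrence_type, COUNT(*) FROM occurrence GROUP BY occurrence_type"
    ).fetchall()
)

# Subtype-only columns stay NULL for rows that come from other entities.
null_labels = conn.execute(
    "SELECT COUNT(*) FROM occurrence WHERE node_label IS NULL"
).fetchone()[0]
```

Filtering on the discriminator recovers any single subtype without a join, which is the trade the merged design makes in exchange for the extra nullable columns.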