
Best approach to implement inheritance in a data warehouse based on a Postgres database

I am developing a multi-step data pipeline that should optimize the following process:

1) Extract data from a NoSQL database (MongoDB).

2) Transform and load the data into a relational (PostgreSQL) database.

3) Build a data warehouse using the Postgres database.

I have manually coded a script to handle steps 1) and 2), which is an intermediate ETL pipeline. Now my goal is to build the data warehouse using the Postgres database, but I have a few doubts regarding the DW design. Below is the dimensional model for the relational database:

[Image: dimensional model of the relational database]

There are two main tables, Occurrence and Canonical, from which a set of others inherit (drawn in red and blue, respectively). Note that two child types, ObserverNodeOccurrence and CanonicalObserverNode, have an extra many-to-many relationship with another table.

I did some research on how inheritance should be implemented in a data warehouse and concluded that the best practice is to merge each family of types (parent and child tables) into a single table. Doing this implies adding extra attributes and a lot of null values. My new dimensional model would look like the following:

[Image: new dimensional model with parent and child tables merged]

Question 1: Do you think this is the best approach to address this problem? If not, what would be?

Question 2: Any software recommendations for on-premise data warehouses? (On-premise is a must, since the warehouse contains sensitive data.)

Having fewer tables to join and denormalizing data usually improves performance for data warehouse queries, so both are often considered a good thing.

This would suggest your second table design. NULL values take up almost no space in a PostgreSQL table (a NULL is recorded as a bit in the row's null bitmap rather than stored as a value), so you need not worry about that.
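As a rough illustration of what the merged design means for the transform step, here is a minimal Python sketch that flattens a parent record and its child record into one wide row. All column names, and the `OtherOccurrence` subtype, are hypothetical placeholders, since the real attributes are only visible in the diagram.

```python
# Hypothetical attribute names; the real columns come from the model diagram.
PARENT_COLUMNS = ["occurrence_id", "created_at"]
SUBTYPE_COLUMNS = {
    "ObserverNodeOccurrence": ["node_name"],
    "OtherOccurrence": ["payload_size"],  # placeholder subtype, not from the question
}


def all_columns():
    """Columns of the merged table: type column, parent columns, every subtype column."""
    cols = ["occurrence_type"] + PARENT_COLUMNS
    for subtype_cols in SUBTYPE_COLUMNS.values():
        cols.extend(subtype_cols)
    return cols


def flatten(subtype, parent_row, child_row):
    """Merge one parent record and its child record into a single wide row.

    Attributes that belong to other subtypes stay None (NULL once loaded).
    """
    if subtype not in SUBTYPE_COLUMNS:
        raise ValueError(f"unknown subtype: {subtype}")
    row = dict.fromkeys(all_columns())  # every column starts as None
    row["occurrence_type"] = subtype
    for col in PARENT_COLUMNS:
        row[col] = parent_row.get(col)
    for col in SUBTYPE_COLUMNS[subtype]:
        row[col] = child_row.get(col)
    return row


wide_row = flatten(
    "ObserverNodeOccurrence",
    {"occurrence_id": 1, "created_at": "2020-01-01"},
    {"node_name": "node-a"},
)
print(wide_row)
```

Every row carries the full column list, so the load step can use one INSERT statement for all subtypes.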

As described here, there are three options to implement inheritance in a relational database.

IMO the only practicable option for a data warehouse is Table-Per-Hierarchy, which merges all entities into one table.

The reason is not only the performance gained by avoiding joins. In a data warehouse, the historical view of the data is often important. Consider: how would you model a change of subtype for some entity? With one table per subtype the row would have to move between tables; in a single table only the type value changes.
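One possible way to keep that history in a single table is a validity-range pattern (often called a type 2 slowly changing dimension). This is my own choice of technique, not something the answer prescribes. The sketch uses Python's built-in `sqlite3` as a stand-in for PostgreSQL so it runs anywhere; the table name `occurrence_history`, its columns, and the helper `change_subtype` are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # SQLite stands in for PostgreSQL here
conn.execute("""
    CREATE TABLE occurrence_history (
        occurrence_id   INTEGER NOT NULL,
        occurrence_type TEXT NOT NULL,   -- discriminator value for this version
        valid_from      TEXT NOT NULL,
        valid_to        TEXT,            -- NULL marks the current version
        PRIMARY KEY (occurrence_id, valid_from)
    )
""")


def change_subtype(connection, occurrence_id, new_type, change_date):
    """Close the current version (if any) and insert a new one with the new type."""
    with connection:  # both statements commit together or roll back together
        connection.execute(
            "UPDATE occurrence_history SET valid_to = ? "
            "WHERE occurrence_id = ? AND valid_to IS NULL",
            (change_date, occurrence_id),
        )
        connection.execute(
            "INSERT INTO occurrence_history VALUES (?, ?, ?, NULL)",
            (occurrence_id, new_type, change_date),
        )


change_subtype(conn, 1, "Occurrence", "2020-01-01")
change_subtype(conn, 1, "ObserverNodeOccurrence", "2020-06-01")

history = conn.execute(
    "SELECT occurrence_type, valid_from, valid_to FROM occurrence_history "
    "WHERE occurrence_id = 1 ORDER BY valid_from"
).fetchall()
print(history)
```

Because both versions live in the same table, the subtype change is an ordinary row version and needs no move between tables.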

It is important to define a discriminator column that uniquely identifies the source entity.
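A minimal sketch of such a discriminator column, again using Python's built-in `sqlite3` in place of PostgreSQL. The column `occurrence_type` and the other column names are assumptions for illustration; only the table and subtype names Occurrence and ObserverNodeOccurrence come from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # SQLite stands in for PostgreSQL here
conn.execute("""
    CREATE TABLE occurrence (
        occurrence_id   INTEGER PRIMARY KEY,
        occurrence_type TEXT NOT NULL     -- discriminator column
            CHECK (occurrence_type IN ('Occurrence', 'ObserverNodeOccurrence')),
        created_at      TEXT NOT NULL,
        node_name       TEXT              -- only filled for ObserverNodeOccurrence
    )
""")
conn.executemany(
    "INSERT INTO occurrence VALUES (?, ?, ?, ?)",
    [
        (1, "Occurrence", "2020-01-01", None),
        (2, "ObserverNodeOccurrence", "2020-01-02", "node-a"),
    ],
)


def rows_of_type(connection, occurrence_type):
    """Return the rows that originated from one source entity."""
    cur = connection.execute(
        "SELECT occurrence_id, node_name FROM occurrence "
        "WHERE occurrence_type = ? ORDER BY occurrence_id",
        (occurrence_type,),
    )
    return cur.fetchall()


observer_rows = rows_of_type(conn, "ObserverNodeOccurrence")
print(observer_rows)
```

The CHECK constraint rejects rows whose type is not one of the known source entities, so every row in the merged table can be traced back to exactly one of them.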
