
PostgreSQL to Data-Warehouse: Best approach for near-real-time ETL / extraction of data

Background:

I have a PostgreSQL (v8.3) database that is heavily optimized for OLTP.

I need to extract data from it on a semi-real-time basis (someone is bound to ask what semi-real-time means; the answer is as frequently as I reasonably can, but I will be pragmatic, so as a benchmark let's say we are hoping for every 15 minutes) and feed it into a data-warehouse.

How much data? At peak times we are talking approx 80-100k rows per minute hitting the OLTP side; off-peak this will drop significantly to 15-20k. The most frequently updated rows are ~64 bytes each, but there are various tables etc., so the data is quite diverse and can range up to 4000 bytes per row. The OLTP is active 24x5.5.

Best Solution?

From what I can piece together, the most practical solution is as follows:

  • Create a TRIGGER to write all DML activity to a rotating CSV log file (a minimal sketch of such a trigger follows this list)
  • Perform whatever transformations are required
  • Use the native DW data pump tool to efficiently pump the transformed CSV into the DW
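
As a rough illustration of the trigger step: a plain PL/pgSQL trigger cannot safely write to arbitrary files, so one common pattern is to capture DML into a staging table and periodically COPY it out to a rotating CSV. The following is a minimal sketch under that assumption; the staging table etl_change_log and the tracked table orders are hypothetical names, not part of the actual schema.

    -- Hypothetical staging table; every name here is an assumption.
    CREATE TABLE etl_change_log (
        change_time timestamptz NOT NULL DEFAULT now(),
        table_name  text        NOT NULL,
        op          char(1)     NOT NULL,   -- 'I', 'U' or 'D'
        row_data    text        NOT NULL    -- serialized row for later transformation
    );

    CREATE OR REPLACE FUNCTION log_dml() RETURNS trigger AS $$
    BEGIN
        IF TG_OP = 'DELETE' THEN
            INSERT INTO etl_change_log (table_name, op, row_data)
            VALUES (TG_TABLE_NAME, 'D', OLD::text);
            RETURN OLD;
        ELSE
            INSERT INTO etl_change_log (table_name, op, row_data)
            VALUES (TG_TABLE_NAME, substr(TG_OP, 1, 1), NEW::text);
            RETURN NEW;
        END IF;
    END;
    $$ LANGUAGE plpgsql;

    -- Attach to each table of interest (row-to-text casts need 8.4+;
    -- on 8.3 concatenate the columns of interest instead).
    CREATE TRIGGER orders_etl_log
        AFTER INSERT OR UPDATE OR DELETE ON orders
        FOR EACH ROW EXECUTE PROCEDURE log_dml();

    -- Every ~15 minutes, export and rotate from psql/cron, e.g.:
    -- \copy (SELECT * FROM etl_change_log) TO 'changes.csv' WITH CSV
    -- then DELETE the exported rows.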

Why this approach?

  • TRIGGERS allow specific tables to be targeted rather than being system-wide, the output is configurable (i.e. into a CSV), and they are relatively easy to write and deploy. SLONY uses a similar approach and the overhead is acceptable
  • CSV is easy and fast to transform
  • Easy to pump CSV into the DW

Alternatives considered...

  • Using native logging ( http://www.postgresql.org/docs/8.3/static/runtime-config-logging.html ). The problem with this is that it looked very verbose relative to what I needed and was a little trickier to parse and transform. However, it could be faster, as I presume there is less overhead compared to a TRIGGER. Certainly it would make the admin easier as it is system-wide, but again, I don't need some of the tables (some are used for persistent storage of JMS messages which I do not want to log)
  • Querying the data directly via an ETL tool such as Talend and pumping it into the DW ... the problem is that the OLTP schema would need to be tweaked to support this, and that has many negative side-effects
  • Using a tweaked/hacked SLONY - SLONY does a good job of logging and migrating changes to a slave, so the conceptual framework is there, but the proposed solution just seems easier and cleaner
  • Using the WAL

Has anyone done this before? Want to share your thoughts?

Assuming that your tables of interest have (or can be augmented with) a unique, indexed, sequential key, then you will get much, much better value out of simply issuing SELECT ... FROM table ... WHERE key > :last_max_key with output to a file, where last_max_key is the last key value from the last extraction (0 if first extraction). This incremental, decoupled approach avoids introducing trigger latency in the insertion datapath (be it custom triggers or modified Slony), and depending on your setup could scale better with the number of CPUs etc. (However, if you also have to track UPDATEs, and the sequential key was added by you, then your UPDATE statements should SET the key column to NULL so it gets a new value and gets picked up by the next extraction. You would not be able to track DELETEs without a trigger.) Is this what you had in mind when you mentioned Talend?
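
A minimal sketch of that incremental pull, assuming a hypothetical source table orders keyed by a sequential bigint/bigserial column id (all object names and paths are illustrative; server-side COPY needs superuser rights, so use psql's \copy for a client-side file instead):

    -- last_max_key is the high-water mark recorded by the previous extraction
    -- (0 on the very first run); substitute the stored value below.
    COPY (
        SELECT *
        FROM   orders
        WHERE  id > 123456          -- :last_max_key from the previous run
        ORDER  BY id
    ) TO '/var/lib/postgresql/etl/orders_delta.csv' WITH CSV;

    -- Record the new high-water mark for the next run.
    SELECT max(id) FROM orders;

    -- If UPDATEs must be captured too and the key exists only for ETL, give
    -- updated rows a fresh key so the next pull sees them, e.g.:
    -- UPDATE orders SET id = nextval('orders_id_seq'), ... WHERE ...;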

I would not use the logging facility unless you cannot implement the solution above; logging most likely involves locking overhead to ensure log lines are written sequentially and do not overlap/overwrite each other when multiple backends write to the log (check the Postgres source). The locking overhead may not be catastrophic, but you can do without it if you can use the incremental SELECT alternative. Moreover, statement logging would drown out any useful WARNING or ERROR messages, and the parsing itself will not be instantaneous.

Unless you are willing to parse WALs (including transaction state tracking, and being ready to rewrite the code every time you upgrade Postgres) I would not necessarily use the WALs either -- that is, unless you have the extra hardware available, in which case you could ship WALs to another machine for extraction (on the second machine you can use triggers shamelessly -- or even statement logging -- since whatever happens there does not affect INSERT/UPDATE/DELETE performance on the primary machine). Note that performance-wise (on the primary machine), unless you can write the logs to a SAN, you'd get a comparable performance hit (in terms of thrashing the filesystem cache, mostly) from shipping WALs to a different machine as from running the incremental SELECT.

If you can think of a 'checksum table' that contains only the IDs and the 'checksum', you can not only do a quick select of the new records but also of the changed and deleted records.

The checksum could be any crc32 checksum function you like.
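
A minimal sketch of that checksum-table idea, assuming a hypothetical source table orders(id, ...) and using md5() in place of a crc32 function (PostgreSQL has no built-in crc32; all names are illustrative):

    -- Snapshot of one checksum per source row, refreshed after each extraction.
    CREATE TABLE orders_checksum (
        id       bigint PRIMARY KEY,
        checksum text   NOT NULL
    );

    -- New or changed rows since the last snapshot
    -- (row-to-text casts need 8.4+; on 8.3 concatenate the columns instead):
    SELECT o.*
    FROM   orders o
    LEFT   JOIN orders_checksum c ON c.id = o.id
    WHERE  c.id IS NULL                      -- new
       OR  c.checksum <> md5(o::text);       -- changed

    -- Rows deleted from the source since the last snapshot:
    SELECT c.id
    FROM   orders_checksum c
    LEFT   JOIN orders o ON o.id = c.id
    WHERE  o.id IS NULL;

    -- Refresh the snapshot after extraction:
    TRUNCATE orders_checksum;
    INSERT INTO orders_checksum (id, checksum)
    SELECT id, md5(o::text) FROM orders o;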

The new ON CONFLICT clause in PostgreSQL has changed the way I do many updates. I pull the new data (based on a row_update_timestamp) into a temp table, then in one SQL statement INSERT into the target table with ON CONFLICT UPDATE. If your target table is partitioned then you need to jump through a couple of hoops (i.e. hit the partition table directly). The ETL can happen as you load the temp table (most likely) or in the ON CONFLICT SQL (if trivial). Compared to other "UPSERT" systems (update, insert if zero rows, etc.) this shows a huge speed improvement. In our particular DW environment we don't need/want to accommodate DELETEs. Check out the ON CONFLICT docs - it gives Oracle's MERGE a run for its money!
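
A minimal sketch of that upsert pattern, assuming a hypothetical DW target table dw_orders with a unique constraint on order_id and a matching staging temp table (names are illustrative; ON CONFLICT requires PostgreSQL 9.5 or later):

    -- Stage the delta pulled from the OLTP side (based on row_update_timestamp).
    CREATE TEMP TABLE stage_orders (
        order_id             bigint PRIMARY KEY,
        amount               numeric,
        row_update_timestamp timestamptz
    );
    -- ... COPY / INSERT the extracted rows into stage_orders here ...

    -- One statement upserts the whole batch into the DW target;
    -- dw_orders must have a unique index/constraint on order_id.
    INSERT INTO dw_orders (order_id, amount, row_update_timestamp)
    SELECT order_id, amount, row_update_timestamp
    FROM   stage_orders
    ON CONFLICT (order_id) DO UPDATE
        SET amount               = EXCLUDED.amount,
            row_update_timestamp = EXCLUDED.row_update_timestamp;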
