PostgreSQL to Data-Warehouse: Best approach for near-real-time ETL / extraction of data
Background:

I have a PostgreSQL (v8.3) database that is heavily optimized for OLTP.

I need to extract data from it on a semi-real-time basis (someone is bound to ask what semi-real-time means, and the answer is: as frequently as I reasonably can, but I will be pragmatic; as a benchmark, let's say we are hoping for every 15 minutes) and feed it into a data warehouse.

How much data? At peak times we are talking approx 80-100k rows per minute hitting the OLTP side; off-peak this drops significantly to 15-20k. The most frequently updated rows are ~64 bytes each, but there are various tables etc., so the data is quite diverse and can range up to 4,000 bytes per row. The OLTP side is active 24x5.5.
Best Solution?

From what I can piece together, the most practical solution is as follows:

Why this approach?

Alternatives considered ....

Has anyone done this before? Want to share your thoughts?
Assuming that your tables of interest have (or can be augmented with) a unique, indexed, sequential key, then you will get much, much better value out of simply issuing

SELECT ... FROM table ... WHERE key > :last_max_key

with output to a file, where last_max_key is the last key value from the last extraction (0 if first extraction). This incremental, decoupled approach avoids introducing trigger latency in the insertion datapath (be it custom triggers or modified Slony), and depending on your setup could scale better with the number of CPUs etc. (However, if you also have to track UPDATEs, and the sequential key was added by you, then your UPDATE statements should SET the key column to NULL so it gets a new value and gets picked up by the next extraction. You would not be able to track DELETEs without a trigger.) Is this what you had in mind when you mentioned Talend?
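The incremental-key extraction described above can be sketched as follows. This is a minimal illustration using Python's built-in sqlite3 standing in for PostgreSQL (against Postgres you would run the same SQL through a driver such as psycopg2); the table and column names are made up for the example.

```python
import sqlite3

# In-memory SQLite stands in for the OLTP PostgreSQL database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events (key INTEGER PRIMARY KEY AUTOINCREMENT, payload TEXT)"
)
conn.executemany("INSERT INTO events (payload) VALUES (?)",
                 [("a",), ("b",), ("c",)])

def extract_increment(conn, last_max_key):
    """Pull only the rows inserted since the previous extraction."""
    rows = conn.execute(
        "SELECT key, payload FROM events WHERE key > ? ORDER BY key",
        (last_max_key,),
    ).fetchall()
    # The new high-water mark is returned here; in practice it would be
    # persisted (in the warehouse or a state file) between runs.
    new_max = rows[-1][0] if rows else last_max_key
    return rows, new_max

rows, last_max_key = extract_increment(conn, 0)        # first run: all rows
conn.execute("INSERT INTO events (payload) VALUES ('d')")
rows2, last_max_key = extract_increment(conn, last_max_key)  # only the new row
```

Each run touches only rows past the stored high-water mark, so the OLTP insert path carries no trigger overhead at all.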
I would not use the logging facility unless you cannot implement the solution above; logging most likely involves locking overhead to ensure log lines are written sequentially and do not overlap/overwrite each other when multiple backends write to the log (check the Postgres source). The locking overhead may not be catastrophic, but you can do without it if you can use the incremental SELECT alternative. Moreover, statement logging would drown out any useful WARNING or ERROR messages, and the parsing itself will not be instantaneous.
Unless you are willing to parse WALs (including transaction state tracking, and being ready to rewrite the code every time you upgrade Postgres) I would not necessarily use the WALs either -- that is, unless you have the extra hardware available, in which case you could ship WALs to another machine for extraction (on the second machine you can use triggers shamelessly -- or even statement logging -- since whatever happens there does not affect INSERT/UPDATE/DELETE performance on the primary machine). Note that performance-wise (on the primary machine), unless you can write the logs to a SAN, you'd get a comparable performance hit (in terms of thrashing the filesystem cache, mostly) from shipping WALs to a different machine as from running the incremental SELECT.
If you can think of a 'checksum table' that contains only the ids and the 'checksum', you can not only do a quick select of the new records, but also of the changed and deleted records. The checksum could be any crc32 checksum function you like.
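The checksum-table idea can be sketched like this: a minimal example using Python's stdlib zlib.crc32 and sqlite3 in place of PostgreSQL, with hypothetical table names. Diffing the current checksums against the stored ones classifies every id as new, changed, or deleted.

```python
import sqlite3
import zlib

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE source (id INTEGER PRIMARY KEY, payload TEXT)")
conn.execute("CREATE TABLE checksums (id INTEGER PRIMARY KEY, crc INTEGER)")
conn.executemany("INSERT INTO source VALUES (?, ?)",
                 [(1, "alpha"), (2, "beta"), (3, "gamma")])

def crc(payload):
    # Any stable checksum works; crc32 is cheap and good enough for diffing.
    return zlib.crc32(payload.encode())

def diff_against_checksums(conn):
    """Compare the live rows against the checksum table and return
    the ids of new, changed and deleted records."""
    current = {i: crc(p) for i, p in conn.execute("SELECT id, payload FROM source")}
    stored = dict(conn.execute("SELECT id, crc FROM checksums"))
    new = sorted(current.keys() - stored.keys())
    deleted = sorted(stored.keys() - current.keys())
    changed = sorted(i for i in current.keys() & stored.keys()
                     if current[i] != stored[i])
    return new, changed, deleted

# Seed the checksum table from the initial state, then mutate the source.
snapshot = conn.execute("SELECT id, payload FROM source").fetchall()
conn.executemany("INSERT INTO checksums VALUES (?, ?)",
                 [(i, crc(p)) for i, p in snapshot])
conn.execute("UPDATE source SET payload = 'BETA' WHERE id = 2")
conn.execute("DELETE FROM source WHERE id = 3")
conn.execute("INSERT INTO source VALUES (4, 'delta')")
new, changed, deleted = diff_against_checksums(conn)
```

Unlike the plain key > last_max_key extraction, this detects updates and deletes too, at the cost of scanning and checksumming the rows of interest on each run.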
The new ON CONFLICT clause in PostgreSQL has changed the way I do many updates. I pull the new data (based on a row_update_timestamp) into a temp table, then in one SQL statement INSERT into the target table with ON CONFLICT UPDATE. If your target table is partitioned then you need to jump through a couple of hoops (i.e. hit the partition table directly). The ETL can happen as you load the temp table (most likely) or in the ON CONFLICT SQL (if trivial). Compared to other "UPSERT" systems (update, then insert if zero rows, etc.) this shows a huge speed improvement. In our particular DW environment we don't need/want to accommodate DELETEs. Check out the ON CONFLICT docs - it gives Oracle's MERGE a run for its money!
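The staging-table-plus-upsert flow above can be sketched as follows. This uses Python's sqlite3, whose ON CONFLICT ... DO UPDATE syntax (SQLite 3.24+) is close to PostgreSQL's; the table and column names are invented for the example, and the trailing WHERE true is a SQLite parsing requirement you would not need in Postgres.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE target (id INTEGER PRIMARY KEY, val TEXT, updated_at TEXT)"
)
conn.execute(
    "CREATE TEMP TABLE staging (id INTEGER PRIMARY KEY, val TEXT, updated_at TEXT)"
)

# One pre-existing warehouse row, plus a freshly pulled batch in staging.
conn.execute("INSERT INTO target VALUES (1, 'old', '2024-01-01')")
conn.executemany("INSERT INTO staging VALUES (?, ?, ?)",
                 [(1, "new", "2024-02-01"), (2, "fresh", "2024-02-01")])

# A single statement merges the batch: known ids are updated in place,
# unknown ids are inserted.
conn.execute("""
    INSERT INTO target (id, val, updated_at)
    SELECT id, val, updated_at FROM staging WHERE true
    ON CONFLICT(id) DO UPDATE SET
        val = excluded.val,
        updated_at = excluded.updated_at
""")
rows = conn.execute("SELECT id, val FROM target ORDER BY id").fetchall()
```

Because the merge is one statement, there is no per-row "try UPDATE, fall back to INSERT" round-tripping, which is where the speed improvement over hand-rolled upserts comes from.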