
Continually process data from a PostgreSQL database - what approach to take?

Have a question about what sort of approach to take on a process I am trying to structure. Working with PostgreSQL and Python.

Scenario:

  • I have two databases A and B.
  • B is a processed version of A.
  • Data continually streams into A, needs to be processed in a certain way (using multiprocessing), and is then stored in B.
  • Each new row in A needs to be processed only once.

So:

streamofdata ===> [database A] ----> process ----> [database B]

Database A is fairly large (40 GB) and growing. My question is about determining which data is new, i.e. has not yet been processed and put into B. What is the best way to determine which rows still have to be processed?

I am guessing that matching primary keys each time against what has already been processed is not the way to go.

So let's say new rows 120 to 130 come into database A over some time period, and the last row I processed was 119. Is it a correct approach to look at the last processed row id (the primary key), 119, and say that anything beyond it should now be processed?
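The last-processed-id idea above can be sketched as a small polling loop. This is a minimal illustration, assuming the primary key is a monotonically increasing integer; the table name `raw_data`, the column `payload`, the DSN, and the one-second poll interval are placeholders, not from the question:

```python
import time

# Fetch only rows beyond the high-water mark, in insertion order.
POLL_SQL = "SELECT id, payload FROM raw_data WHERE id > %s ORDER BY id"

def next_batch(cursor, last_id):
    """Fetch every row inserted after last_id, in id order."""
    cursor.execute(POLL_SQL, (last_id,))
    return cursor.fetchall()

def advance(last_id, rows):
    """Return the new high-water mark after a batch has been processed."""
    return rows[-1][0] if rows else last_id

def run(dsn, last_id=119):
    """Poll database A forever; defined but not called here (needs a live DB)."""
    import psycopg2  # deferred so the helpers above import without psycopg2
    conn = psycopg2.connect(dsn)
    while True:
        with conn.cursor() as cur:
            rows = next_batch(cur, last_id)
        for _row_id, _payload in rows:
            pass  # process the row and write the result into database B
        last_id = advance(last_id, rows)  # persist this durably in practice
        time.sleep(1.0)
```

One caveat with this scheme: the high-water mark itself should be stored somewhere durable (a one-row table in database B works well), so a crashed worker can resume without reprocessing or skipping rows.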

Also wondering whether anyone has any further resources on this sort of 'realtime' processing of data. I am not exactly sure what I am looking for, technically speaking.

Well, there are a few ways you could handle this problem. As a reminder, the process you are describing is basically re-implementing a form of database replication, so you may want to familiarize yourself with the various popular replication options out there for Postgres and how they work; in particular, Slony might be of interest to you. You didn't specify what sort of database "database B" is, so I'll assume it's a separate PostgreSQL instance, though that assumption won't change a whole lot about the decisions below, other than ruling out some canned solutions like Slony.

  1. Set up a FOR EACH ROW trigger on the important table(s) in database A which need to be replicated. Your trigger would take each new row INSERTed (and/or UPDATEd or DELETEd, if you need to catch those) in those tables and send it off to database B appropriately. You mentioned using Python, so just a reminder that you can certainly write these trigger functions in PL/Python if that makes life easy for you, i.e. you should hopefully be able to more-or-less easily tweak your existing code so that it runs inside the database as a PL/Python trigger function.
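As one concrete way to realize option 1, the trigger can push each new row into database B through a `postgres_fdw` foreign table, using plain PL/pgSQL rather than the PL/Python variant the answer mentions. This is a sketch only; every object name here (`raw_data`, `b_raw_data`, `database_b`, the host) is an illustrative placeholder, and the SQL is kept in strings you would run once against database A:

```python
# DDL for a cross-database "copy on insert" trigger, assuming postgres_fdw
# is available and database B exposes a compatible raw_data table.
SETUP_FDW_SQL = """
CREATE EXTENSION IF NOT EXISTS postgres_fdw;
CREATE SERVER IF NOT EXISTS database_b
    FOREIGN DATA WRAPPER postgres_fdw
    OPTIONS (host 'b.example.com', dbname 'database_b');
-- a matching CREATE USER MAPPING is also needed for authentication
CREATE FOREIGN TABLE IF NOT EXISTS b_raw_data (id bigint, payload text)
    SERVER database_b OPTIONS (table_name 'raw_data');
"""

TRIGGER_SQL = """
CREATE OR REPLACE FUNCTION copy_to_b() RETURNS trigger AS $$
BEGIN
    INSERT INTO b_raw_data (id, payload) VALUES (NEW.id, NEW.payload);
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER raw_data_to_b
    AFTER INSERT ON raw_data
    FOR EACH ROW EXECUTE PROCEDURE copy_to_b();
"""
```

Note the design trade-off: a synchronous trigger like this ties the latency and availability of database B to every insert into A, which is part of why canned replication systems exist.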

  2. If you read up on Slony, you might have noticed that proposal #1 is very similar to how Slony works. Consider whether it would be easy or helpful for you to have Slony take over the replication of the necessary tables from database A to database B; then, if you need to further move/transform the data into other tables inside database B, you might do that with triggers on those tables in database B.

  3. Set up a trigger or RULE which will send out a NOTIFY with a payload indicating the row which has changed. Your code will LISTEN for these notifications and know immediately which rows have changed. The psycopg2 adapter has good support for LISTEN and NOTIFY. N.B. you will need to exercise some care to handle the case where your listener code has crashed, gets disconnected from the database, or otherwise misses some notifications.
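A minimal sketch of the listener side of option 3 with psycopg2. The channel name `new_row`, the payload convention (the trigger calls `pg_notify('new_row', NEW.id::text)`), and the DSN are assumptions, not from the answer; the five-second select timeout is where you would hook in a catch-up re-poll for missed notifications:

```python
import select

def parse_payload(payload):
    """Assumes the trigger sends the new row's id as a decimal string."""
    return int(payload)

def listen_forever(dsn, channel="new_row"):
    """Consume notifications; defined but not called here (needs a live DB)."""
    import psycopg2
    import psycopg2.extensions

    conn = psycopg2.connect(dsn)
    # LISTEN registrations only take effect outside a transaction block.
    conn.set_isolation_level(psycopg2.extensions.ISOLATION_LEVEL_AUTOCOMMIT)
    with conn.cursor() as cur:
        cur.execute("LISTEN " + channel + ";")
    while True:
        # Block until the connection's socket is readable, then drain the queue.
        if select.select([conn], [], [], 5.0) == ([], [], []):
            continue  # timeout: a good moment to re-poll for missed rows
        conn.poll()
        while conn.notifies:
            note = conn.notifies.pop(0)
            row_id = parse_payload(note.payload)
            del row_id  # here: fetch this row from A, process, write into B
```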

  4. In case you have control over the code streaming data into database A, you could have that code take over the job of replicating its new data into database B.
