
Sync PostgreSQL data with Elasticsearch

Ultimately I want to have a scalable search solution for the data in PostgreSQL. My research points me towards using Logstash to ship write events from Postgres to Elasticsearch; however, I have not found a usable solution. The solutions I have found involve using the jdbc input to query all data from Postgres on an interval, and delete events are not captured.

I think this is a common use case, so I hope you can share your experience or give me some pointers on how to proceed.

If you also need to be notified on DELETEs and delete the respective records in Elasticsearch, it is true that the Logstash jdbc input will not help. You'd have to use a solution that works off the database's change log (the binlog in MySQL terms; in PostgreSQL, the write-ahead log), as suggested here.

However, if you still want to use the Logstash jdbc input, what you could do is simply soft-delete records in PostgreSQL, i.e. create a new BOOLEAN column in order to mark your records as deleted. The same flag would then exist in Elasticsearch, and you can exclude soft-deleted records from your searches with a simple term query on the deleted field.
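As a sketch, assuming a hypothetical book table (the table, column, and ISBN values here are just for illustration):

```sql
-- Add the soft-delete flag; a "delete" then becomes an update,
-- which the Logstash jdbc input will pick up on its next run
ALTER TABLE book ADD COLUMN deleted BOOLEAN NOT NULL DEFAULT FALSE;

UPDATE book SET deleted = TRUE WHERE isbn = '978-3-16-148410-0';
```

Once the flag is synced, a bool query with a must_not clause on the deleted field hides those records in Elasticsearch:

```json
{
  "query": {
    "bool": {
      "must_not": [
        { "term": { "deleted": true } }
      ]
    }
  }
}
```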

Whenever you need to perform some cleanup, you can then physically delete all records flagged as deleted in both PostgreSQL and Elasticsearch.
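For example, against the same hypothetical book table (on the Elasticsearch side, the standard _delete_by_query API takes the same term query):

```sql
-- Physically remove the soft-deleted rows in PostgreSQL
DELETE FROM book WHERE deleted;
```

```
POST /book/_delete_by_query
{
  "query": { "term": { "deleted": true } }
}
```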

You can also take a look at PGSync.

It's similar to Debezium but a lot easier to get up and running.

PGSync is a change data capture (CDC) tool for moving data from Postgres to Elasticsearch. It allows you to keep Postgres as your source of truth and expose structured, denormalized documents in Elasticsearch.

You simply define a JSON schema describing the structure of the data in Elasticsearch.

Here is an example schema (you can also have nested objects):

{ "nodes": { "table": "book", "columns": [ "isbn", "title", "description" ] } }

PGSync generates the queries for your documents on the fly, so there is no need to write queries as you would with Logstash. It also supports and tracks deletion operations.

It operates both a polling and an event-driven model: the initial sync polls the database for changes since the last time the daemon was run, and thereafter event notifications (based on triggers and handled by pg_notify) capture changes to the database as they occur.
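As an illustration of the notification half (this is not PGSync's actual trigger, just the underlying LISTEN/NOTIFY mechanism it builds on, using the hypothetical book table again):

```sql
-- A trigger function that pushes each change onto a notification channel
CREATE OR REPLACE FUNCTION notify_change() RETURNS trigger AS $$
BEGIN
  PERFORM pg_notify('book_changes', row_to_json(NEW)::text);
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

-- Fire it after every insert or update; a consumer runs
-- LISTEN book_changes to receive the JSON payloads
CREATE TRIGGER book_notify
AFTER INSERT OR UPDATE ON book
FOR EACH ROW EXECUTE FUNCTION notify_change();
```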

It has very little development overhead:

  • Create a schema as described above
  • Point pgsync at your Postgres database and Elasticsearch cluster
  • Start the daemon (a minimal sketch of these steps follows below)
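A minimal sketch of those steps, assuming the schema above is saved as schema.json. The environment variable names and commands follow the PGSync README, but verify them against the version you install:

```bash
# PGSync reads its connection settings from the environment
export PG_HOST=localhost PG_USER=postgres PG_PASSWORD=secret
export ELASTICSEARCH_HOST=localhost ELASTICSEARCH_PORT=9200
export REDIS_HOST=localhost  # PGSync uses Redis for its internal queue

# One-time setup: creates the triggers and bookkeeping PGSync needs
# (PostgreSQL must be configured with wal_level = logical)
bootstrap --config schema.json

# Perform the initial sync, then keep listening for changes as a daemon
pgsync --config schema.json --daemon
```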

You can easily create a document that includes multiple relations as nested objects. PGSync tracks any changes for you.

Have a look at the GitHub repo for more details.

You can install the package from PyPI.
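For example, assuming pgsync is the package name on PyPI:

```bash
pip install pgsync
```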

Please take a look at Debezium. It's a change data capture (CDC) platform which allows you to stream your data.

I created a simple GitHub repository which shows how it works.
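For illustration, a Debezium Postgres connector is typically registered with Kafka Connect via a JSON payload along these lines. The connector name, hosts, and credentials below are placeholders, and the exact property names vary between Debezium versions, so check the Debezium documentation for yours:

```json
{
  "name": "inventory-connector",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "database.hostname": "localhost",
    "database.port": "5432",
    "database.user": "postgres",
    "database.password": "secret",
    "database.dbname": "mydb",
    "topic.prefix": "dbserver1",
    "plugin.name": "pgoutput"
  }
}
```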

