
Elasticsearch replication of other system data?

Original post: 2015-12-27 02:29:11 3 2 elasticsearch/ architecture

Suppose I want to use Elasticsearch to implement a generic search on a website. The top search bar would be expected to find resources of all different kinds across the site: documents for sure (uploaded and indexed via Tika), but also things like clients, accounts, other people, etc.

For architectural reasons, most of the non-document data (clients, accounts) will live in a relational database.

When implementing this search, option #1 would be to create document versions of everything and then use Elasticsearch to run all aspects of the search, not relying on the relational database at all to find the different types of objects.

Option #2 would be to use Elasticsearch only for indexing the documents. For a general "site search" feature, that would mean farming out multiple searches to multiple systems, then aggregating the results before returning them.

Option #1 seems far superior, but the downside is that Elasticsearch must in essence hold a copy of a great many things from the production relational database, and those copies must be kept fresh as things change.

What's the best option for keeping these stores in sync, and am I correct in thinking that for general search, option #1 is superior? Is there an option #3?

2 answers

You've pretty much listed the two main options for searching across multiple data stores, i.e. search in one central data store (option #1) or search in all data stores and aggregate the results (option #2).

Both options would work, although option #2 has two main drawbacks:

It requires a substantial amount of logic in your application to "branch out" the searches to the multiple data stores and aggregate the results you get back.
The response times might differ for each data store, so you have to wait for the slowest data store to respond before you can present the search results to the user (unless you work around this with asynchronous techniques such as Ajax, WebSocket, etc.).
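To illustrate both drawbacks, here is a minimal sketch of the fan-out that option #2 implies. The two backend functions, their delays and their scores are hypothetical stand-ins, not real client calls; the point is only that the application owns the branching and merging, and that the total latency follows the slowest store.

```python
import asyncio
import time

# Hypothetical stand-ins for the real backends; names, delays and scores are made up.
async def search_documents(query: str) -> list[dict]:
    await asyncio.sleep(0.05)  # pretend the document index answers quickly
    return [{"type": "document", "title": f"{query} handbook", "score": 2.0}]

async def search_clients(query: str) -> list[dict]:
    await asyncio.sleep(0.20)  # pretend the relational database is slower
    return [{"type": "client", "title": f"{query} Ltd", "score": 1.0}]

async def site_search(query: str) -> list[dict]:
    # Branch out: send the query to every backend concurrently.
    per_backend = await asyncio.gather(
        search_documents(query),
        search_clients(query),
    )
    # Aggregate: flatten and rank. Scores coming from different systems are
    # not really comparable, which is part of the extra logic you have to own.
    merged = [hit for hits in per_backend for hit in hits]
    return sorted(merged, key=lambda hit: hit["score"], reverse=True)

start = time.perf_counter()
results = asyncio.run(site_search("acme"))
elapsed = time.perf_counter() - start  # dominated by the slowest backend

for hit in results:
    print(hit["type"], hit["title"])
print(f"took {elapsed:.2f}s")
```

Even with the calls running concurrently, nothing can be shown until `gather` returns, which is when the slowest backend has answered.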

If you want to provide a better and more reliable search experience, option #1 would clearly get my vote (it's actually what I do most of the time). As you've correctly stated, the main "drawback" of this option is that you need to keep Elasticsearch in sync with the changes in your other master data stores.

Since your other data stores will be relational databases, you have a few different options to keep them in sync with Elasticsearch, namely:

using the Logstash JDBC input
using the JDBC importer tool

These first two options work great but have one main disadvantage: they don't capture DELETEs on your tables, only INSERTs and UPDATEs. This means that if you ever delete a user, account, etc., you will have no way of knowing that you must delete the corresponding document in Elasticsearch. Unless, of course, you decide to delete the Elasticsearch index before each import session.
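The reason deletes get lost can be shown with a small simulation. This is not how Logstash or the JDBC importer are implemented; it is a sketch that assumes a polling importer selecting rows by a hypothetical `updated_at` tracking column, with a plain dict standing in for the Elasticsearch index.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE clients (id INTEGER PRIMARY KEY, name TEXT, updated_at INTEGER)"
)

index = {}      # stand-in for the Elasticsearch index, keyed by document id
last_seen = 0   # highest updated_at value imported so far

def poll() -> int:
    """Import rows changed since the previous poll and return how many were seen."""
    global last_seen
    rows = db.execute(
        "SELECT id, name, updated_at FROM clients "
        "WHERE updated_at > ? ORDER BY updated_at",
        (last_seen,),
    ).fetchall()
    for row_id, name, updated_at in rows:
        index[row_id] = {"type": "client", "name": name}
        last_seen = updated_at
    return len(rows)

db.execute("INSERT INTO clients VALUES (1, 'Acme', 1)")
db.execute("INSERT INTO clients VALUES (2, 'Globex', 2)")
first_poll = poll()    # both inserts are picked up

db.execute("UPDATE clients SET name = 'Acme Corp', updated_at = 3 WHERE id = 1")
db.execute("DELETE FROM clients WHERE id = 2")
second_poll = poll()   # the update is picked up; the deleted row no longer exists to be selected

# Documents still in the index although their row is gone from the database.
live_ids = {row[0] for row in db.execute("SELECT id FROM clients")}
stale_ids = sorted(set(index) - live_ids)
print(stale_ids)
```

A deleted row simply never matches the `WHERE updated_at > ?` query again, so the importer has nothing to react to and the document for client 2 stays in the index.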

To alleviate this, you can use another tool that reads the MySQL binlog and is therefore able to capture every event, including deletes. There's one written in Go, one in Java and one in Python.

UPDATE:

Here is another interesting blog article on the subject: How to keep Elasticsearch synchronized with a relational database using Logstash

Please take a look at Debezium. It's a change data capture (CDC) platform, which allows you to stream your data

I created a simple GitHub repository, which shows how it works with PostgreSQL and Elasticsearch
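To make the CDC idea concrete, here is a minimal sketch of a consumer that applies change events to a search index. It is not taken from the repository mentioned above. The events are simplified versions of the Debezium envelope (only the `op`, `before` and `after` fields), the sample rows are made up, and a dict stands in for Elasticsearch.

```python
def apply_change(index: dict, event: dict) -> None:
    """Apply one simplified Debezium-style change event to an in-memory index."""
    op = event["op"]
    if op in ("c", "r", "u"):   # create, snapshot read, update
        row = event["after"]
        index[row["id"]] = row  # upsert the new row state
    elif op == "d":             # delete: only the old row state is available
        index.pop(event["before"]["id"], None)
    else:
        raise ValueError(f"unsupported op: {op!r}")

# Hypothetical sample stream for a clients table.
events = [
    {"op": "c", "before": None, "after": {"id": 1, "name": "Acme"}},
    {"op": "c", "before": None, "after": {"id": 2, "name": "Globex"}},
    {"op": "u", "before": {"id": 1, "name": "Acme"},
     "after": {"id": 1, "name": "Acme Corp"}},
    {"op": "d", "before": {"id": 2, "name": "Globex"}, "after": None},
]

index = {}
for event in events:
    apply_change(index, event)

print(index)
```

Because the delete arrives as an explicit event, the index can drop document 2, which a polling import has no way to notice.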