
Sync elasticsearch & cassandra with postgres database

I want to sync two dependent databases (elasticsearch and cassandra) with my parent database, postgres. I am trying to implement the method described in this article: https://qafoo.com/blog/086_how_to_synchronize_a_database_with_elastic_search.html. So far I have come up with two methods:

  1. Sync before updating/inserting data into the dependent databases

      router.put('/account/edit', function(req, res) {
        syncElasticWithDatabase().then(() => {
          elastiClient.update({...});      // client for elasticsearch
          cassandraClient.execute({...});  // client for cassandra
          res.end();                       // end the response, not the request
        });
      });

syncElasticWithDatabase() replays the rows in the updates table (in postgres) before the new write goes through. This method can be slow, since some requests have to wait for syncElasticWithDatabase() to finish. I like it because it leverages sequential IDs (see the article for details): the data is synced before new data comes in, so the dependent databases catch up and only the missed rows are synced, preventing re-indexing/re-inserting, unlike option 2 below.
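For illustration, here is a minimal sketch of what syncElasticWithDatabase() could look like under method 1, assuming an "updates" table with a serial id column and a JSONB payload, plus a hypothetical getLastSyncedId() helper; none of these names come from the article:

    // Sketch only: replay rows from the postgres "updates" table that the
    // dependent stores have not seen yet. getLastSyncedId() is a hypothetical
    // helper that returns the highest sequence id already applied.
    async function syncElasticWithDatabase() {
      const lastId = await getLastSyncedId();
      const { rows } = await psqlClient.query(
        'SELECT id, payload FROM updates WHERE id > $1 ORDER BY id ASC',
        [lastId]
      );
      for (const row of rows) {
        // replay each missed update; cassandra would follow the same pattern
        await elastiClient.index({
          index: 'accounts',
          id: String(row.payload.document_id),
          body: row.payload
        });
      }
    }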

  2. Using a background process (e.g. running every 24 hours), I could sync data by selecting the "missed out" rows from an update_error table, which is populated whenever a write to elasticsearch or cassandra fails. Here's a rough example:

      router.put('/account/edit', function(req, res) {
        psqlClient.query('UPDATE ....').then(() => {
          elastiClient.update({...});      // client for elasticsearch
          cassandraClient.execute({...});  // client for cassandra
        }).catch(err => {
          psqlClient.query('INSERT INTO update_error ....');
        });
      });

    However, this method requires re-indexing or re-inserting data, because in some cases elasticsearch could insert the data while cassandra failed, or vice versa. Because of this I would need a separate column that records which database failed, so that for each database (elasticsearch or cassandra) I can select the rows that failed since its last synchronization time.
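To make that failure tracking concrete, the extra column could look something like this (a sketch; the table layout and the db_type column are my assumptions, not an established schema):

    // Sketch: record which dependent store failed so each one can be retried
    // independently. Table and column names are illustrative.
    async function retryFailedWrites(lastSyncTime) {
      await psqlClient.query(`
        CREATE TABLE IF NOT EXISTS update_error (
          id         SERIAL PRIMARY KEY,
          db_type    TEXT NOT NULL,        -- 'elasticsearch' or 'cassandra'
          payload    JSONB NOT NULL,       -- the data that failed to sync
          created_at TIMESTAMPTZ NOT NULL DEFAULT now()
        )`);

      // select only the rows that failed for one store since the last sync
      const { rows } = await psqlClient.query(
        'SELECT payload FROM update_error WHERE db_type = $1 AND created_at > $2',
        ['elasticsearch', lastSyncTime]
      );
      return rows;
    }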

Questions:

  1. Method 1 seems perfect, but it means some people would have to wait longer than others to update their account, because of syncElasticWithDatabase(). However, the article above appears to do exactly the same thing (look at their diagram), or am I misunderstanding something?

  2. Because of the delay described above (if I'm right about it), I introduced option 2. But it feels like a lot of machinery just to stay in sync, IMHO, and I've already spent a good amount of time thinking about this. Are there easier or better methods than 1 and 2?

  3. Would Apache ZooKeeper help in my case?

Thanks :)


Other references

Sync elasticsearch on connection with database - nodeJS

https://gocardless.com/blog/syncing-postgres-to-elasticsearch-lessons-learned/

Basically, you'll need to use the method described here https://qafoo.com/blog/086_how_to_synchronize_a_database_with_elastic_search.html and insert into & select from one shared database table. But make sure you limit how many rows you select from the "updates" table at a time, e.g. LIMIT 100.
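For concreteness, one possible shape for that single "updates" table (all column names here are illustrative, not prescribed by the article):

    // Sketch: one shared journal table; the SERIAL id doubles as the
    // sequence_id that elasticsearch/cassandra keep track of.
    psqlClient.query(`
      CREATE TABLE IF NOT EXISTS updates (
        id          SERIAL PRIMARY KEY,  -- monotonically increasing sequence_id
        document_id TEXT NOT NULL,       -- business key of the changed row
        payload     JSONB NOT NULL,      -- full document to replay
        deleted     BOOLEAN NOT NULL DEFAULT false  -- deletes are marked, not removed
      )`);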

Here's the workflow:

  1. Save data to the "updates" table during every insert/update (for a delete, make sure you mark the row as deleted in a column instead of removing it).
  2. Then run this process (a full sketch follows below):

    • Select your last inserted sequence_id from elasticsearch or Cassandra.
    • Use it to select data from the "updates" table, like so: id > :sequence_id.

You can then insert the data (into elasticsearch or cassandra) or do whatever you need. Make sure you insert into the "updates" table before writing to the dependent databases. There's no need to keep duplicate rows for the same document_id, so replace old entries with the new one. This gives you consistency and lets you choose between running a cron job or syncing during a specific action, all at once. Then update your sequence_id to the last one processed.
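Putting the workflow above together, one sync pass could look roughly like this (a sketch, assuming the 7.x elasticsearch JS client and that the last applied sequence id is kept in a one-document sync_state index; you could store it anywhere):

    // Sketch of one sync pass for elasticsearch; a cassandra pass would mirror it.
    async function sync() {
      // 1. read the last sequence id this store has applied
      let lastId = 0;
      try {
        const res = await elastiClient.get({ index: 'sync_state', id: 'es' });
        lastId = res.body._source.last_sequence_id;
      } catch (e) {
        // first run: no state document yet, start from 0
      }

      // 2. pull the next batch of missed updates, oldest first
      const { rows } = await psqlClient.query(
        'SELECT id, document_id, payload, deleted FROM updates ' +
        'WHERE id > $1 ORDER BY id ASC LIMIT 100',
        [lastId]
      );

      // 3. replay each row (index or delete), then advance the sequence id
      for (const row of rows) {
        if (row.deleted) {
          await elastiClient.delete({ index: 'accounts', id: row.document_id });
        } else {
          await elastiClient.index({ index: 'accounts', id: row.document_id, body: row.payload });
        }
        lastId = row.id;
      }
      await elastiClient.index({
        index: 'sync_state', id: 'es',
        body: { last_sequence_id: lastId }
      });
    }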

I chose to sync straight after each insert/update/delete to "updates": I call res.end() (or whatever finishes the response) and then run a sync() function that selects 100 new records in ascending order. I also run a cron job every 24 hours (without the LIMIT 100) to make sure any data that was left out gets synced. Oh, and if the updates succeeded for all databases, you might as well delete those records from "updates", unless you keep them for some other reason. But note that elasticsearch can lose data held in memory.
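One way to wire those two passes together (a sketch; the 24-hour pass uses a plain timer here, but any cron library would do):

    router.put('/account/edit', function(req, res) {
      psqlClient.query('INSERT INTO updates ....').then(() => {
        res.end();  // finish the response first
        sync();     // then catch the dependent stores up in the background
      });
    });

    // safety net every 24 hours; for this pass the LIMIT 100 can be dropped
    // or sync() looped until no rows remain
    setInterval(() => sync(), 24 * 60 * 60 * 1000);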

Good luck :) And I am open to suggestions.
