
What is the best approach to getting data into S3 for Elasticsearch and RabbitMQ?

Original post: 2020-08-20 23:49:21. Tags: python, elasticsearch, rabbitmq, snowflake-cloud-data-platform, amazon-kinesis-firehose

At my company we have developed several games; some send their events to Elasticsearch and others to RabbitMQ. We have a local CLI that pulls the data from both and compiles the messages into Gzip-compressed JSON files, after which another CLI converts them to SQL statements and loads them into a local SQL Server. We now want to scale up, but the current setup is painful and nowhere near real-time for analysis.

I recently built a Python application that I plan to deploy in a Docker container on AWS. The script pulls data from Elasticsearch, compiles it into small compressed JSON files, and publishes them to an S3 bucket. From there the data is ingested into Snowflake for analysis. So far I have been able to get the data in quite quickly, and it looks promising as an alternative.
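For readers who want something concrete, the batching-and-upload step described above might look roughly like the sketch below. It is not the actual script: the newline-delimited JSON layout, the `events/part-NNNNN.json.gz` key naming, the batch size, and all function names are assumptions made for illustration. The upload goes through boto3's `put_object`, imported lazily so the pure-Python helpers can be tried without AWS credentials.

```python
import gzip
import json
from itertools import islice
from typing import Callable, Iterable, Iterator


def chunked(events: Iterable[dict], size: int) -> Iterator[list]:
    """Yield successive lists of at most `size` events."""
    it = iter(events)
    while batch := list(islice(it, size)):
        yield batch


def to_gzip_ndjson(batch: list) -> bytes:
    """Serialize a batch as newline-delimited JSON and gzip it."""
    lines = "\n".join(json.dumps(event, separators=(",", ":")) for event in batch)
    return gzip.compress((lines + "\n").encode("utf-8"))


def s3_uploader(bucket: str) -> Callable[[str, bytes], None]:
    """Return an upload callable backed by boto3 (assumed to be installed)."""
    import boto3  # imported lazily so the helpers above work without it

    client = boto3.client("s3")

    def upload(key: str, body: bytes) -> None:
        client.put_object(Bucket=bucket, Key=key, Body=body)

    return upload


def publish(events, upload, prefix="events", batch_size=1000) -> list:
    """Upload each batch under a hypothetical key layout; return the keys written."""
    keys = []
    for index, batch in enumerate(chunked(events, batch_size)):
        key = f"{prefix}/part-{index:05d}.json.gz"
        upload(key, to_gzip_ndjson(batch))
        keys.append(key)
    return keys
```

In use, `publish(hits, s3_uploader("my-bucket"))` would write one object per batch, where `hits` is any iterable of event dicts and `my-bucket` is a placeholder name. Passing the uploader as a callable keeps the serialization testable on its own.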

I was planning to do something similar for RabbitMQ, but I would like to find a better alternative that lets this ingestion happen seamlessly and spares me from implementing all sorts of exception handling in the Python code.

I've done some research and found there might be a way to link RabbitMQ to Amazon Kinesis Firehose. My question is: how would I send the stream from RabbitMQ to Kinesis?
For Elasticsearch, what is the best way to achieve this? I've read about the Logstash output plugin for S3 ( https://www.elastic.co/guide/en/logstash/current/plugins-outputs-s3.html ) and the Logstash input plugin for Kinesis ( https://www.elastic.co/guide/en/logstash/current/plugins-inputs-kinesis.html ). Which approach would be ideal for real-time ingestion?

1 Answer

My answer is fairly theoretical; it needs to be tested in the real world and adapted to your use case. For near-real-time behaviour, I would use Logstash with:

- the elasticsearch input on a short cron schedule (this post can help: https://serverfault.com/questions/946237/logstashs-elasticsearch-input-plugin-should-be-used-to-output-to-elasticsearch )
- the S3 output (supports gzip)
- maybe a JDBC output to your DB
- the RabbitMQ output plugin

You can build a more scalable architecture by outputting to RabbitMQ and using other pipelines to listen to the queue and execute other tasks.
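As one illustration of such a queue-listening pipeline, here is a minimal Python sketch that forwards RabbitMQ messages to Kinesis Firehose, which is what the question asks about. It is untested against real services and rests on several assumptions: the function names, the 5-second idle flush, and the retry count are made up for the example, and the consumer relies on pika's `BlockingChannel.consume` and boto3's `put_record_batch`. Firehose accepts at most 500 records per `PutRecordBatch` call; payload size limits are not checked here.

```python
from typing import Iterable, List

MAX_BATCH = 500  # PutRecordBatch accepts at most 500 records per call


def to_records(bodies: Iterable[bytes]) -> List[dict]:
    """Wrap raw message bodies as Firehose records, newline-terminated."""
    return [{"Data": body.rstrip(b"\n") + b"\n"} for body in bodies]


def send_batch(firehose, stream_name: str, records: List[dict],
               max_attempts: int = 3) -> List[dict]:
    """Send records, retrying only the failed ones.

    Returns the records still undelivered after max_attempts (empty on success).
    """
    pending = records
    for _ in range(max_attempts):
        if not pending:
            break
        response = firehose.put_record_batch(
            DeliveryStreamName=stream_name, Records=pending
        )
        if response["FailedPutCount"] == 0:
            return []
        # Responses come back in request order; failed entries carry ErrorCode.
        pending = [
            record
            for record, result in zip(pending, response["RequestResponses"])
            if "ErrorCode" in result
        ]
    return pending


def forward(amqp_url: str, queue: str, stream_name: str) -> None:
    """Consume a queue and flush to Firehose on a full batch or after idling."""
    import boto3  # both libraries are assumed to be installed
    import pika

    firehose = boto3.client("firehose")
    connection = pika.BlockingConnection(pika.URLParameters(amqp_url))
    channel = connection.channel()
    channel.basic_qos(prefetch_count=MAX_BATCH)
    bodies, last_tag = [], None
    # consume() yields (None, None, None) when inactivity_timeout elapses
    for method, _properties, body in channel.consume(queue, inactivity_timeout=5):
        if method is not None:
            bodies.append(body)
            last_tag = method.delivery_tag
        if bodies and (method is None or len(bodies) >= MAX_BATCH):
            if send_batch(firehose, stream_name, to_records(bodies)):
                # Something was not delivered: requeue the whole batch.
                channel.basic_nack(last_tag, multiple=True, requeue=True)
            else:
                channel.basic_ack(last_tag, multiple=True)
            bodies = []
```

Messages are acknowledged only after Firehose accepts them, so delivery is at-least-once: a partially failed batch is requeued in full and some events may arrive twice. Whether that is acceptable depends on how the downstream tables deduplicate.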