
Data streaming API - High availability

In my architecture on AWS, I have a service running on an EC2 instance which calls the Twitter streaming API for data ingestion, i.e., ingestion of real-time tweets. I call this service TwitterClient.

The Twitter API uses a kind of long polling over the HTTP protocol to deliver streaming data. The documentation says: a single connection is opened between your app (in my case, TwitterClient) and the API, with new tweets being sent through that connection.
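Consuming such a long-lived streaming connection typically means reading the response line by line, where idle periods are filled with keep-alive newlines. A minimal sketch (the URL and auth scheme are placeholders, not the exact Twitter endpoint, and the `requests` library is assumed to be installed):

```python
import json


def parse_stream_line(raw_line: bytes):
    """Decode one line from the streaming response.

    Keep-alive newlines sent on idle connections arrive as empty
    lines; those are skipped by returning None.
    """
    if not raw_line:
        return None
    return json.loads(raw_line)


def stream_tweets(url, bearer_token):
    """Open a single long-lived HTTP connection and yield tweets."""
    import requests  # third-party; assumed available

    headers = {"Authorization": f"Bearer {bearer_token}"}
    with requests.get(url, headers=headers, stream=True, timeout=90) as resp:
        resp.raise_for_status()
        for line in resp.iter_lines():
            tweet = parse_stream_line(line)
            if tweet is not None:
                yield tweet
```

The key point for the availability discussion below is that there is exactly one TCP connection per consumer, so whichever instance holds that connection is the one receiving tweets.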

TwitterClient then passes the real-time tweets to the backend (using Kinesis Data Streams) for processing.
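The handoff to Kinesis can be sketched like this (the stream name is a placeholder, and boto3 is assumed to be installed with credentials configured on the instance):

```python
import json


def build_record(tweet: dict) -> dict:
    """Build a Kinesis PutRecord payload from a tweet.

    Using the tweet ID as the partition key spreads tweets across
    shards while keeping retries of the same tweet on one shard.
    """
    return {
        "Data": json.dumps(tweet).encode("utf-8"),
        "PartitionKey": tweet["id_str"],
    }


def forward_to_kinesis(tweets, stream_name="tweets-stream"):
    """Push an iterable of tweets into a Kinesis Data Stream."""
    import boto3  # third-party; assumed available

    kinesis = boto3.client("kinesis")
    for tweet in tweets:
        record = build_record(tweet)
        kinesis.put_record(StreamName=stream_name, **record)
```

Partitioning by tweet ID also matters for deduplication later: all copies of a duplicated tweet land on the same shard and are seen by the same consumer.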

The problem I am facing is that running multiple EC2 instances in parallel will result in duplicate tweets being ingested, and each tweet will be processed several times. However, running only one EC2 instance creates a single point of failure.

I cannot afford downtime, as I can't miss a single tweet.

What should I do to ensure high availability?

Edit: Added a brief description of how the Twitter API delivers streaming data.

The simplest way to implement this is to run multiple EC2 instances in parallel, in different regions. You can certainly get more complex and use heartbeats between the instances, but this is probably over-engineering.

multiple EC2 instances in parallel will result in duplicate tweets being ingested and each tweet will be processed several times

Tweets have a unique message ID that can be used to deduplicate.
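As a sketch of that idea: an in-process set of seen IDs is enough for a single consumer, but with multiple EC2 instances the seen-ID store would need to be shared (for example a DynamoDB table with a conditional put, or Redis `SETNX`):

```python
class TweetDeduplicator:
    """Drop tweets whose ID has already been seen.

    In-memory version for a single consumer; a multi-instance
    deployment needs a shared store for the seen-ID check.
    """

    def __init__(self):
        self._seen = set()

    def accept(self, tweet: dict) -> bool:
        """Return True the first time a tweet ID is seen, else False."""
        tweet_id = tweet["id_str"]
        if tweet_id in self._seen:
            return False
        self._seen.add(tweet_id)
        return True
```

A consumer reading from Kinesis would call `accept()` before processing and simply skip tweets that return False.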

I can't miss a single tweet

This is your real problem. Twitter limits you to a certain number of requests per 15-minute period. Assuming that you have reasonable filter rules (i.e., you don't try to read the entire tweetstream, or even the tweetstream for a broad topic), then this should be sufficient to capture all tweets.

However, it may not be sufficient if you're running multiple instances. You could try using two API keys (assuming that Twitter allows that) or adjust your polling frequency to something that allows multiple instances to run concurrently.

Beware, however: as far as I know there are no guarantees. If you need guaranteed access to every relevant tweet, you would need to talk to Twitter (and be prepared to pay them for the privilege).

You can set up two EC2 instances behind a load balancer, keeping only one instance active at a time and the other passive (as a backup). The second becomes active when the first goes down.
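An active/passive pair needs some way to agree on which instance is currently active. One common pattern is a lease held in a shared store; the sketch below uses a DynamoDB conditional write, with the table name and key layout as illustrative assumptions:

```python
import time

LEASE_SECONDS = 30


def lease_expired(lease_expiry: float, now: float) -> bool:
    """The standby may take over once the active lease has lapsed."""
    return now >= lease_expiry


def try_acquire_lease(instance_id: str, table_name="twitter-client-lease"):
    """Attempt to become the active instance via a DynamoDB lock item.

    The conditional expression makes the write succeed only if no
    lease item exists yet or the previous holder's lease expired.
    """
    import boto3  # third-party; assumed available
    from botocore.exceptions import ClientError

    table = boto3.resource("dynamodb").Table(table_name)
    now = int(time.time())
    try:
        table.put_item(
            Item={"pk": "leader", "owner": instance_id,
                  "expires": now + LEASE_SECONDS},
            ConditionExpression="attribute_not_exists(pk) OR expires < :now",
            ExpressionAttributeValues={":now": now},
        )
        return True  # we are now the active instance
    except ClientError as err:
        if err.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return False  # another instance holds a live lease
        raise
```

The active instance would re-acquire the lease well before it expires; the passive instance calls `try_acquire_lease()` periodically and only opens its Twitter connection after winning, which avoids both instances streaming (and duplicating) at once.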
